DiGS

Overview

DiGS is a distributed data management system that combines commodity storage resources, such as RAID systems and Storage Area Networks, into a large-scale, unified file repository. The repository is presented to the end user through an easy-to-use, lightweight client toolkit.

The DiGS application is built on top of the Globus Toolkit.

The key features of DiGS are:

  • capacity for multi-terabyte, distributed storage
  • robustness with automatic and transparent replication of data across multiple sites
  • continuous data validation and consistency checking
  • support for bulk data transport operations
  • security based on standards-compliant techniques for authentication and authorisation
  • capability for data provenance with application-specific metadata
  • simple and intuitive client tools
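
The validation and replication features listed above can be illustrated with a minimal sketch. It assumes that a checksum is recorded when a dataset is first stored and that each replica is later compared against it; the source does not describe how DiGS actually validates data, so the function names, site names, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def checksum(data: bytes) -> str:
    """Return a hex digest that can be used to detect a corrupted copy."""
    # SHA-256 is an assumption for this sketch, not the algorithm DiGS uses.
    return hashlib.sha256(data).hexdigest()


def find_bad_replicas(expected: str, replicas: Dict[str, Optional[bytes]]) -> List[str]:
    """Return the sites whose copy is missing (None) or fails the checksum."""
    bad = []
    for site, data in replicas.items():
        if data is None or checksum(data) != expected:
            bad.append(site)
    return sorted(bad)


# Hypothetical example: one good copy, one corrupted copy, one missing copy.
original = b"lattice configuration"
expected = checksum(original)
replicas = {
    "site-a": original,
    "site-b": b"lattice c0nfiguration",
    "site-c": None,
}
bad_sites = find_bad_replicas(expected, replicas)
print(bad_sites)
```

In a scheme like this, the sites reported as bad would be candidates for re-replication from a site that still holds a valid copy.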

More information on the open source DiGS software can be found at http://www2.epcc.ed.ac.uk/~digs/

High Level Architecture

The diagram shows the architecture of a typical DiGS deployment.

The Storage Element provides disk-based storage space to the data grid for holding copies of user data.

The Control Node hosts a persistent agent (commonly referred to as the Control Thread) that handles user requests for datasets, mapping data identifiers to the locations where the data is available. The locations can change over time and the current status is maintained in the File Catalogue. The Control Node also continuously tests and validates the integrity and availability of the contents of the data grid.
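
The mapping from data identifiers to locations can be pictured with a minimal in-memory sketch. It is only an illustration of the idea: the class name, method names, identifiers, and host names below are hypothetical, and the real File Catalogue is held in a database on the Control Node.

```python
from typing import Dict, List, Set


class FileCatalogue:
    """Maps a logical data identifier to the Storage Elements holding a copy."""

    def __init__(self) -> None:
        self._locations: Dict[str, Set[str]] = {}

    def add_replica(self, identifier: str, storage_element: str) -> None:
        """Record that a Storage Element now holds a copy of the dataset."""
        self._locations.setdefault(identifier, set()).add(storage_element)

    def remove_replica(self, identifier: str, storage_element: str) -> None:
        """Forget a copy, for example after it fails validation."""
        self._locations.get(identifier, set()).discard(storage_element)

    def locate(self, identifier: str) -> List[str]:
        """Return the current locations of a dataset, sorted for stable output."""
        return sorted(self._locations.get(identifier, set()))


# Hypothetical identifiers and hosts, used only to show locations changing.
catalogue = FileCatalogue()
catalogue.add_replica("ukqcd/config-0001", "se1.example.org")
catalogue.add_replica("ukqcd/config-0001", "se2.example.org")
catalogue.remove_replica("ukqcd/config-0001", "se1.example.org")
print(catalogue.locate("ukqcd/config-0001"))
```

The user only ever refers to the identifier; which Storage Elements currently hold the data is resolved at request time.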

The Backup Node provides a subset of the functionality of the Control Node, so that users can still retrieve data from the grid if the Control Node fails.
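
The failover behaviour described above can be sketched from the client's point of view. This is a sketch under stated assumptions, not the DiGS client's actual logic: the exception type, the lookup functions, and the host name are hypothetical stand-ins for requests to the Control Node and the Backup Node.

```python
from typing import Callable, List, Optional, Sequence


class NodeUnavailable(Exception):
    """Raised by a lookup when the node it targets cannot be reached."""


def locate_with_failover(
    identifier: str, nodes: Sequence[Callable[[str], List[str]]]
) -> List[str]:
    """Try each node in order and return the first successful answer."""
    last_error: Optional[Exception] = None
    for lookup in nodes:
        try:
            return lookup(identifier)
        except NodeUnavailable as exc:
            last_error = exc  # remember the failure and try the next node
    raise NodeUnavailable("no node could answer the request") from last_error


# Hypothetical stand-ins: the Control Node is down, the Backup Node answers.
def control_node(identifier: str) -> List[str]:
    raise NodeUnavailable("control node is down")


def backup_node(identifier: str) -> List[str]:
    return ["se2.example.org"]


locations = locate_with_failover("ukqcd/config-0001", [control_node, backup_node])
print(locations)
```

Listing the Control Node first keeps it as the normal path; the Backup Node is consulted only after the first lookup fails.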

Usage Scenario

DiGS should be considered for any project that requires fault-tolerant, secure sharing of data across several distributed sites. For example, DiGS could have provided the framework for the FilmGrid project, which centred on sharing data between geographically distributed organisations involved in film post-production. A post-production can involve many terabytes of data shared between companies spread across Europe and the USA. DiGS would be able to handle data volumes on this scale and the strict security requirements of a film production.

The original use case that drove the development of DiGS comes from computational particle physics. Since 2002, the DiGS team has been working with the UK computational particle physics (CPP) community to develop the current version of the application.

Alongside the development of the software, the DiGS team has worked with the CPP community to develop internationally agreed standards for the acquisition, curation, and analysis of the raw data. The result of this effort is the DiGS-powered UKQCD Data Grid, which:

  • provides ~100 terabytes of storage space,
  • spans 8 institutions in the UK and USA, and
  • hosts around 60,000 datasets.

See the GridPP website for more information.

In general, DiGS should be considered for any project with distributed storage requirements, especially where robustness and security are important.

Dependencies

Each server the software is installed on must have:

  • Globus Toolkit 4.0 or greater
  • X.509 certificates
  • Java 1.4 or greater

In addition, the Control Node and Backup Node must have:

  • An XML:DB-compliant database (such as eXist)

Interface

DiGS is accessible via a generic command-line interface and a Java-based graphical client. The graphical client is currently a technology preview and is not installed by default. The client is shown in the following screenshot.

A video of the GUI in action is also available.

Further Details

More information on DiGS can be found on the DiGS website and the NeSCForge project page.