Data Management
Data management is an important area of Grid and Cloud computing research. It is concerned with the storage, access, translation and integration of data, and aims to answer questions like:
- Where should I put my data?
- How should I get to it?
- How do I present my data in a way others will understand?
- How can I combine data from different places?
Requirements
We have identified several typical data management problems encountered by businesses:
- Fast transfer of large files
- Accessing data from different locations
- Accessing heterogeneous data
- Replication of data for speed and robustness
- Federate a number of data sources
These requirements are broad in scope and have been addressed to varying degrees by existing middleware. How each is best met depends on what sort of data is involved and what will be done with it. There is room both to improve the solutions currently available and to extend them. As with any area of new technology, there are often many systems, each of which provides part of a solution. Though a solution may exist for each sub-problem, it is not always possible to use a single technology to produce the desired result.
Common Capabilities
From these requirements, we derived the following "common capabilities", which are descriptions of architectures that would satisfy them:
- Avoid Transferring Data
- Access to a Remote Data Source
- Homogenise Data Sources
- Synchronise Multiple Data Sources
- Treat Multiple Data Sources as One
The foundational common capability, which many of the others relate to and can build on, is Access to a Remote Data Source. Many of the reasons that data is heterogeneous go hand in hand with data being held in different places, though the two are distinct problems. Homogenising data is only really useful if the homogenised data can then be put to use. Often this involves pulling data from multiple places, something fundamental to Grid computing. Similarly, data is often synchronised for disaster recovery and fault tolerance, so the safest place to keep the separate data sources is on physically distant machines, which again means accessing remote data sources. Federation is another capability that makes most sense when split across multiple systems: combining access to remote systems with homogenised data access methods allows powerful federations to be built.
Synchronisation becomes harder still if you want to replicate write access across multiple data sources. In that case all data sources in effect become both masters (they communicate their updates to other data sources) and replicas (they reflect the changes made to other data sources). It then makes sense to split your data into distinct chunks, each of which only one data source can alter. For example, for certain applications it may be appropriate to partition data into distinct geographical sections.
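The partitioning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataSource` class, the `connect` helper, the region labels and the in-memory dictionary standing in for a database are all hypothetical and are not taken from any of the components listed on this page.

```python
class NotOwnerError(Exception):
    """Raised when a data source tries to alter a partition it does not own."""


class DataSource:
    """Hypothetical data source: master for one region, replica for the rest."""

    def __init__(self, name, owned_region):
        self.name = name
        self.owned_region = owned_region  # the only partition this source may alter
        self.rows = {}                    # (region, key) -> value, for all regions
        self.peers = []                   # other sources that receive our updates

    def write(self, region, key, value):
        # Master role: accept writes only for the owned partition,
        # then communicate the update to every other data source.
        if region != self.owned_region:
            raise NotOwnerError(f"{self.name} may not alter region {region!r}")
        self.rows[(region, key)] = value
        for peer in self.peers:
            peer.apply_update(region, key, value)

    def apply_update(self, region, key, value):
        # Replica role: reflect a change made at another data source.
        self.rows[(region, key)] = value


def connect(*sources):
    """Make every source a peer of every other source."""
    for source in sources:
        source.peers = [other for other in sources if other is not source]
```

With two sources created as `DataSource("eu-db", "EU")` and `DataSource("asia-db", "ASIA")` and joined with `connect`, each accepts writes only for its own region, yet both end up holding the full data set. Because only one source can alter any given chunk, two sources can never make conflicting updates to the same record.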
Design Patterns
From these capabilities, we then created the following Design Patterns, which can be used to design components implementing (parts of) these capabilities:
Components
IT-tude.com hosts the following components which relate to these patterns, capabilities and requirements:
- OGSA-DAI Trigger Mechanism
  - Provides a mechanism to notify an OGSA-DAI server when a table in an SQL database has had its rows modified. An example use of this is to keep multiple heterogeneous databases synchronised.
  - Design Patterns: Primary-Secondary Replicator
  - Common Capabilities: Access to a Remote Data Source, Synchronise Multiple Data Sources
  - Requirements: Accessing Data from Different Locations, Replication of Data for Speed and Robustness
- OGSA-DAI Data Publisher
  - A GUI-based application for installing and configuring an OGSA-DAI server and the data resources it exposes.
  - Design Patterns: Data Source Publisher
  - Common Capabilities: Access to a Remote Data Source
  - Requirements: Accessing Data from Different Locations
- OGSA-DAI JDBC Driver
  - A JDBC driver for OGSA-DAI which acts as an interface between OGSA-DAI and the client application, supporting SQL queries and updates.
  - Design Patterns: Data Federation Pattern
  - Common Capabilities: Access to a Remote Data Source, Homogenise Data Sources, Treat Multiple Data Sources as One
  - Requirements: Accessing Heterogeneous Data, Federate a Number of Data Sources
- OGSA-DAI Query Translator
  - Takes a generic database query and translates it into queries specific to given data sources, and also translates the results from each data source back into a common format.
  - Design Patterns: Query Translator Pattern
  - Common Capabilities: Homogenise Data Sources, Treat Multiple Data Sources as One
  - Requirements: Accessing Heterogeneous Data
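The two translation steps described for the Query Translator (generic query to source-specific query, then source-specific results back to a common format) can be illustrated with a minimal Python sketch. It is not OGSA-DAI code and does not use the OGSA-DAI API. The `GenericQuery` and `SourceAdapter` names, the table and column names, and the two row-limit styles are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericQuery:
    """Hypothetical source-independent query: logical field names plus a row limit."""
    fields: tuple
    limit: int


class SourceAdapter:
    """Hypothetical translator for one data source."""

    def __init__(self, table, columns, limit_style):
        self.table = table              # physical table name at this source
        self.columns = columns          # logical field name -> physical column name
        self.limit_style = limit_style  # "limit" or "fetch_first" (assumed dialects)

    def translate(self, query):
        # Generic query -> SQL text in this source's own schema and dialect.
        cols = ", ".join(self.columns[f] for f in query.fields)
        sql = f"SELECT {cols} FROM {self.table}"
        if self.limit_style == "limit":
            return f"{sql} LIMIT {query.limit}"
        return f"{sql} FETCH FIRST {query.limit} ROWS ONLY"

    def to_common(self, query, row):
        # Source-specific result row -> record keyed by the logical field names.
        return dict(zip(query.fields, row))
```

A client would build one `GenericQuery`, pass it to the adapter of each data source, run the resulting SQL at that source, and feed each returned row through `to_common`. The client then sees records with the same field names regardless of which source they came from.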
White Paper
More details on the integration of the data management components into a general framework can be found in the Data Management White Paper.









