Data Management

Data management is an important area of Grid and Cloud computing research. It is concerned with the storage, access, translation and integration of data, and aims to answer questions such as:

  • Where should I put my data?
  • How should I get to it?
  • How do I present my data in a way others will understand?
  • How can I combine data from different places?

Requirements

We have identified several typical data management problems encountered by businesses:


These requirements are broad in scope and have been addressed to varying degrees by existing middleware. Each depends on what sort of data is involved and what will be done with that data. There is room both to improve the solutions currently available and to extend them. As with any area of new technology, there are often many systems, each of which provides only part of a solution. Although a solution may exist for each sub-problem, it is not always possible to produce the desired result with a single technology.

Common Capabilities

From these requirements, we derived the following "common capabilities", which are descriptions of architectures that would satisfy them:

The foundational common capability, which many of the others relate to and can use, is Access to a Remote Data Source. Many of the reasons that data is heterogeneous go hand in hand with data being held in different places, though the two are distinct problems. Homogenising data is only useful if the homogenised data can then be used to gain some benefit. Often this involves pulling data from multiple places, something fundamental to Grid computing. Similarly, data is often synchronised for disaster recovery and fault tolerance. The safest place to keep the separate data sources is therefore on physically distant machines, which in turn means accessing remote data sources. Federation is another capability that makes most sense when split across multiple systems. Combining access to remote systems with homogenised data access methods allows powerful federations to be built.
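The combination of remote access and homogenised data access described above can be illustrated with a minimal sketch. It is not taken from any of the components listed on this page. The names `SourceAdapter` and `federated_query` are hypothetical, and the "remote" fetch is assumed to be a plain callable standing in for a real network call.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


class SourceAdapter:
    """Wraps one (hypothetical) remote source and maps its native records
    onto a common schema, so callers see homogeneous data."""

    def __init__(self, name: str,
                 fetch: Callable[[], Iterable[object]],
                 to_common: Callable[[object], Record]) -> None:
        self.name = name
        self.fetch = fetch          # assumption: stands in for a remote call
        self.to_common = to_common  # maps one native record to the common schema

    def records(self) -> Iterator[Record]:
        for native in self.fetch():
            common = dict(self.to_common(native))
            common["source"] = self.name  # remember where the record came from
            yield common


def federated_query(adapters: Iterable[SourceAdapter],
                    predicate: Callable[[Record], bool]) -> List[Record]:
    """Run the same filter over every source and combine the results."""
    return [rec for adapter in adapters
            for rec in adapter.records() if predicate(rec)]
```

Under these assumptions, one adapter might wrap a source that returns dictionaries and another a source that returns tuples. A caller of `federated_query` sees only the common schema plus a `source` field, whatever each source's native format is.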

The problem becomes even worse if you want to replicate write access across multiple data sources. In that case every data source in effect becomes both a master (it communicates its updates to the other data sources) and a replica (it reflects the changes made at the other data sources). It then makes sense to split your data into distinct chunks, each of which only one data source can alter. For example, for certain applications it may be appropriate to partition data into distinct geographical sections.
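The partitioning idea above can be sketched as follows. This is an illustrative in-memory model only, assuming one owning data source per geographical region. `DataSource` and `PartitionedStore` are hypothetical names, and replication is modelled as a synchronous copy rather than a real network protocol.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """A hypothetical data source that may alter exactly one region's data."""
    name: str
    owned_region: str
    records: dict = field(default_factory=dict)  # (region, key) -> value


class PartitionedStore:
    """Accepts a write only from the source that owns the record's region,
    then copies the change to every other source, which acts as a replica."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.owner_by_region = {}
        for source in self.sources:
            if source.owned_region in self.owner_by_region:
                raise ValueError(f"region {source.owned_region!r} has two owners")
            self.owner_by_region[source.owned_region] = source

    def write(self, via, region, key, value):
        owner = self.owner_by_region.get(region)
        if owner is None:
            raise KeyError(f"no data source owns region {region!r}")
        if via is not owner:
            # Only the owner may alter this chunk, so conflicting
            # concurrent writes to the same record cannot arise.
            raise PermissionError(f"{via.name} does not own region {region!r}")
        owner.records[(region, key)] = value
        for source in self.sources:
            if source is not owner:
                source.records[(region, key)] = value  # replica copy
```

Because each chunk has a single writer, each source is a master for its own region and a replica for all the others, which avoids having to reconcile conflicting updates to the same record.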

Design Patterns

From these capabilities, we then created the following Design Patterns, which can be used to design components implementing (parts of) these capabilities:

Components

IT-tude.com hosts the following components which relate to these patterns, capabilities and requirements:

White Paper

More details on the integration of the data management components into a general framework can be found in the Data Management White Paper.