The difficult marriage of cloud and data-intensive apps

Author Pawel Plaszczak
Pawel also blogs regularly at bigdatamatters.com

In certain sectors that were the early adopters of Grids, migration to the Cloud is bound to happen soon.
Pharmaceuticals is a good example. As Bob Cohen pointed out in a recent presentation:

• Eli Lilly has already tried using the Amazon EC2 external Cloud.
• GlaxoSmithKline is looking at using internal Clouds.

As I remember, years ago Glaxo was among the early Grid users. Like many other pharmas, they used software from UnivaUD to distribute protein docking simulations over a large number of machines. Now UnivaUD also sells Cloud services.

This supposedly common movement from Grid to Cloud is thought-provoking. What does this “evolutionary step” really mean? Something conceptually quite simple: the Grid makes it possible to manage processes on many physical machines. The Cloud offers even greater potential: to manage processes on many virtual machines, or even to manage those virtual machines as if they were processes. This is what VMware vSphere offers, and what internally powers Amazon EC2. So:

Cloud = set of virtual machines managed by a scheduler (Grid).

All the products named above are great, if you want an internal Cloud. The thing is: in solving large data challenges, the Cloud is no less limited than its predecessor, the Grid. Chris Dagdigian of gridengine.info said at BioIT that he “solved real problems” on the Cloud. This is not surprising (BioTeam once teamed with Univa to demonstrate Grid Engine on AWS), but such a statement needs an explicit caveat: problems solvable on the Cloud are still a small subset of the world’s important data processing challenges.

Virtualized Cloud environments are perfectly isolated from each other. If you have one, pray that you only happen to compute tasks that can be domain-decomposed into millions of perfectly independent pieces.

Protein docking, mentioned earlier, is like that: thousands of simulations, one per set of chemical compounds, that do not need to communicate. But most apps, in most industry sectors (not just bioinformatics), do not share these characteristics: they require intensive database querying and/or data sharing. Genomics is like that. The Cloud will not help here; it may even make things more difficult. I also agree here with another of Chris’s statements: Cloud data ingest is a pain.
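
To make the distinction concrete, here is a minimal Python sketch of the two kinds of workload. It is only an illustration: the function names and the scoring rule are hypothetical stand-ins, not real docking or genomics code, and a local thread pool stands in for the machines that a Grid or Cloud scheduler would manage.

```python
from concurrent.futures import ThreadPoolExecutor


def score_compound(compound_id: int) -> float:
    """Hypothetical stand-in for one docking simulation."""
    # The result depends only on this task's own input:
    # no shared state and no messages to other tasks.
    return (compound_id * 37 % 101) / 100.0


def run_independent(compound_ids, workers=4):
    # Independent tasks can go to any worker, in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_compound, compound_ids))


def count_shared_lookups(records, reference):
    # Contrast: every record is checked against the same reference
    # data set, so that data must be shipped to, or queried from,
    # wherever the work runs.
    return sum(1 for record in records if record in reference)


scores = run_independent(range(8))
matches = count_shared_lookups([1, 2, 3], {2, 3, 4})
```

The first function scales by simply adding workers. The second is bounded by how quickly the shared reference data can reach each worker, which is exactly where isolated virtual machines do not help.
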
The answer to large data challenges is a puzzle of three pieces:

1. efficient distributed processing
2. efficient data provisioning
3. efficient storage

Workload management engines (Grids, Clouds, you name it) provide the first piece. Today’s data-intensive apps need the full stack: an efficient integration of (1), (2) and (3), a fully scalable data integration. Where does the challenge lie?
Efficient processing is easy. With many great scheduling vendors, this is not rocket science any more.
Efficient storage is becoming commonplace too, with interesting examples of federated storage distributions. The trick is in the middle layer: an efficient connection between these two. And that is really difficult. There is certainly no universal solution, but we have recently had some successes here.
