Author Pawel Plaszczak
Pawel also blogs regularly at bigdatamatters.com
In certain sectors that were the early adopters of Grids, migration to the Cloud is bound to happen soon.
Pharmaceuticals is a good example. As Bob Cohen pointed out in a recent presentation:
• Eli Lilly has already tried using the Amazon EC2 external Cloud.
• GlaxoSmithKline is looking at using internal Clouds.
As I remember, years ago Glaxo was among the early Grid users. Like many other pharmas, they used
software from UnivaUD to distribute protein docking simulations over a large number of machines. Now UnivaUD also sells Cloud services.
This supposedly common movement from Grid to Cloud is thought-provoking. What does this “evolutionary step” really mean? Something conceptually quite simple: the Grid makes it possible to manage processes on many physical machines. The Cloud offers even greater potential: to manage processes on many virtual machines, or even to manage those virtual machines as if they were processes. This is what VMware vSphere offers, and what internally powers Amazon EC2. So:
Cloud = set of virtual machines managed by a scheduler (Grid).
All of the products named above are great, if you want an internal Cloud. The thing is: in solving large data challenges, the Cloud is no less limited than its predecessor, the Grid. Chris Dagdigian of gridengine.info said at BioIT that he “solved real problems” on the Cloud. This is not surprising (BioTeam once teamed with Univa to demonstrate Grid Engine on AWS), but such a statement needs an explicit remark: problems solvable on the Cloud are still a small subset of the world’s important data processing challenges.
Virtualized Cloud environments are perfectly isolated from each other. If you have one, pray that your computing tasks happen to be the kind that can be domain-decomposed into millions of perfectly independent pieces.
Protein docking, mentioned earlier, is like that: thousands of simulations, one per set of chemical
compounds, that do not need to communicate. But most apps, in most industry sectors (not just bioinformatics), do not share these characteristics: they require intensive database querying and/or data sharing. Genomics is like that. The Cloud will not help here. Clouds may even make it more difficult. I also agree here with another of Chris’s statements: Cloud data ingest is a pain.
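To make the "perfectly independent pieces" case concrete, here is a minimal Python sketch. It is only an illustration: score_compound is a hypothetical stand-in for one docking simulation, and a local thread pool plays the role that a Grid or Cloud scheduler would play across many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def score_compound(compound_id: int) -> int:
    """Hypothetical stand-in for one docking simulation.

    The result depends only on the task's own input, so the task never
    needs to talk to any other task or to a shared database.
    """
    return (compound_id * 31 + 7) % 101


def run_independent(compound_ids, workers: int = 4):
    # Because the tasks are independent, any worker can take any task in
    # any order; the pool here is an assumed stand-in for a scheduler.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever the completion order.
        return list(pool.map(score_compound, compound_ids))


scores = run_independent(range(8))
```

A workload that had to query a shared database or exchange data between tasks would not fit this pattern, which is exactly the case where the scheduler alone stops helping.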
The answer to large data challenges is a puzzle of three pieces:
1. efficient distributed processing
2. efficient data provisioning
3. efficient storage
Workload management engines (Grids, Clouds, you name it) provide only the first piece. Today’s data-intensive apps need the full stack: an efficient integration of (1), (2) and (3), that is, fully scalable data integration. Where does the challenge lie?
Efficient processing is easy. With many great scheduling vendors, this is not rocket science any more.
Efficient storage is becoming commonplace too, with interesting examples of federated storage distributions. The trick is in the middle layer: an efficient connection between these two. And that is really difficult. There is certainly no universal solution, but we have recently had some successes here.