Google Breaks Down
Editor Deb Donston has written an eye-opening editorial in the May 18th, 2009 issue of eWeek: “Oh my gosh, can we really afford to go to Cloud Computing if even Google can go out for a portion of the day?” First there is utter disbelief: this must be the network or some other software. Then there is the confirmation that indeed, it is Google. And then there is the inevitable conclusion. If Google, the Masters of the Cloud, the Wizards of the Super Data Centers Worldwide, is vulnerable to outages, what hope is there for the Salesforce.coms or Zohos or Amazons of the Cloud world to deliver and maintain uninterrupted service? In short, if Google is vulnerable, how long before the whole edifice of Cloud Computing is declared risky, subject to Black Swan events, and therefore capable of falling like a House of Cards? That is what happened to the Wall Street Masters of the Financial Universe, whose spasms of internecine greed brought about the near collapse of Financial Markets in ugly slow motion throughout all of 2008. Are Google’s more frequent and more widespread outages indicative of the inherent risks in Cloud Computing? Yes, but not just because of Google or any other Cloud Computing provider.
The problem is that the Cloud faces three sources of Risk: 1) Internal Provider Risk, where a provider’s systems fail because of failures to scale, Black Swan catastrophic internal events including marginal programmatic bugs, and uncontrolled change events; 2) External Natural Risk, where systems fail due to natural calamities such as earthquakes, floods, tornadoes, hurricanes, ice storms, solar flares and other massive climatic, astronomical, or geologic occurrences wiping out key Cloud Nodes; and 3) External Perverse Risk, where hostile agents unleash fire, water, bombs, programmatic worms, denial of service attacks, and other deliberate destructive vectors on key facilities or systems that support Cloud Operations and Services. Because Cloud Systems are distributed over so many network links, data nodes, and processing points, their Vulnerability Profile and MTTR (Mean Time To Repair) can be severely compromised.
This is somewhat comparable to my own backup system. I currently back up 3 systems to the same hard drive, so all 3 depend on the MTBF (Mean Time Between Failures) of a single SATA hard drive, which is rated at 100K hours. Once a month I back up the backup drive to a second drive, so my average exposure is half a month’s work. What I really should be doing is backing up to a RAID drive system which is in turn backed up to a second RAID drive system. This would considerably reduce my exposure to loss, but time, cost, and other factors have prevented that from happening. Still, my Risk Reduction scheme is trivial in comparison to the Cloud’s.
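To put rough numbers on that home setup, here is a minimal Python sketch. It assumes the 100K MTBF figure is in hours, that the drive fails at a constant rate (an exponential model, which real drives only approximate), and that a failure can strike at any moment between monthly copies. The function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365

def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that a drive fails within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def mean_exposure_days(backup_interval_days: float) -> float:
    """Average age of the newest copy when a failure strikes at a random moment."""
    return backup_interval_days / 2.0

# Assumption: the "100K" MTBF is in hours.
p_fail = annual_failure_probability(100_000)
print(f"chance the backup drive fails within a year: {p_fail:.1%}")
print(f"average exposure with a monthly copy: {mean_exposure_days(30):.0f} days")
```

Under these assumptions the single backup drive has roughly an 8% chance of failing in any given year, and the monthly copy leaves about 15 days of work at risk on average.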
Backup in the Cloud
Remember that the vaunted strength of the Cloud is that the network constantly adapts to outages; but that strength is also a vulnerability, particularly for specific systems. For example, say 75% of the systems affected by a Cloud Outage are able to re-route themselves around it within milliseconds, another 8% within seconds, and another 7% within minutes, but the remaining 10% face hours to days of downtime. If yours is one of that last ten percent, your operations are in for lots of workaround pain, depending on what Cloud Backup has been provided. But that very Cloud Backup is subject to its own complicated risks. Cloud Backup is not as simple as my or your personal backup situation.
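The 10% figure looks small until an operation depends on more than one Cloud service. The sketch below takes the hypothetical 10% slow-recovery share from the example above and assumes, purely for illustration, that each service you rely on lands in that slow tail independently of the others. The function name is made up.

```python
def chance_of_slow_recovery(services: int, slow_fraction: float = 0.10) -> float:
    """Probability that at least one of `services` falls in the slow-recovery tail.

    Assumes each service independently has a `slow_fraction` chance of being
    among the systems that take hours to days to come back.
    """
    return 1.0 - (1.0 - slow_fraction) ** services

for n in (1, 3, 5, 10):
    share = chance_of_slow_recovery(n)
    print(f"{n:2d} services: {share:.0%} chance that at least one is down for hours to days")
```

With five such services the chance that at least one of them is stuck in the slow tail is already about 41%, which is why the last ten percent matters more than it first appears.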
The Cloud is constantly changing: new nodes and links come online, change capacity, and disappear. This means that the critical resources your Cloud Backup System depends on are constantly changing in 2 ways: the resources your systems use, which have to be backed up, change; and the resources that the Cloud Backup System itself uses are also subject to change. This increases the Vulnerability Profile for Cloud System Backup and Recovery. Here Wall Street’s Financial Fiasco is most instructive. Our financial system was depending on complex financial instruments and derivatives valued in the tens of trillions of dollars which a) in many cases were not stress tested at all for unlikely but still possible events and occurrences and b) were hard to understand and value for all but a very few, and even among these elites there may not have been total agreement. Hence the Financial Meltdown, aided and abetted by a liberal seasoning of greed. The Cloud Computing equivalent would likely be less pervasive, but for the 10% who are victims of a massive Cloud outage, the repercussions could be just as catastrophic.
Offline as Backup
However, Cloud Computing has another potential backup provision: enabling online/offline operation for online systems. Think in terms of Google Gears, Adobe AIR, Lotus Notes and other systems that let applications run both online and offline with the same basic functionality and features. This has the advantage that when users cannot connect, whether to the net or to a local system, they can still be productive and then reconcile with the other system when the connection is restored. At that point the data is resynched between the two systems; here some classic refresh/reconciliation algorithms are being supplemented by new Operational Transformation methods (see Google Wave for example). So the opportunity to provide RAID-like backup for systems will be available to Cloud Providers and Users. The one problem is that adoption and refinement of Google Gears, Adobe AIR, Lotus Notes and other synchronizing systems has been mixed at best. Perhaps the Google Outage event that Deb Donston describes will direct more attention to both Cloud Backup mechanisms. As an avowed desktop user I look forward to the full flourishing of the online plus offline mode of operation for more systems.
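For readers who want to see what the resynch step might look like, here is a minimal Python sketch of one classic reconciliation approach, last-writer-wins by timestamp. It is not the Google Gears, Adobe AIR, or Lotus Notes API, and it is not Operational Transformation; every class and method name here is hypothetical. It also assumes the clocks on both sides are comparable, which real systems cannot take for granted.

```python
from dataclasses import dataclass, field

@dataclass
class Edit:
    key: str
    value: str
    timestamp: float  # when the edit was made (assumes comparable clocks)

@dataclass
class Store:
    """A key-value store that remembers when each key was last written."""
    data: dict = field(default_factory=dict)    # key -> value
    stamps: dict = field(default_factory=dict)  # key -> timestamp of last write

    def apply(self, edit: Edit) -> bool:
        """Last-writer-wins: accept the edit only if it is newer than what we hold."""
        if edit.timestamp > self.stamps.get(edit.key, float("-inf")):
            self.data[edit.key] = edit.value
            self.stamps[edit.key] = edit.timestamp
            return True
        return False

class OfflineClient:
    """Works against a local copy and queues edits while disconnected."""

    def __init__(self, server: Store):
        self.server = server
        self.local = Store()
        self.pending = []  # edits made while offline, in order
        self.online = True

    def go_offline(self):
        self.online = False

    def write(self, key: str, value: str, timestamp: float):
        edit = Edit(key, value, timestamp)
        self.local.apply(edit)  # the user keeps working either way
        if self.online:
            self.server.apply(edit)
        else:
            self.pending.append(edit)

    def reconnect(self):
        """Replay queued edits to the server, then pull down anything newer."""
        self.online = True
        for edit in self.pending:
            self.server.apply(edit)
        self.pending.clear()
        for key, stamp in self.server.stamps.items():
            self.local.apply(Edit(key, self.server.data[key], stamp))

server = Store()
laptop = OfflineClient(server)
laptop.write("report", "draft 1", timestamp=1.0)
laptop.go_offline()
laptop.write("report", "draft 2", timestamp=2.0)  # made while disconnected
laptop.reconnect()
print(server.data["report"])
```

Last-writer-wins silently discards the older of two conflicting edits, which is exactly the weakness that Operational Transformation methods aim to address by merging concurrent changes instead.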