Kicking the beast, Part 1: Sharing vs. Google’s shared-nothing Hadoop clusters
Why do Facebook, Google and other Internet giants employ a shared-nothing method of storage and computational resources? Simply put, they just didn’t start with a true enterprise environment.
Those companies started as a small cluster of servers (unlike what government agencies now deal with), and it made sense then. But they’ve since grown into having mega data centers without adapting their architectures. It’s not how a true enterprise works, and it’s not efficient — but don’t tell them that.
There’s definitely a fanaticism built around shared-nothing that is tough to break through.
Hadoop and the shared-nothing data center
Hadoop, originally funded by Yahoo!, emerged in 2006 and hit Web-scale capability in 2008. At its core, it’s an open-source MapReduce implementation that has the ability to take a dataset, divide the data, and run it in parallel over multiple nodes. Hadoop applies to nearly any market, from financial modeling to mission outcome forecasting.
Initially, Hadoop was designed to run on a large number of machines that don’t share memory or disks, like the shared-nothing model mentioned above. All processing would be done in self-contained units within the cluster, communicating over a common network but sharing no computing resources. The software breaks large datasets in smaller pieces and spreads it across the different servers. You run a job by querying each of the servers in the cluster, which compile the data and deliver it back to you, leveraging each server’s processing power.
It’s a great design, but doesn’t require shared-nothing to operate.
The problem with white boxes and being stingy
The functionality of shared storage and shared compute cannot be matched by throwing whitebox servers at the problem within shared-nothing architecture. These independent servers present several problems, including management issues, requiring additional data center space, lack of a single point of support for disparate parts and increased costs to get the functionality of tested, name-brand servers. Don’t miss that mention of more data center space. That’s a problem for federal agencies. When you use whitebox servers, you have to have a lot more floor space to rack all these servers that hold their own storage (and more of it, HDFS3 vs. HDFS2) and more servers because of the failure rates of whitebox servers.
In fact, Hadoop data center admins spend most of their time replacing whitebox servers or hard drives. When a drive fails, the entire node does go offline, but the typical Hadoop cluster can still operate. However, that copy of the data and processing power of the failed node isn’t available, impacting the efficiency of the system.