Jeff is Chief Scientist at Cloudera, which helps enterprises with Hadoop implementations.
Hadoop consists of three modules, which are apparently in the process of being split out into separate Apache projects:
- Hadoop Distributed File System (HDFS)
- MapReduce (see the sketch after this list)
- Common (aka Hadoop Core)
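Since MapReduce is the module that does the actual computation, here's a minimal sketch of the canonical word-count job against the org.apache.hadoop.mapreduce Java API (the standard textbook example, not code from the talk): the map step emits (word, 1) pairs, the framework shuffles them by key, and the reduce step sums the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts that the shuffle grouped under each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be launched with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`.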
I’ll just mention some of the interesting little tidbits from the presentation:
- Standard box spec is 1U, 2x quad-core CPUs, 8 GB RAM, 4x 1 TB SATA drives (7200 RPM).
HDFS:
- Stores files as 128 MB blocks and replicates each block (3x by default)
- Good for large files that are written once and read many times (see the API sketch after this list)
- Throughput scales nearly linearly
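To make the HDFS bullets concrete, here's a small sketch (my illustration, not from the talk) using the org.apache.hadoop.fs.FileSystem Java API; the path is hypothetical, and the block splitting and replication happen transparently underneath these calls:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath;
    // fs.defaultFS there points at the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write once: the file is split into blocks (128 MB per the talk)
    // and each block is replicated across DataNodes.
    Path path = new Path("/user/demo/events.log");  // hypothetical path
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("one line of data\n");
    }

    // Read many times: reads stream block by block from nearby replicas.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```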
Some examples of projects in the Hadoop ecosystem:
- Avro – cross-language data serialization
- HBase – a distributed, column-oriented store modeled on Google's BigTable
- Hive – an interesting open-source data warehouse with a SQL-like query interface
- ZooKeeper – a coordination service for distributed applications (example below)
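As a flavor of what ZooKeeper usage looks like, here's a small, illustrative sketch (not from the talk) that registers an ephemeral node, the primitive behind distributed locks and group membership; the connection string and paths are hypothetical:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PresenceExample {
  public static void main(String[] args) throws Exception {
    // Connect to a (hypothetical) local ensemble and wait for the session.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Make sure the (hypothetical) parent path exists as a persistent node.
    if (zk.exists("/workers", false) == null) {
      zk.create("/workers", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // An ephemeral sequential node disappears automatically when this
    // client's session ends -- the building block for locks and membership.
    String path = zk.create("/workers/worker-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("Registered as " + path);

    zk.close();
  }
}
```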
Hadoop @ Yahoo: 16 clusters, each one 2.5 PB and 1,400 nodes
Cloudera maintains convenient, stable Hadoop packages (all open source), so you don't have to figure out for yourself which version of each subproject works best with the others.
Testing: Hadoop has a standalone (local) mode that runs the entire job, with a single reducer, in one JVM, which makes it handy for testing.
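For reference, this is roughly what wiring a job up for local mode looks like; it's my sketch, and the configuration keys shown are the modern names (older releases used mapred.job.tracker=local and fs.default.name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalModeJob {
  // Build a Job that runs entirely in this JVM against the local filesystem,
  // with a single reducer -- convenient for quick tests without a cluster.
  public static Job localJob(String name) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "local"); // run in-process, no cluster
    conf.set("fs.defaultFS", "file:///");          // local filesystem, not HDFS
    conf.setInt("mapreduce.job.reduces", 1);       // single reducer
    return Job.getInstance(conf, name);
  }
}
```

The returned Job can then be wired up with the same setMapperClass/setReducerClass calls as in the word-count sketch above.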
Jeff mentioned that they use Facebook’s Scribe for distributed logging.
And last but not least, Cloudera has a GetSatisfaction page.