Hadoop
Hadoop - GlusterFS Resource Page
Hadoop (or the "Apache Hadoop software library" as it's officially called) is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
One of the factors that limits Hadoop's scale-out capabilities is the storage backend. HDFS is the filesystem used to store Hadoop data, and by default it keeps three copies of the data, which requires at least 3 nodes. HDFS also has some limitations when it comes to the amount of storage and the total number of storage nodes it can utilize.
With the GlusterFS connector, Hadoop can not only scale out beyond the limits of HDFS but also run on fewer storage nodes. GlusterFS replication needs a minimum of just 2 nodes, and HDFS can likewise be configured to replicate to only 2 nodes.
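To make the effect of the replica count concrete, here is a minimal sketch of the capacity arithmetic, assuming plain full replication where every piece of data is stored once per replica. The function name usable_capacity_tb and the cluster size (6 nodes with 4 TB each) are hypothetical and chosen only for illustration; they do not come from GlusterFS or Hadoop.

```python
def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Return usable capacity when all data is stored `replicas` times.

    Assumes plain full replication: each copy consumes the same raw space.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb / replicas


# Hypothetical cluster: 6 nodes with 4 TB each (illustrative numbers only).
raw_tb = 6 * 4

# Compare three copies (the HDFS default) with two copies.
for replicas in (3, 2):
    usable = usable_capacity_tb(raw_tb, replicas)
    print(f"replicas={replicas}: {usable:.1f} TB usable of {raw_tb} TB raw")
```

Under these assumptions, dropping from three copies to two raises usable space from a third to a half of the raw capacity, at the cost of tolerating one fewer failed copy.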
The Gluster Connector for Hadoop will be available in GlusterFS 3.3 beta.
Resources
- Apache Hadoop project home
- Community Q&A for GlusterFS Betas and Hadoop
- Download GlusterFS 3.3 with the Hadoop connector
- GlusterFS 3.3 Beta Resource Page


