The Hadoop Distributed File System (HDFS) is highly fault-tolerant and was designed to be deployed on commodity hardware. It has a master/slave architecture, with a single NameNode acting as the master and multiple DataNodes as slaves. HDFS relies heavily on the NameNode, which is why the NameNode is popularly known as the Achilles' heel of Hadoop. Besides the Secondary NameNode, ongoing projects such as HDFS HA and HDFS Federation address this problem.
We wanted a drop-in replacement for HDFS that does not have these problems. A masterless peer-to-peer architecture would resolve the problems related to NameNode failure. To build one, we would need a cluster of nodes and a mechanism to handle replication, distribution, and so on within the cluster.
We opted to rely on Cassandra to handle the cluster, data distribution, and replication, for the following reasons:
Cassandra has a peer-to-peer distributed "ring" architecture that is elegant and easy to set up and maintain. In Cassandra, all nodes are equal; there is no concept of a master node, and nodes communicate with each other via a gossip protocol. This makes the cluster decentralized and symmetrical, with no single point of failure.
It performs blazingly fast writes and can store hundreds of terabytes of data.
It is highly available and offers a schema-free data model.
There is proof of it working efficiently for a similar use case: the Cassandra File System (CFS).
We built "SnackFS" – a distributed file system with Cassandra as the backend storage – to be a drop-in replacement for HDFS. It is written in Scala and is fully compatible with the Hadoop FS Shell.
Using SnackFS is simple: it comes bundled with all its dependencies and requires only an active Cassandra cluster.
To run hadoop fs commands, instead of
hadoop fs -ls hdfs:///
use
snackfs fs -ls snackfs:///
With SnackFS it is not possible to change the replication factor of individual files, but you can have multiple namespaces with different replication factors. SnackFS is highly configurable: the replication factor, block size, sub-block size, Cassandra server, consistency level for read/write operations, keyspace name, and replication strategy can all be set in core-site.xml. This helps in tuning Cassandra, and in turn SnackFS, for different applications.
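As a sketch, such a core-site.xml override could look like the following. The property names and values below are illustrative assumptions, not the exact keys SnackFS reads; consult the SnackFS documentation for the real ones.

```xml
<!-- Illustrative core-site.xml fragment; property names are hypothetical -->
<configuration>
  <property>
    <name>snackfs.cassandra.host</name>
    <value>127.0.0.1</value> <!-- address of a node in the Cassandra cluster -->
  </property>
  <property>
    <name>snackfs.keyspace</name>
    <value>snackfs</value> <!-- keyspace used as the backing store -->
  </property>
  <property>
    <name>snackfs.replicationFactor</name>
    <value>3</value> <!-- copies of each block kept in the cluster -->
  </property>
  <property>
    <name>snackfs.consistencyLevel.write</name>
    <value>QUORUM</value> <!-- Cassandra consistency level for writes -->
  </property>
</configuration>
```

Keeping these settings per namespace is what allows different applications to trade durability against write latency on the same cluster.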
Work on running MapReduce jobs is in progress.