Hadoop 3.0 Erasure Coding Explained

This deep dive article on “Hadoop 3.0 Erasure coding explained” will highlight how the erasure coding will help reducing 50% of storage overhead cost. The storage component (HDFS) of Hadoop 3.0 by default replicates each block 3 times (and could be higher based on configuration). Replication facilitates a simple and robust form of redundancy to protect against failure cases. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. Data block replication is costly as it increases the cost by  200% and other resources (e.g., network bandwidth when writing the data). For data blocks with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space.

Therefore, a natural improvement is to use erasure coding (EC) algorithm in place of replication, which uses far less storage space while with same level of fault tolerance. Hadoop 3.0 Erasure Coding (EC) balances the storage cost by ~50% compared with three times replication factor approach. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel initiated and drove the HDFS-EC project under HDFS-7285 in collaboration with the broader Apache Hadoop community. HDFS-EC is currently targeted for release in Hadoop 3.0.

Hadoop 2.0 HDFS Replication Strategy

  • Three replicas by default
    • 1st replica on local node, or local track or random node
    • 2nd and 3rd replicas on the same remote rack
    • Reliability: tolerance 2 failures
  • Good data locality, local shortcut
  • Multiple copies ==> Parallel IO for parallel compute
  • Very fast block recovery and node recovery
    • Parallel recover
    • 10Tb node recovery 30sec to a few hours
  • 3/x storage overhead vs 1.4 – 1.6 Erasure coding
  • Hadoop’s JBod (JBOD stands for Just a Bunch of Disk. As the name suggests it is just a storage technique where all the storage disc are mounted on a server/machine) is much cheaper
    • 1/10  – 1/20 of SAN
    • 1/10 – 1/5 of NFS

Hadoop 3.0 Erasure coding (EC)

Hadoop 3.0 Erasure coding (EC) implementation is a stream of information theory which extends a message with redundant data for fault tolerance. An Erasure coding codec operates on units of uniformly-sized data termed cells. A codec can take as input a number of data cells and outputs a number of parity cells. This process is called encoding. Together, the data cells and parity cells are termed an erasure coding group. A lost cell can be reconstructed by computing over the remaining cells in the group; this process is called decoding.

Erasure Coding Parity

The simplest form of erasure coding is based on XOR (exclusive-or) operations, shown in picture. XOR operations are associative, meaning that XYZ = (XY) ⊕ Z. This means that XOR can generate 1 parity bit from an arbitrary number of data bits. For example, 1 ⊕ 0 ⊕ 1 ⊕ 1 = 1. When the third bit is lost, it can be recovered by XORing the remaining data bits {1, 0, 1} and the parity bit 1. While XOR can take any number of data cells as input, it is very limited since it can only produce at most one parity cell. So, XOR encoding with group size n can tolerate up to 1 failure with an efficiency of n-1/n (n-1 data cells for a group of n total cells), but is insufficient for systems like HDFS which need to tolerate multiple failures.

Hadoop 3. 0 Erasure Coding Strategy

  1. k data blocks + m parity blocks (k+m)
  2. Reliability: tolerate m failures
  3. Save disk space
  4. Save I/O bandwidth on the wire path

Hadoop 3. 0 EC Saving

Erasure Coding Reconstruction Strategy

Venky Jayaraman

20+ years of extensive expertise developing and managing enterprise level application. 7+ years of extensive experience building and implementing Bigdata solution.