Many Spark users who run Hadoop as the storage layer beneath Spark are asking whether Hadoop 3.0 and Spark 2.x are compatible. Spark 2.2.1 was released in the first week of December 2017, and Hadoop 3.0 reached GA in the second week of December 2017. With the two releases that close together, the combination has clearly not been fully tested, and it is not yet running anywhere in production.
Hadoop 3.0 vs Spark 2.x comparison
Hadoop 3.0 brings a lot of new features under the hood, and Apache Spark 2.2.1 also adds many new functions. Both are big-data frameworks, but they don’t serve the same purpose. Hadoop 3.0 is essentially a distributed data infrastructure with efficient storage: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.
Do you really need Hadoop 3.0 for Spark?
You can use one without the other. Hadoop 3.0 includes not just a storage component, the Hadoop Distributed File System (HDFS), now with an erasure coding implementation, but also a computing component called MapReduce, so you don’t need Spark 2.x to get your computing done. Conversely, you can also use Spark 2.x without Hadoop 3.x. Spark 2.x does not come with its own file management system, so it needs to be integrated with one: if not HDFS, then another data platform such as Cassandra, HBase, or S3. Spark 2.x was designed for Hadoop, however, so many agree they’re better together.
Hadoop 3.x MapReduce vs Spark 2.x computation
Spark 2.x is generally a lot faster than MapReduce because of the way it processes data. While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop. “The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc.,” explained Kirk Borne, principal data scientist at Booz Allen Hamilton. Spark, on the other hand, completes the full set of data analytics operations in memory and in near real time: “Read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done,” Borne said. Spark can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics, he said.
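The two workflows described above can be made concrete with a toy model. The sketch below is plain Python, not Spark or Hadoop code: `Cluster`, `run_stepwise`, and `run_in_memory` are hypothetical names invented for illustration, and the "cluster" is just a list that counts how often it is read and written. It only shows why chaining operations in memory saves storage round trips. It says nothing about real-world timings.

```python
from typing import Callable, List

Op = Callable[[List[int]], List[int]]


class Cluster:
    """Toy stand-in for cluster storage that counts reads and writes."""

    def __init__(self, data: List[int]) -> None:
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self) -> List[int]:
        self.reads += 1
        return list(self.data)

    def write(self, data: List[int]) -> None:
        self.writes += 1
        self.data = list(data)


def run_stepwise(cluster: Cluster, ops: List[Op]) -> List[int]:
    """MapReduce-style: read, apply one operation, write, then repeat."""
    for op in ops:
        cluster.write(op(cluster.read()))
    return cluster.data


def run_in_memory(cluster: Cluster, ops: List[Op]) -> List[int]:
    """Spark-style: read once, apply every operation in memory, write once."""
    data = cluster.read()
    for op in ops:
        data = op(data)
    cluster.write(data)
    return cluster.data


# Three illustrative operations: double, add one, keep values above 3.
ops: List[Op] = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x for x in xs if x > 3],
]

stepwise = Cluster([1, 2, 3])
in_memory = Cluster([1, 2, 3])
stepwise_result = run_stepwise(stepwise, ops)
in_memory_result = run_in_memory(in_memory, ops)

print(stepwise_result, stepwise.reads, stepwise.writes)
print(in_memory_result, in_memory.reads, in_memory.writes)
```

Both runs produce the same result, but the stepwise run touches storage once per operation in each direction, while the in-memory run reads once and writes once regardless of how many operations are chained.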
Failure recovery: different, but still good
Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation, but Spark has similar built-in resiliency because its data objects are stored in resilient distributed datasets (RDDs), which are spread across the cluster. “These data objects can be stored in memory or on disks, and RDD provides full recovery from faults or failures,” Borne pointed out.
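One way to picture how an in-memory data set can still recover from a failure is lineage: an RDD remembers the transformations that produced it, so a lost partition can be rebuilt from its source data. The sketch below is a plain-Python toy under that assumption, not Spark's actual implementation. `ToyRDD` and its methods, including `lose_partition`, are hypothetical names used only for illustration.

```python
class ToyRDD:
    """Toy miniature of an RDD: source partitions plus the recipe (lineage) applied to them."""

    def __init__(self, source_partitions, transforms=()):
        self.source = source_partitions      # durable input, e.g. blocks on disk
        self.transforms = tuple(transforms)  # lineage: functions applied in order
        self.cache = {}                      # partition index -> result held in memory

    def map(self, fn):
        # Record the transformation instead of running it right away.
        return ToyRDD(self.source, self.transforms + (fn,))

    def compute(self, index):
        # Rebuild the partition from source and lineage if it is not in memory.
        if index not in self.cache:
            data = list(self.source[index])
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate a node failure that drops one in-memory partition.
        self.cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)

before = rdd.compute(1)            # partition 1 is computed and cached
rdd.lose_partition(1)              # the cached copy disappears
lost = 1 not in rdd.cache
after = rdd.compute(1)             # recomputed from source data and lineage

print(before, after)
```

The recomputed partition matches the one that was lost, because the source data and the recorded transformations are enough to reproduce it.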