Apache Spark is an exciting technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for keeping large amounts of data in memory, offers tremendous performance improvements. Apache Spark programs can be up to 100 times faster than their MapReduce counterparts.
By supporting Python, Java, Scala, and, most recently, R, Spark is open to a wide range of users: to the science community that traditionally favors Python and R, to the still-widespread Java community, and to people using the increasingly popular Scala, which offers functional programming on the Java virtual machine (JVM).
Spark combines MapReduce-like capabilities for batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning, all in a single framework. This makes it a one-stop shop for most of your big data-crunching needs. It’s no wonder, then, that Spark is one of the busiest and fastest-growing Apache Software Foundation projects today.
Apache Spark & Hadoop Framework
The Hadoop framework, with its HDFS as storage and MapReduce data-processing engine, was the first that brought distributed computing to the masses. Apache Hadoop solved the three main problems facing any distributed data-processing endeavor:
1) Parallelization—How to perform subsets of the computation simultaneously
2) Distribution—How to distribute the data
3) Fault tolerance—How to handle component failure
On top of that, Hadoop clusters are often made of commodity hardware, which makes Apache Hadoop easy to set up. That’s why the last decade saw its wide adoption.
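The three problems listed above can be illustrated on a single machine. The following is only a minimal sketch in plain Scala, not how Hadoop implements them: the object `DistributedSumSketch` and its helpers `partition`, `withRetry`, and `parallelSum` are hypothetical names, threads stand in for cluster nodes, and a simple retry stands in for real failure recovery.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical single-machine analogy; threads play the role of cluster nodes.
object DistributedSumSketch {
  // Distribution: split the input into roughly equal chunks ("partitions").
  def partition(data: Seq[Int], parts: Int): List[Seq[Int]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / parts).toInt)
    data.grouped(size).toList
  }

  // Fault tolerance: re-run a failed task until the attempts are used up.
  def withRetry[A](maxAttempts: Int)(task: () => A): A =
    Try(task()) match {
      case Success(result)               => result
      case Failure(_) if maxAttempts > 1 => withRetry(maxAttempts - 1)(task)
      case Failure(error)                => throw error
    }

  // Parallelization: each chunk is summed in its own Future, then combined.
  def parallelSum(data: Seq[Int], parts: Int): Long = {
    val partials = partition(data, parts).map { chunk =>
      Future(withRetry(3)(() => chunk.map(_.toLong).sum))
    }
    Await.result(Future.sequence(partials), 30.seconds).sum
  }
}
```

Calling `DistributedSumSketch.parallelSum(1 to 100, 4)` splits the range into four chunks, sums them concurrently, and adds the partial sums. A real cluster additionally has to move data between machines and survive the loss of a whole node, which this sketch does not attempt.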
Hadoop’s MapReduce computing has shortcomings. MapReduce job results need to be stored in HDFS before they can be used by another job. For this reason, MapReduce is poorly suited to iterative algorithms, which have to pay for that round trip to disk on every pass over the data.
What Does Apache Spark Bring to the Table?
Apache Spark’s core concept is an in-memory execution model that enables caching job data in memory instead of fetching it from disk every time, as MapReduce does. This can make Spark jobs run up to 100 times faster than the same jobs in Hadoop MapReduce. The effect is greatest on iterative workloads such as machine learning and graph algorithms, and on other types of workloads that need to reuse data.
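To make the difference concrete, here is a minimal sketch in plain Scala that counts how often an iterative job goes back to storage. It is an illustration under stated assumptions, not Spark code: `CachingSketch`, `Source`, `iterateFromDisk`, and `iterateInMemory` are hypothetical names, and `Source.load()` merely stands in for an expensive read from disk.

```scala
// Hypothetical illustration; no Spark or Hadoop involved.
object CachingSketch {
  // Stand-in for a dataset on disk; `reads` counts how often it is loaded.
  final class Source(values: Seq[Double]) {
    var reads = 0
    def load(): Seq[Double] = { reads += 1; values }
  }

  // One iteration: move the estimate halfway toward the mean of the data.
  private def step(data: Seq[Double], estimate: Double): Double =
    estimate + 0.5 * (data.sum / data.size - estimate)

  // MapReduce-style: every iteration fetches its input from storage again.
  def iterateFromDisk(source: Source, iterations: Int): Double = {
    var estimate = 0.0
    for (_ <- 1 to iterations) estimate = step(source.load(), estimate)
    estimate
  }

  // Spark-style: load once, keep the data in memory, reuse it each iteration.
  def iterateInMemory(source: Source, iterations: Int): Double = {
    val cached = source.load()
    var estimate = 0.0
    for (_ <- 1 to iterations) estimate = step(cached, estimate)
    estimate
  }
}
```

Both functions return the same estimate, but after ten iterations the first has loaded the data ten times and the second only once. In Spark itself, the corresponding step is calling `cache()` (or `persist()`) on a dataset that later jobs will reuse.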
Apache Spark – Ease of Use
Spark supports the Scala, Java, Python, and R programming languages, so it’s accessible to a much wider audience. The Spark shell (read-eval-print loop [REPL]) offers an interactive console that can be used for experimentation and idea testing. Spark programs also tend to be compact. The following is all it takes to write a word-count program in Scala:
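import org.apache.spark.sql.SparkSession

// getOrCreate() is required: builder().appName(...) alone returns only a builder
val spark = SparkSession.builder().appName("Spark wordcount").getOrCreate()
val file = spark.sparkContext.textFile("hdfs://...")
// reduceByKey keeps the counts as an RDD that can be saved;
// countByKey would return a local Map, which has no saveAsTextFile
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")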
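To see what each step of the word-count pipeline produces, the same logic can be tried on an ordinary Scala collection, with no cluster or Spark installation. This is a sketch under assumptions: `WordCountSketch` and `wordCount` are hypothetical names, the input is a small in-memory `Seq`, and `groupBy` followed by a sum plays the role of the per-word counting step.

```scala
// Plain-Scala analogue of the word count; runs without Spark.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))    // one element per word
      .filter(_.nonEmpty)                  // drop empty tokens from repeated spaces
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupBy { case (word, _) => word }  // collect the pairs for each word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s
}
```

For example, `WordCountSketch.wordCount(Seq("to be or", "not to be"))` yields a map in which `to` and `be` each have a count of 2, and `or` and `not` each have a count of 1. The Spark version has the same shape, except that the lines and the intermediate pairs are spread across the machines of a cluster.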