Why should you care how Apache Spark works? To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. This article introduces the overall design of Spark as well as its place in the big data ecosystem.
Spark is often considered an alternative to Hadoop MapReduce, since Spark can also be used for distributed data processing with Hadoop. However, Spark’s design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop, although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ. However, Spark’s internals, especially how it handles failures, differ from many traditional systems. Spark’s ability to use lazy evaluation with in-memory computations makes it particularly distinctive. Spark’s creators believe it to be the first high-level programming language for fast, distributed data processing.
How Apache Spark Works & DryadLINQ
DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark’s. However, DryadLINQ doesn’t use in-memory storage. For more information, see the DryadLINQ documentation.
How Spark Fits into the Big Data Ecosystem
Apache Spark is an open source framework that provides generalizable methods for processing data in parallel: the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager: the storage system houses the data processed with Spark, and the cluster manager orchestrates the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: Standalone Cluster Manager, Apache Mesos, and Hadoop YARN. The Standalone Cluster Manager is included in Spark, but using it requires installing Spark on each node of the cluster.
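In practice, the choice between local mode and a cluster manager is usually expressed through the "master" URL given to a Spark application. The following is a minimal plain-Python sketch of that mapping; it does not call Spark. The helper name describe_master and the host names are hypothetical, and the URL forms shown are the conventional ones (older Spark releases used slightly different YARN strings).

```python
import re


def describe_master(master: str) -> str:
    """Hypothetical helper: name the deployment mode a master URL selects."""
    # "local", "local[4]" and "local[*]" all run Spark inside a single JVM.
    if master == "local" or re.fullmatch(r"local\[(\d+|\*)\]", master):
        return "local mode (single JVM)"
    if master.startswith("spark://"):
        return "Standalone Cluster Manager"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master == "yarn":
        return "Hadoop YARN"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Example inputs; the host names and ports are placeholders.
for url in ["local[*]", "spark://master-host:7077", "mesos://mesos-host:5050", "yarn"]:
    print(url, "->", describe_master(url))
```

The application code stays the same in each case; only the master setting, and therefore where the Spark JVMs are started, changes.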
Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined “coarse-grained” transformations (functions that are applied to the entire dataset), such as map, join, and reduce, to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs.
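To make "lazily evaluated" and "coarse-grained" concrete, here is a toy, single-machine sketch in plain Python. It is not Spark's implementation and does not use the Spark API; the LazyDataset class is a hypothetical stand-in. Each transformation applies to the whole dataset and only records a step. Nothing is computed until a result is requested, which in this sketch happens in collect and reduce.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable yielding records
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Record the step and return a new dataset; no data is touched yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Replay every recorded step over the whole dataset.
        records = iter(self._source())
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    def collect(self):
        # Requesting a result is what forces evaluation.
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


numbers = LazyDataset(lambda: range(1, 6))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = squares_of_evens.reduce(lambda a, b: a + b)  # evaluation happens here
```

Because each dataset keeps the recipe that produced it rather than the data, a lost piece can in principle be recomputed from its source. That is the intuition behind the "resilient" in the name, although real RDDs track this per partition across a cluster.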
In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX, which provide more specific data processing functionality. Some of these components have the same general performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations. Spark SQL, for example, has a different query optimizer than Spark Core.
Spark SQL: Spark Component
Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type, called DataFrames, and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets. Spark SQL is a very important component for Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL.
Spark Machine Learning Packages
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in the early stages and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames. Eventually, the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have performance considerations in addition to those of Spark Core and Spark SQL.
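The idea of a machine learning pipeline can be illustrated with a small plain-Python sketch. This is a toy model of the concept only, not the Spark ML API: the Scale, Threshold, and Pipeline classes below are hypothetical, and they operate on ordinary lists rather than DataFrames. Each stage is fitted on the output of the previous one, and the fitted chain is then applied to new data in the same order.

```python
class Scale:
    """Toy stage: learns the largest absolute value, then divides by it."""

    def fit(self, rows):
        self.max_abs = max(abs(x) for x in rows) or 1.0  # avoid dividing by zero
        return self

    def transform(self, rows):
        return [x / self.max_abs for x in rows]


class Threshold:
    """Toy stage with nothing to learn: labels values at or above a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [1 if x >= self.cutoff else 0 for x in rows]


class Pipeline:
    """Chains stages so they can be fitted and applied as a single unit."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows):
        for stage in self.stages:
            # Each stage is fitted on the data produced by the stages before it.
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([Scale(), Threshold(0.5)]).fit([2.0, -4.0, 1.0])
labels = model.transform([4.0, 1.0, -2.0])
```

Packaging the steps this way means the same preprocessing learned from the training data is reused on new data, which is the practical benefit a pipeline abstraction aims for.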
Spark Streaming uses the scheduling of Spark Core to perform streaming analytics on mini-batches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches.
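The mini-batch model and the role of window sizes can be sketched in plain Python. This is a simplified illustration that does not use Spark Streaming; the function names mini_batches and windowed_counts are hypothetical, timestamps are assumed to be non-negative numbers of seconds, and windows are measured in whole batches.

```python
from collections import defaultdict


def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    grouped = defaultdict(list)
    for timestamp, value in events:
        grouped[int(timestamp // batch_interval)].append(value)
    if not grouped:
        return []
    # Emit a batch for every interval, including intervals with no events.
    return [grouped.get(i, []) for i in range(max(grouped) + 1)]


def windowed_counts(batches, window_length, slide):
    """Count records in sliding windows; both sizes are in numbers of batches."""
    counts = []
    for end in range(window_length, len(batches) + 1, slide):
        window = batches[end - window_length:end]
        counts.append(sum(len(batch) for batch in window))
    return counts


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d"), (3.7, "e")]
batches = mini_batches(events, batch_interval=1.0)
counts = windowed_counts(batches, window_length=2, slide=1)
```

A longer window means each result covers, and must retain, more batches, while a shorter slide means results are produced more often. This trade-off is one reason window sizes are a performance consideration.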
GraphX is a graph processing framework built on top of Spark with an API for graph computations. GraphX is one of the least mature components of Spark, so we don’t cover it in much detail. In future versions of Spark, typed graph functionality will be introduced on top of the Dataset API.