How Apache Spark Works

Why should you care how Apache Spark works? To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. This article introduces the overall design of Spark as well as its place in the big data ecosystem.

Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop. However, Spark's design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop, although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ. However, Spark's internals, especially how it handles failures, differ from many traditional systems. Spark's ability to leverage lazy evaluation with in-memory computations makes it unique. Spark's creators believe it to be the first high-level programming language for fast, distributed data processing.

How Apache Spark Works & DryadLINQ

DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark's. However, DryadLINQ doesn't use in-memory storage. For more information, see the DryadLINQ documentation.

How Spark Fits into the Big Data Ecosystem

Apache Spark is an open source framework that provides generalizable methods to process data in parallel; the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager: the storage system houses the data processed with Spark, and the cluster manager orchestrates the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: Standalone Cluster Manager, Apache Mesos, and Hadoop YARN. The Standalone Cluster Manager is included in Spark, but using the Standalone manager requires installing Spark on each node of the cluster.
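As a rough illustration of the deployment options above, the sketch below maps a Spark master URL string to the deployment mode it would select. The helper name describe_master and the host names are hypothetical and not part of Spark; only the URL prefixes (local, spark://, mesos://, yarn) follow Spark's master URL formats. It is plain Python and does not require Spark to run.

```python
def describe_master(master: str) -> str:
    """Return which deployment mode a Spark master URL would select.

    Hypothetical helper for illustration only; Spark parses master URLs itself.
    """
    if master == "local" or master.startswith("local["):
        return "local mode (single JVM on one machine)"
    if master.startswith("spark://"):
        return "Standalone Cluster Manager"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master == "yarn":
        return "Hadoop YARN"
    return "unknown"


# Example master URLs; the host names and ports are placeholders.
for url in ["local[*]", "spark://host:7077", "mesos://host:5050", "yarn"]:
    print(url, "->", describe_master(url))
```

In a real application, the master URL is typically supplied when the application is submitted or when the Spark session is created, rather than inspected by user code.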

Spark Execution Model

Spark Components

Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined coarse-grained transformations (functions that are applied to the entire dataset), such as map, join, and reduce, to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs.
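To make the idea of lazy evaluation more concrete, here is a minimal pure-Python sketch. The ToyRDD class is hypothetical: it is not Spark's RDD and it is not distributed. It only imitates the behavior described above, where transformations are recorded and nothing is computed until a result is requested.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them only on demand."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original data (a plain list here)
        self._lineage = lineage  # recorded (kind, function) steps, not yet executed

    # Transformations return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    # Requesting the result replays the recorded steps over the source data.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records which elements were actually processed


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4])
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(square)

print(calls)                       # [] because nothing has run yet
print(squares_of_evens.collect())  # [4, 16]
print(calls)                       # [2, 4]
```

Because each ToyRDD keeps the full list of steps that produced it, the result can be recomputed from the source at any time. This is only a loose analogy for how Spark is able to rebuild lost data after a failure.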

In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX, which provide more specific data processing functionality. Some of these components have the same general performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations. Spark SQL, for example, has a different query optimizer than Spark Core.

Spark SQL: Spark Component

Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type, called DataFrames, and as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets. Spark SQL is a very important component of Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL.

Spark Machine Learning Packages

Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in the early stages and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames. Eventually, the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have performance considerations beyond those of Spark Core and Spark SQL.

Spark Streaming

Spark Streaming uses the scheduling of the Spark Core for streaming analytics on mini-batches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches.
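The mini-batch idea can be sketched in plain Python. This is a toy illustration, not the Spark Streaming API: the function names mini_batches and windowed_counts are hypothetical, timestamps are assumed to be non-negative seconds, and the window is assumed to cover a whole number of batches.

```python
from collections import defaultdict


def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of batch_interval seconds."""
    grouped = defaultdict(list)
    for ts, value in events:
        grouped[int(ts // batch_interval)].append(value)
    if not grouped:
        return []
    # Include empty batches so that list positions line up with time intervals.
    return [grouped.get(i, []) for i in range(max(grouped) + 1)]


def windowed_counts(batches, window_batches):
    """Count the events in a sliding window over the most recent window_batches batches."""
    counts = []
    for end in range(1, len(batches) + 1):
        start = max(0, end - window_batches)
        counts.append(sum(len(batch) for batch in batches[start:end]))
    return counts


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d")]
batches = mini_batches(events, batch_interval=1)

print(batches)                      # [['a', 'b'], ['c'], [], ['d']]
print(windowed_counts(batches, 2))  # [2, 3, 1, 1]
```

A larger window means more batches must be kept around and re-read for each result, which hints at why window size is a performance consideration.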

GraphX

GraphX is a graph processing framework built on top of Spark with an API for graph computations. GraphX is one of the least mature components of Spark, so we don't cover it in much detail. In future versions of Spark, typed graph functionality will be introduced on top of the Dataset API.

Additional PySpark Resources & Reading Material

PySpark Frequently Asked Questions

Refer to our PySpark FAQ space, where important questions and information are clarified. It also links to important PySpark tutorial pages within the site.

PySpark Example Code

Find our GitHub repository, which lists PySpark examples with code snippets.


27 Sep 2018

Apache Spark
