Spark Model of Parallel Computing

Spark's model of parallel computing is built around the Resilient Distributed Dataset (RDD), the core abstraction exposed by the Spark Core library.

Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs—immutable, distributed collections of objects—which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation.

Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in the following sections, this combination of lazy evaluation and immutable, in-memory data has important consequences for performance, fault tolerance, and debugging.

Spark Model of Parallel Computing & Lazy Evaluation

Many other systems for in-memory storage are based on “fine-grained” updates to mutable objects, i.e., updates to a particular cell in a table, storing intermediate results as they go. In contrast, evaluation of RDDs is completely lazy. Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce
each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.
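To make the transformation/action distinction concrete, here is a minimal sketch, assuming an existing SparkContext named sc and a hypothetical input path. None of the transformations do any work until the action triggers the scheduler:

// A sketch assuming an existing SparkContext named sc and a hypothetical input path.
val lines = sc.textFile("hdfs:///tmp/example.txt") // transformation: nothing is read yet
val words = lines.flatMap(_.split(" "))            // transformation: only the DAG of dependencies grows
val longWords = words.filter(_.length > 5)         // transformation: still no computation
val total = longWords.count()                      // action: the scheduler builds the DAG and runs the job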

Not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.
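For instance, in the following sketch (assuming an existing SparkContext named sc), calling sortByKey launches a job right away to sample the keys for its range partitioner, even though no action has been called:

// A sketch assuming an existing SparkContext named sc.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
// sortByKey returns an RDD like other transformations, but it samples the
// data to determine the range of the keys, which launches a job immediately.
val sorted = pairs.sortByKey()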

Performance and usability advantages of lazy evaluation

Lazy evaluation allows Spark to combine operations that don’t require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for
both the map and the filter to each executor. Then Spark can perform both the map and the filter on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This roughly halves the number of passes over the data for this part of the computation.
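As a sketch of this pipelining, assuming an existing RDD of integers named numbers, the map and the filter below are applied together in a single pass over each partition when the action runs:

// Assumes an existing RDD[Int] named numbers.
val doubledEvens = numbers.map(_ * 2).filter(_ % 4 == 0)
// Both functions are applied to each record in one pass over each partition
// when the action below triggers evaluation.
doubledEvens.count()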

Spark’s lazy evaluation paradigm is not only more efficient, but it is also easier to implement the same logic in Spark than in a different framework—like MapReduce—that requires the developer to do the work to consolidate her mapping operations. Spark’s clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer
lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them.

Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Spark implementation is roughly fifteen lines of code in Java and five in Scala:

import org.apache.spark.rdd.RDD

def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
   val words = rdd.flatMap(_.split(" "))         // split each line into words
   val wordPairs = words.map((_, 1))             // pair each word with a count of 1
   val wordCounts = wordPairs.reduceByKey(_ + _) // sum the counts for each word
   wordCounts
}
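The function might be used as follows (a sketch assuming an existing SparkContext named sc and a hypothetical input path):

// Hypothetical usage of simpleWordCount.
val documents = sc.textFile("hdfs:///tmp/documents.txt")
val counts = simpleWordCount(documents)
counts.collect().foreach { case (word, count) => println(s"$word: $count") }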

A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some “stop words” and punctuation from each document before computing the word count.

def withStopWordsFiltered(rdd: RDD[String], illegalTokens: Array[Char],
      stopWords: Set[String]): RDD[(String, Int)] = {
   // treat the illegal tokens, plus spaces, as word separators
   val separators = illegalTokens ++ Array[Char](' ')
   val tokens: RDD[String] = rdd.flatMap(
      _.split(separators).map(_.trim.toLowerCase))
   // drop stop words and empty tokens before counting
   val words = tokens.filter(token =>
      !stopWords.contains(token) && (token.length > 0))
   val wordPairs = words.map((_, 1))
   val wordCounts = wordPairs.reduceByKey(_ + _)
   wordCounts
}
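A hypothetical invocation, with illustrative stop words and punctuation and an existing RDD[String] named documents, might look like this:

// Illustrative stop words and separator characters; not an exhaustive list.
val stopWords = Set("the", "a", "an", "of", "and")
val punctuation = Array(',', '.', ';', '!', '?')
val filteredCounts = withStopWordsFiltered(documents, punctuation, stopWords)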

Lazy evaluation and fault tolerance

Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark achieves fault tolerance in a unique way: each partition of the data contains the dependency information needed to recalculate that partition. Most distributed computing paradigms
that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines.
In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be
parallelized to make recovery faster.
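One way to see the lineage Spark keeps for this purpose is the toDebugString method on an RDD, which prints the chain of parent RDDs that would be used to recompute a lost partition. A minimal sketch, assuming an existing RDD[String] named documents:

val counts = documents
   .flatMap(_.split(" "))
   .map((_, 1))
   .reduceByKey(_ + _)
// Prints the lineage (chain of parent RDDs and their dependencies) that Spark
// would use to recompute any lost partition of counts.
println(counts.toDebugString)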

Lazy evaluation and debugging

Lazy evaluation has important consequences for debugging since it means that a Spark program will fail only at the point of an action. For example, suppose that you were using the word count example, and afterward were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would, of course, fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason, it is probably most efficient to develop in an environment that gives you access to complete debugging information.
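A sketch of this failure, assuming the withStopWordsFiltered function from earlier and an existing RDD[String] named documents:

val stopWords: Set[String] = null   // e.g., produced by faulty upstream Java code
// No error here: the transformations inside withStopWordsFiltered are lazy.
val counts = withStopWordsFiltered(documents, Array(','), stopWords)
// The NullPointerException from the contains check only surfaces here, at the action.
counts.collect()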

Because of lazy evaluation, stack traces from failed Spark jobs (especially when embedded in larger systems) will often appear to fail consistently at the point of the action, even if the problem in the logic occurs in a transformation much earlier in the program.

 
