Spark Model Of Parallel Computing
The Spark model of parallel computing is built around the RDD (Resilient Distributed Dataset), an important API that is part of the Spark Core library.
Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs, immutable, distributed collections of objects, which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation.
Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in the rest of this section, this combination of lazy evaluation and immutability has important consequences for performance, fault tolerance, and debugging.
Spark Model of Parallel Computing & Lazy Evaluation
Many other systems for in-memory storage are based on fine-grained updates to mutable objects, i.e., updates to a particular cell in a table, storing intermediate results. In contrast, evaluation of RDDs is completely lazy. Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (called the DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.
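As a minimal sketch of this behavior (assuming a SparkContext named sc is already available), the transformations below only record lineage; no work is done on the cluster until the count action at the end is called:

// Assumes an existing SparkContext called `sc`.
val numbers = sc.parallelize(1 to 1000000) // creates an RDD; nothing is computed yet
val doubled = numbers.map(_ * 2)           // transformation: still nothing runs
// count is an action: only here does Spark build the DAG, schedule stages,
// and actually compute the partitions.
val total = doubled.count()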
Not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.
Performance and usability advantages of lazy evaluation
Lazy evaluation allows Spark to combine operations that don't require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for both the map and the filter to each executor. Then Spark can perform both the map and filter on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This theoretically reduces the computational complexity by half.
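A small sketch of this pattern (assuming an existing RDD[Int] named rdd): the map and the filter below are simply chained, and when an action is eventually called, both are applied in a single pass over each partition rather than in two separate passes:

// Chained narrow transformations: no data is touched yet.
val mappedAndFiltered = rdd.map(_ * 2).filter(_ > 10)
// When an action such as count is called, each executor applies the map
// and the filter together, record by record, in one pass over its partitions.
val howMany = mappedAndFiltered.count()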
Spark's lazy evaluation paradigm is not only more efficient, but it is also easier to implement the same logic in Spark than in a different framework, like MapReduce, that requires the developer to do the work to consolidate her mapping operations. Spark's clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them.
Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Spark implementations are roughly fifteen lines of code in Java and five in Scala:
def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
  // Split each line into words.
  val words = rdd.flatMap(_.split(" "))
  // Pair each word with a count of 1.
  val wordPairs = words.map((_, 1))
  // Sum the counts for each word.
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some stop words and punctuation from each document before computing the word count.
def withStopWordsFiltered(
    rdd: RDD[String],
    illegalTokens: Array[Char],
    stopWords: Set[String]): RDD[(String, Int)] = {
  // Split on the illegal tokens as well as on spaces.
  val separators = illegalTokens ++ Array[Char](' ')
  val tokens: RDD[String] = rdd.flatMap(_.split(separators)
    .map(_.trim.toLowerCase))
  // Drop stop words and empty tokens.
  val words = tokens.filter(token =>
    !stopWords.contains(token) && (token.length > 0))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
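A hypothetical usage sketch follows; the input path, separator characters, and stop-word set are illustrative choices, not values from the original example:

// Assumes an existing SparkContext called `sc`; the path is hypothetical.
val lines = sc.textFile("hdfs:///data/documents.txt")
val counts = withStopWordsFiltered(lines, Array(',', '.', '!', '?'), Set("the", "a", "and"))
counts.take(10).foreach(println) // take is an action, so evaluation happens here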
Lazy evaluation and fault tolerance
Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark's unique method of fault tolerance is achieved because each partition of the data contains the dependency information needed to recalculate the partition. Most distributed computing paradigms that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines. In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.
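As a rough illustration of the lineage information an RDD carries, toDebugString prints the chain of dependencies Spark would use to recompute lost partitions. This sketch reuses the simpleWordCount function above and assumes an input RDD[String] named rdd:

val wordCounts = simpleWordCount(rdd) // rdd is an assumed RDD[String] of documents
// Prints the RDD's lineage: the shuffled reduceByKey stage and, beneath it,
// the narrow flatMap/map dependencies it was derived from.
println(wordCounts.toDebugString)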
Lazy evaluation and debugging
Lazy evaluation has important consequences for debugging since it means that a Spark program will fail only at the point of action. For example, suppose that you were using the word count example, and afterward were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would, of course, fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason, it is probably most efficient to develop in an environment that gives you access to complete debugging information.
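A minimal sketch of this failure mode, reusing the withStopWordsFiltered function above and an assumed input RDD[String] named rdd (the null stop-word set stands in for a value that arrived as null from, say, a Java caller):

val stopWords: Set[String] = null // e.g., a null value handed over from Java code
// Defining the transformations succeeds: nothing has been evaluated yet.
val filteredCounts = withStopWordsFiltered(rdd, Array(','), stopWords)
// The NullPointerException from the contains check only surfaces here,
// when collect forces evaluation, so the stack trace points at collect.
val results = filteredCounts.collect()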
Because of lazy evaluation, stack traces from failed Spark jobs (especially when embedded in larger systems) will often appear to fail consistently at the point of the action, even if the problem in the logic occurs in a transformation much earlier in the program.