Immutability and RDD Interface in Spark
Immutability and the RDD interface are key Spark concepts, and they must be understood in detail. Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD's dependencies and information about data locality that are needed for the execution engine to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on one RDD will not modify the original RDD but rather return a new RDD object with a new definition of the RDD's properties.
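As a quick illustration, the following minimal sketch (using a local SparkSession and made-up values; the names spark and sc are just local conventions) shows that a transformation leaves the original RDD untouched:

```scala
import org.apache.spark.sql.SparkSession

object ImmutabilityExample {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration purposes only
    val spark = SparkSession.builder()
      .appName("ImmutabilityExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 5)
    // map returns a brand-new RDD; `numbers` itself is unchanged
    val doubled = numbers.map(_ * 2)

    println(numbers.collect().mkString(", ")) // 1, 2, 3, 4, 5
    println(doubled.collect().mkString(", ")) // 2, 4, 6, 8, 10

    spark.stop()
  }
}
```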
RDDs can be created in three ways: (1) by transforming an existing RDD; (2) from a SparkContext, which is the API's gateway to Spark for your application; and (3) by converting a DataFrame or Dataset. The SparkContext represents the connection between a Spark cluster and one running Spark application. The SparkContext can be used to create an RDD from a local Scala object (using the makeRDD or parallelize methods) or by reading from stable storage (text files, binary files, a Hadoop Context, or a Hadoop file). DataFrames and Datasets can be read using the Spark SQL equivalent to a SparkContext, the SparkSession.
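A sketch of the three creation paths is shown below. It assumes the spark and sc values from the previous example, and the HDFS paths are hypothetical placeholders:

```scala
// (1) From an existing RDD via a transformation
val words = sc.parallelize(Seq("spark", "rdd", "interface")) // or sc.makeRDD(...)
val upper = words.map(_.toUpperCase)

// (2) From the SparkContext, reading stable storage
val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path

// (3) By converting a DataFrame/Dataset created through the SparkSession
val df = spark.read.json("hdfs:///data/people.json") // hypothetical path
val rowRdd = df.rdd // RDD[Row]
```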
Internally, Spark uses five main properties to represent an RDD. The three required properties are the list of partition objects that make up the RDD, a function for computing an iterator of each partition, and a list of dependencies on other RDDs. Optionally, RDDs also include a partitioner (for RDDs of rows of key/value pairs represented as Scala tuples) and a list of preferred locations (e.g., for an HDFS file). As an end user, you will rarely need these five properties directly and are more likely to use predefined RDD transformations. However, it is helpful to understand the properties and know how to access them for debugging and for a better conceptual understanding.
These five properties correspond to the following five methods available to the end user (you), demonstrated in the sketch after this list:
- partitions() Returns an array of the partition objects that make up the parts of the distributed dataset. In the case of an RDD with a partitioner, the value of the index of each partition will correspond to the value of the getPartition function for each key in the data associated with that partition.
- iterator(p, parentIters) Computes the elements of partition p given iterators for each of its parent partitions. This function is called in order to compute each of the partitions in this RDD. It is not intended to be called directly by the user; rather, it is used by Spark when computing actions. Still, referencing the implementation of this function can be useful in determining how each partition of an RDD transformation is evaluated.
- dependencies() Returns a sequence of dependency objects. The dependencies let the scheduler know how this RDD depends on other RDDs. There are two kinds of dependencies: narrow dependencies (NarrowDependency objects), which represent partitions that depend on one or a small subset of partitions in the parent, and wide dependencies (ShuffleDependency objects), which are used when a partition can only be computed by rearranging all the data in the parent.
- partitioner() Returns a Scala Option of a partitioner object if the RDD has a function between element and partition associated with it, such as a HashPartitioner. This function returns None for all RDDs that are not of tuple type (that do not represent key/value data). An RDD that represents an HDFS file (implemented in NewHadoopRDD.scala) has a partition for each block of the file.
- preferredLocations(p) Returns information about the data locality of a partition, p. Specifically, this function returns a sequence of strings representing some information about each of the nodes where the split p is stored. In an RDD representing an HDFS file, each string in the result of preferredLocations is the Hadoop name of the node where that partition is stored.
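The sketch below shows one way to inspect these properties for debugging, again assuming a local SparkContext named sc and made-up key/value data. The Scala API exposes partitions, dependencies, and partitioner as parameterless members:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)

// partitions: the partition objects that make up the RDD
println(summed.partitions.length)

// dependencies: a ShuffleDependency here, because reduceByKey repartitions by key
summed.dependencies.foreach(dep => println(dep.getClass.getSimpleName))

// partitioner: None for the plain parallelized RDD, Some(HashPartitioner) after the shuffle
println(pairs.partitioner)
println(summed.partitioner)

// preferredLocations: locality hints for one partition (usually empty for parallelized data)
println(summed.preferredLocations(summed.partitions(0)))
```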
Types of RDDs
The implementation of the Spark Scala API contains an abstract class, RDD, which contains not only the five core functions of RDDs, but also those transformations and actions that are available to all RDDs, such as map and collect. Functions defined only on RDDs of a particular type are defined in several RDD function classes, including PairRDDFunctions, OrderedRDDFunctions, and GroupedRDDFunctions. The additional methods in these classes are made available by an implicit conversion from the abstract RDD class, based on type information or when a transformation is applied to an RDD.
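For example, the following sketch (with assumed sample data and the local sc value from earlier) relies on those implicit conversions: the key/value methods only become available once the element type is a tuple:

```scala
import org.apache.spark.rdd.RDD

val words: RDD[String] = sc.parallelize(Seq("spark", "rdd", "spark"))

// reduceByKey is not defined on RDD itself; it lives in PairRDDFunctions and is
// made available through an implicit conversion once the elements are tuples
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// sortByKey comes from OrderedRDDFunctions, available because String has an Ordering
val sorted = counts.sortByKey()
```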
The Spark API also contains implementations of the RDD class that define more specific behavior by overriding the core properties of the RDD. These include the NewHadoopRDD class discussed previously, which represents an RDD created from an HDFS filesystem, and ShuffledRDD, which represents an RDD that was already partitioned.
Each of these RDD implementations contains functionality that is specific to RDDs of that type. Creating an RDD, either through a transformation or from a SparkContext, will return one of these implementations of the RDD class. Some RDD operations have a different signature in Java than in Scala; these are defined in the JavaRDD.java class.
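One way to see which concrete implementation you get back is to print the class name or the lineage, as in the rough sketch below (the HDFS path is hypothetical, and the exact class names printed can vary between Spark versions):

```scala
val fromFile = sc.textFile("hdfs:///data/input.txt") // hypothetical path
val shuffled = sc.parallelize(Seq((1, "a"), (2, "b"))).groupByKey()

println(fromFile.getClass.getSimpleName) // e.g. MapPartitionsRDD, wrapping a Hadoop-backed RDD
println(shuffled.getClass.getSimpleName) // e.g. ShuffledRDD

// toDebugString prints the full lineage, including the underlying Hadoop-backed RDD
println(fromFile.toDebugString)
```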
Functions on RDDs: Transformations Versus Actions
There are two types of functions defined on RDDs: actions and transformations. Actions are functions that return something that is not an RDD, including a side effect, and transformations are functions that return another RDD. Each Spark program must contain an action, since actions either bring information back to the driver or write the data to stable storage. Actions are what force evaluation of a Spark program. Persist calls also force evaluation, but usually do not mark the end of a Spark job. Actions that bring data back to the driver include collect, count, collectAsMap, sample, reduce, and take.
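The following sketch illustrates the distinction, again assuming the local sc value and hypothetical HDFS paths. The transformations build up a lineage lazily; only the actions at the end trigger computation:

```scala
val logs = sc.textFile("hdfs:///data/app.log") // hypothetical path

// Transformation: returns a new RDD and is evaluated lazily
val errors = logs.filter(_.contains("ERROR"))

// Nothing has been read or computed yet; the action below forces evaluation of the
// whole lineage and brings a result back to the driver
val errorCount = errors.count()
println(errorCount)

// An action can also write to stable storage instead of returning data to the driver
errors.saveAsTextFile("hdfs:///data/errors-out") // hypothetical output path
```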