Immutability and RDD Interface in Spark are key concepts and it must be understood in detail. Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD’s dependencies and information about data locality that are needed for the execution engine to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on one RDD will not modify the original RDD but rather return a new RDD object with a new definition of the RDD’s properties
RDDs can be created in three ways: (1) by transforming an existing RDD; (2) from a SparkContext, which is the API’s gateway to Spark for your application; and (3) converting a DataFrame or Dataset. The SparkContext represents the connection between a Spark cluster and one running Spark application. The SparkContext can be used to create an RDD from a local Scala object (using the make RDD or parallelize methods) or by reading from stable storage (text files, binary files, a Hadoop Context, or a Hadoop file). DataFrames and Datasets can be read using the Spark SQL equivalent to a SparkContext, the SparkSession.
Internally, Spark uses five main properties to represent an RDD. The three required properties are the list of partition objects that make up the RDD, a function for computing an iterator of each partition, and a list of dependencies on other RDDs. Optionally, RDDs also include a partitioner (for RDDs of rows of key/value pairs represented as Scala tuples) and a list of preferred locations (for the HDFS file). As an end user, you will rarely need these five properties and are more likely to use predefined RDD transformations. However, it is helpful to understand the properties and know how to access them for debugging and for a better conceptual understanding.
These five properties correspond to the following five methods available to the end user (you):
Returns an array of the partition objects that make up the parts of the distributed dataset. In the case of an RDD with a partitioner, the value of the index of each partition will correspond to the value of the getPartition function for each key in the data associated with that partition.
- iterator(p, parentIters)
Computes the elements of partition p given iterators for each of its parent partitions. This function is called in order to compute each of the partitions in this RDD. This is not intended to be called directly by the user. Rather, this is used by Spark when computing actions. Still, referencing the implementation of this function can be useful in determining how each partition of an RDD transformation is evaluated.
Returns a sequence of dependency objects. The dependencies let the scheduler know how this RDD depends on other RDDs. There are two kinds of dependencies: narrow dependencies (NarrowDependency objects), which represent partitions that depend on one or a small subset of partitions in the parent, and wide dependencies (ShuffleDependency objects), which are used when a partition can only be computed by rearranging all the data in the parent.
Returns a Scala option type of a partitioner object if the RDD has a function between element and partition associated with it, such as a hashPartitioner. This function returns None for all RDDs that are not of type tuple (do not represent key/value data). An RDD that represents an HDFS file (implemented in NewHadoopRDD.scala) has a partition for each block of the file.
Returns information about the data locality of a partition, p. Specifically, this function returns a sequence of strings representing some information about each of the nodes where the split p is stored. In an RDD representing an HDFS file, each string in the result of preferredLocations is the Hadoop name of the node
where that partition is stored.
Types of RDDs
The implementation of the Spark Scala API contains an abstract class, RDD, which contains not only the five core functions of RDDs, but also those transformations and actions that are available to all RDDs, such as map and collect. Functions defined only on RDDs of a particular type are defined in several RDD function classes,
including PairRDDFunctions, OrderedRDDFunctions, and GroupedRDDFunctions. The additional methods in these classes are made available by an implicit conversion from the abstract RDD class, based on type information or when a transformation is applied to an RDD.
The Spark API also contains implementations of the RDD class that define more specific behavior by overriding the core properties of the RDD. These include the NewHadoopRDD class discussed previously—which represents an RDD created from an HDFS filesystem—and ShuffledRDD, which represents an RDD that was already partitioned.
Each of these RDD implementations contains functionality that is specific to RDDs of that type. Creating an RDD, either through a transformation or from a SparkContext, will return one of these implementations of the RDD class. Some RDD operations have a different signature in Java than in Scala. These are defined in the JavaRDD.java class.
Functions on RDDs: Transformations Versus Actions
There are two types of functions defined on RDDs: actions and transformations. Actions are functions that return something that is not an RDD, including a side effect, and transformations are functions that return another RDD.
Each Spark program must contain an action since actions either bring information back to the driver or write the data to stable storage. Actions are what force evaluation of a Spark program. Persist calls also force evaluation, but usually, do not mark the end of Spark job. Actions that bring data back to the driver include collect, count, collectAsMap, sample, reduce, and take.