In-Memory Persistence and Memory Management
Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this advantage comes from Spark’s use of in-memory persistence. Rather than writing to disk between passes through the data, Spark can keep the data loaded in memory on the executors. That way, the data in each partition is available in memory each time it needs to be accessed.
Spark offers three options for memory management: in-memory as deserialized data, in-memory as serialized data, and on disk. Each has different space and time advantages:
- In memory as deserialized Java objects
- As serialized data
- On disk
In memory as deserialized Java objects
The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory storage is the fastest, since it avoids serialization overhead; however, it may not be the most memory efficient, since it requires the data to be stored as full objects.
As serialized data
Using the standard Java serialization library, Spark objects are converted into streams of bytes as they are moved around the network. This approach may be slower, since serialized data is more CPU-intensive to read than deserialized data; however, it is often more memory efficient, since it allows the user to choose a more efficient representation. While Java serialization is more space efficient than storing full objects, Kryo serialization can be even more space efficient.
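As a rough illustration of choosing a more compact representation, the sketch below switches the serializer to Kryo through `SparkConf`. The `Reading` case class and the application name are hypothetical and stand in for whatever record type a job actually stores; this is only one way to set it up.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Hypothetical record type, used only for illustration.
  case class Reading(sensorId: Int, value: Double)

  // Build a configuration that uses Kryo instead of Java serialization.
  def kryoConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-sketch") // assumed application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a short ID instead of the full class name.
      .registerKryoClasses(Array(classOf[Reading]))
}
```

Registering the classes that will be serialized is optional, but unregistered classes are written with their full class name, which costs space.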
On disk
RDDs whose partitions are too large to be stored in RAM on each of the executors can be written to disk. This strategy is obviously slower for repeated computations, but it can be more fault-tolerant for long sequences of transformations and may be the only feasible option for enormous computations.
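The three options above correspond to constants on Spark's `StorageLevel` object. The sketch below inspects the flags on a few of those constants to show how each one combines location (memory or disk) with form (deserialized or serialized). The `describe` helper is a hypothetical name introduced here, not part of Spark.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Hypothetical helper: summarize a storage level from the flags Spark exposes on it.
  def describe(level: StorageLevel): String = {
    val where = (level.useMemory, level.useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not persisted"
    }
    val form = if (level.deserialized) "deserialized objects" else "serialized bytes"
    s"$where, as $form"
  }

  def main(args: Array[String]): Unit = {
    println(describe(StorageLevel.MEMORY_ONLY))     // in memory as deserialized objects
    println(describe(StorageLevel.MEMORY_ONLY_SER)) // in memory as serialized data
    println(describe(StorageLevel.DISK_ONLY))       // on disk
  }
}
```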
The persist() function in the RDD class lets the user control how the RDD is stored. By default, persist() stores an RDD as deserialized objects in memory, but the user can pass one of the numerous storage options to the persist() function to control how the RDD is stored. We will cover the different options for RDD reuse in “Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files”. When persisting RDDs, the default implementation of RDDs evicts the least recently used partition (called LRU caching) if the space it takes is required to compute or to cache a new partition. However, you can change this behavior and control Spark’s memory prioritization with the persistencePriority() function in the RDD class.
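A minimal sketch of passing a storage option to persist() might look like the following. It assumes a local master (`local[2]`) and a small synthetic dataset, both chosen only for illustration; a real job would take its master and input from the deployment.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

      // Ask Spark to keep the partitions in memory as serialized bytes.
      squares.persist(StorageLevel.MEMORY_ONLY_SER)

      val total = squares.reduce(_ + _) // first action computes and stores the partitions
      val count = squares.count()       // second action can read the stored partitions

      println(s"level=${squares.getStorageLevel.description}, total=$total, count=$count")

      // Release the stored partitions once they are no longer needed.
      squares.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Calling persist() only marks the RDD for storage; nothing is stored until the first action runs. Spark also does not allow the storage level of an RDD to be changed once one has been assigned, so the level should be chosen before the first persist() call.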