Spark In-Memory Persistence And Memory Management
Spark's performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark's use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory. That way, the data on each partition is available in memory each time it needs to be accessed.
Spark offers three options for memory management: in memory as deserialized data, in memory as serialized data, and on disk. Each has different space and time advantages:
In memory as deserialized Java objects
The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory storage is the fastest, since it reduces serialization time; however, it may not be the most memory efficient, since it requires the data to be stored as objects.
As serialized data
Using the standard Java serialization library, Spark objects are converted into streams of bytes as they are moved around the network. This approach may be slower, since serialized data is more CPU-intensive to read than deserialized data; however, it is often more memory efficient, since it allows the user to choose a more compact representation. While Java serialization is more space efficient than storing full objects, Kryo serialization can be even more space efficient.
On disk
RDDs whose partitions are too large to be stored in RAM on each of the executors can be written to disk. This strategy is obviously slower for repeated computations, but it can be more fault-tolerant for long sequences of transformations and may be the only feasible option for enormous computations. A brief code sketch of these three options follows the list.
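To make the trade-offs concrete, the sketch below persists variants of the same dataset under the storage level that corresponds to each of the three options. The application name, local master, dataset, and the optional Kryo setting are illustrative choices, not requirements.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: persist RDDs under each of the three storage options.
object StorageOptionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("storage-options-sketch")
      .setMaster("local[*]")
      // Kryo is an optional, generally more space-efficient alternative to the
      // default Java serialization used by the serialized storage levels.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)

    // In memory as deserialized Java objects: fastest to read, largest footprint.
    val deserialized = numbers.map(_ * 2L).persist(StorageLevel.MEMORY_ONLY)

    // In memory as serialized data: smaller footprint, extra CPU to deserialize.
    val serialized = numbers.map(_ * 3L).persist(StorageLevel.MEMORY_ONLY_SER)

    // On disk: slowest, but works when partitions do not fit in RAM.
    val onDisk = numbers.map(_ * 4L).persist(StorageLevel.DISK_ONLY)

    // The first action on each RDD materializes its persisted partitions.
    println(deserialized.count())
    println(serialized.count())
    println(onDisk.count())

    sc.stop()
  }
}
```

Spark also offers combined levels such as MEMORY_AND_DISK and MEMORY_AND_DISK_SER, which keep as much as possible in memory and spill only the partitions that do not fit to disk.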
The persist() function in the RDD class lets the user control how the RDD is stored. By default, persist() stores an RDD as deserialized objects in memory, but the user can pass one of the numerous storage options to the persist() function to control how the RDD is stored. We will cover the different options for RDD reuse in "Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files". When persisting RDDs, the default implementation of RDDs evicts the least recently used partition (called LRU caching) if the space it takes is required to compute or to cache a new partition. However, you can change this behavior and control Spark's memory prioritization with the persistencePriority() function in the RDD class.
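As a small illustration of the default and explicit forms of persist(), the sketch below uses a made-up word list and variable names, and sticks to the standard persist()/unpersist() calls; the chosen storage level (MEMORY_AND_DISK_SER) is only an example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative sketch of persist() defaults, an explicit storage level,
// and releasing cached data once it is no longer needed.
object PersistDefaultsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-defaults-sketch").setMaster("local[*]"))

    val words = sc
      .parallelize(Seq("spark keeps data in memory", "spark reuses cached partitions"))
      .flatMap(_.split(" "))

    // With no arguments, persist() stores deserialized objects in memory
    // (StorageLevel.MEMORY_ONLY, the same level cache() uses).
    words.persist()

    // Pass an explicit level to change that, e.g. serialized in memory with
    // spill to disk for any partitions that do not fit.
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(counts.count())                 // first action materializes the cached partitions
    println(counts.take(3).mkString(", "))  // later actions reuse them without recomputation

    // Release the cached partitions explicitly when they are no longer needed.
    counts.unpersist()
    words.unpersist()
    sc.stop()
  }
}
```

The point of the sketch is that the storage level is chosen per RDD at persist() time, while eviction among cached partitions is still governed by the LRU policy described above.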