What is new in Spark 3.0

This write up talks about Apache Spark 3.0 features and improvements. Apache Spark 3.0 is a major release and currently available in preview mode. Release 3.0.0 is a major release bundled with new features, and performance improvement besides deprecation of some of the old APIs.

Apache Spark 3.0 is also 17 times faster compared with current versions on the TPCDS benchmark which is quite impressive.

Apache Spark 3.0 Preview was released on 2019-Nov-08 and available for experimentation since then.

Apache Spark 3.0 Feature List

Language Support
Adaptive Execution
Binary files data source
Dynamic Partition Pruning
DataSource V2 Improvements
Enhanced Support for Deep Learning
Better Kubernetes Integration
Graph Feature
ACID Transaction with Delta Lake
Growing integration with Apache Arrow data format
YARN Features

Programming Language Support

Apache Spark 3.0 runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+.
Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.
Python 2 and Python 3 prior to version 3.6 support is deprecated as of Spark 3.0.0.
R prior to version 3.4 support is deprecated as of Spark 3.0.0.
For the Scala API, Spark 3.0.0-preview uses Scala 2.12.
You will need to use a compatible Scala version (2.12.x).

DPP : Dynamic Partition Pruning

Apache Spark 3.0 brought the concept of Dynamic Partition Pruning (DPP) which is a key performance improvement for SQL analytics workloads that in term can make integration with BI tools much better.

The concept behind Dynamic Partition Pruning (DPP) is to apply the filter set on the dimension table — mostly small and used in a broadcast hash join — directly on the fact table so it could skip scanning unneeded partitions.

Spark SQL: Adaptive Execution

This feature helps in where statistics on the data source do not exists or are in accurate. So far Spark had some optimizations which could be set only in the planning phase and according to the statistics of data (e.g. the ones captured by the ANALYZE command when deciding weather to perform a Broadcast-hash join over an expensive Sort-merge join. In cases in which these statistics are missing or not accurate BHJ might not kick in. with adaptive execution in Spark 3.0 spark can examine that data at runtime once he had loaded it and opt-in to BHJ at runtime even it could not detect it on the planning phase.

Graph features

Graph processing can be used in data science for several application including recommendation engine and fraud detections.

ACID Transactions with Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark 3.0 and through easy implementation and upgrading of existing Spark applications, it brings reliability to Data Lakes.

Binary files data source

Apache Spark 3.0 supports binary file data source. You can use it like this:

val df = spark.read.format(“binaryFile”)

The above will read binary files and converts each one to a single row that contains the raw content and metadata of the file. The DataFrame will contain the following columns and possibly partition columns:

path: StringType
modificationTime: TimestampType
length: LongType
content: BinaryType

writing back a binary DataFrame/RDD is currently not supported.

Check out the Apache Spark 3.0.0 docs for more info on how to get the most out of Spark.

28 Apr 2020

« Snowflake Data Warehouse Glossary PySpark FAQ »

Topper Tips