PySpark Introduction

What is PySpark (is it pySpark or PySpark)?

Is it a new programming language or just another library? Or is it something very different from Apache Spark?

In the recent past, PySpark (spelled PySpark, not pySpark) has become a popular choice for many data engineers and machine learning engineers, and it has become mainstream technology for big data and machine learning projects.

PySpark is not a programming language. It is a wrapper (or abstraction) that helps Python developers write Spark code for distributed computing.

Many Python libraries and frameworks carry "Py" as a prefix or suffix, such as NumPy, PyTorch or SciPy. However, not every library does; Pandas and Scikit-Learn are examples.

Apache Spark is developed in Scala (a functional programming language that runs on the Java Virtual Machine). To make Apache Spark more accessible to the larger developer community, the creators of Spark added a Python API and an R API, so Python data developers and machine learning engineers can also build data and machine learning solutions on the Apache Spark platform.

Databricks, the company founded by the creators of Apache Spark, also provides a Community Edition which allows developers to write Spark jobs in Jupyter-style notebooks and supports the Scala, Python and R programming languages, besides shell scripting and markdown cells.

Is PySpark a programming language?

PySpark is not a programming language; it is an interface built for Python developers. As a developer, you write Python code using the Spark libraries and build data and machine learning solutions. When executed, your PySpark code (for example data-processing.py) does not do the heavy computation in Python itself; it drives the Spark engine, which runs on the JVM (Java Virtual Machine) as Java bytecode. If you write the same logic in Scala or Java, it is likewise compiled to Java bytecode before it is executed. Since Python or R code has to go through one additional layer, it is sometimes considered a little slow; however, that is not the truth. Translating the Python API calls into the underlying JVM operations is a thin, one-time step and happens very fast. When working with Spark DataFrames, execution is equally efficient irrespective of which programming language you have used.
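
For illustration, here is a minimal sketch of what such a data-processing.py might look like; the file orders.csv and its column names are hypothetical. Whether you express this logic in Python, Scala or Java, Spark builds the same kind of query plan and executes it on the JVM engine.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-processing").getOrCreate()

# Read a hypothetical orders.csv; the file and column names are illustrative only.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Keep valid orders and the columns we care about.
cleaned = (orders
           .filter(F.col("order_value") > 0)
           .select("order_id", "order_date", "order_value"))

# Python only describes the plan; the Spark engine on the JVM executes it.
cleaned.show()

spark.stop()
```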

Why PySpark & Why Not Pandas?

PySpark (or Apache Spark) is a distributed computing engine, whereas Pandas is a popular Python package/library for data science. Pandas offers powerful, expressive and flexible data structures that make data manipulation and analysis easy. However, Pandas is not distributed and runs on a single machine.

Apache Spark is a distributed computing engine that follows and extends the philosophy of Hadoop and the MapReduce paradigm. Apache Spark primarily performs its computation in memory, and that is a key differentiator between Apache Spark and the MapReduce programming world. Whether it is MapReduce or Apache Spark, both follow the divide and conquer approach to bring in the parallelism that speeds up data workloads. Let's elaborate using a simple example.

Let's assume you have to calculate the total order value from an order data source that has 50 million records. Computing it on a single machine (say a laptop with 8 GB of RAM) may take a few minutes. However, if you want to compute the total order value in a few seconds (if not milliseconds), you have to write advanced parallel code that partitions the data (slices it into logical chunks) and computes the partial results.

Parallelism also works only if your computer (laptop, desktop or server) has multiple processors/cores. If you have only one processor with one core, even parallelism may not be effective. PySpark (Apache Spark) is built to handle distributed computing on a large cluster of machines (a cluster is a group of machines, called nodes, that share the work by taking advantage of their local processors and cores).

Now, if you write a very simple sum function, PySpark internally follows the divide and conquer approach and does the same computation in a much shorter duration. All the overhead of dividing the data set, managing failures (if any) and finally coordinating across the different machines to produce the summary is taken care of by PySpark. If you have a simple computation that can be achieved using Pandas, you don't need PySpark or Apache Spark.
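
As a rough sketch of the order-value example above (the path /data/orders and the column order_value are placeholders, not from any real data set), the whole distributed sum comes down to a single aggregation call:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("total-order-value").getOrCreate()

# Hypothetical order data; could be 50 million rows or more.
orders = spark.read.parquet("/data/orders")

# One aggregation call: Spark splits the data into partitions, sums each
# partition on its executor, retries any failed tasks, and combines the
# partial sums into the final total.
total = orders.agg(F.sum("order_value").alias("total_order_value")).first()[0]
print(total)

spark.stop()
```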

So if you are building a data solution that deals with a small data set, and your machine/server (compute engine) is able to perform those computations (reading data from disk or over the network, cleaning it, joining/aggregating and writing back to disk), you can use Pandas or any other utility. But if you have to perform the same operations over millions or billions of rows in the same or less time, you should consider a solution like Spark, which lets you focus on building your solution rather than writing distributed computing code.
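
Here is a rough side-by-side sketch of the same group-and-sum written in Pandas and in PySpark; the file sales.csv and its region/amount columns are assumed purely for illustration:

```python
# Pandas: fine when the data fits comfortably on one machine.
import pandas as pd

pdf = pd.read_csv("sales.csv")
print(pdf.groupby("region")["amount"].sum())

# PySpark: the same logical operation, distributed across a cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-by-region").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
sdf.groupBy("region").agg(F.sum("amount").alias("total_amount")).show()
spark.stop()
```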

Is PySpark difficult to learn?

PySpark (Apache Spark) is a distributed computation engine for solving large data set problems. As every package or library has its own syntax and coding style, so does PySpark, and it is certainly not hard to learn.

There are three main concepts (SparkConf, SparkContext and transformations/actions) that a developer must understand to master PySpark (Apache Spark). Many of the things you can do in Pandas can be done in PySpark too, and you must learn the syntax to achieve them. However, certain operations need to be done differently, keeping distributed computing in mind. For example, computing an average over a distributed data set is quite different from a sum() or count() operation, because per-partition averages cannot simply be averaged together. Hence learning PySpark is easy, but understanding different data engineering or machine learning use cases and writing a program that runs effectively takes some time.
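
The sketch below illustrates those three concepts with a tiny in-memory RDD (the numbers and application name are arbitrary), including why an average has to be derived from a global sum and a global count rather than from per-partition averages:

```python
from pyspark import SparkConf, SparkContext

# SparkConf describes the application; SparkContext is the entry point.
conf = SparkConf().setAppName("average-example").setMaster("local[*]")
sc = SparkContext(conf=conf)

# A tiny in-memory RDD split across 3 partitions (numbers are arbitrary).
values = sc.parallelize([4, 8, 15, 16, 23, 42], numSlices=3)

# Transformations are lazy: map() only records what should happen.
doubled = values.map(lambda x: x * 2)

# Actions trigger distributed jobs: sum() and count() each run across partitions.
total = doubled.sum()
count = doubled.count()

# The average is derived from a global sum and a global count, not by
# averaging per-partition averages of unequally sized partitions.
print(total, count, total / count)

sc.stop()
```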

Official PySpark Documentation

The Apache Spark community has created a vast amount of documentation with many useful examples elaborating different use cases. Find a list of key documentation below:

  1. PySpark API Reference
  2. PySpark SQL Reference with Dataframes
  3. Apache Spark Quick Start
  4. PySpark SQL Programming Reference
  5. Apache Spark Scala API Reference
  6. Apache Spark Deployment Overview

Additional PySpark Resource & Reading Material

PySpark Frequently Asked Questions

Refer to our PySpark FAQ space, where important queries and information are clarified. It also links to important PySpark tutorial pages within the site.

PySpark Example Code

Find our GitHub repository, which lists PySpark examples with code snippets.

PySpark/Spark Related Interesting Blogs

Here is a list of informative blogs and related articles which you might find interesting:

  1. PySpark Frequently Asked Questions
  2. Apache Spark Introduction
  3. How Spark Works
  4. PySpark Installation on Windows 10
  5. PySpark Jupyter Notebook Configuration On Windows
  6. PySpark Tutorial
  7. Apache Spark 3.0 Release Notes (Preview)
  8. PySpark Complete Guide