pySpark Interactive Shell Installation on Windows

pySpark Interactive Shell installation on a Windows machine is a fairly easy and straightforward task, but setting up the pySpark interactive shell needs certain software to be in place first. This tutorial gives you step-by-step instructions to set it up.

pySpark Interactive Shell on Windows Prerequisites

The first thing to check is that your Windows installation has a working Java version. That is very easy to verify: type java -version at a command prompt and the version pops up. Here we have a 1.8 runtime environment; you might have something slightly different, and that is fine. As long as you have Java, you are good to go.
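For example, the check might look like this (the exact version and build numbers below are only illustrative and will differ on your machine):

java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)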

JDK or JRE (Java Runtime Environment) Installation verification for pySpark

The next thing to do is to make sure you have a Python installation on your Windows machine; WinPython is a good choice. To make sure you have the right Python version, run python --version and you get your current version number. Anywhere near 3.6 should be fine, and 3.7 is even better.
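For example (the version shown below is just an assumption; yours will reflect whatever you installed):

python --version
Python 3.6.5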

Python installation verification for pySpark

Download Spark or pySpark

To download Spark (pySpark ships inside it), all you need to do is go to the Spark home page and click on Download. There you can choose a Spark release (2.3.2) and a package type, which determines the Hadoop version you are going to need (pre-built for Apache Hadoop 2.7 and later). Once selected, just click the "Download Spark" link to fetch the files from the internet. The download is a zipped archive, which you need to unzip into a folder on your local drive.
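On a recent Windows 10 build, which ships with a built-in tar command, extraction from the command prompt could look like the sketch below; on older Windows versions a tool such as 7-Zip works just as well. The download location and the c:\ target folder are only assumptions:

cd %USERPROFILE%\Downloads
tar -xzf spark-2.3.2-bin-hadoop2.7.tgz -C c:\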

So let's first navigate to the relevant folder (or local drive) where the file is unzipped, then run the pyspark command in the bin folder. As you can see, there are some warnings and quite a bit of informational output, but the important thing is that Spark shows up, greets us with "Welcome to Spark", and reports version 2.3. Let's now look at how we can resolve these errors.
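Assuming the archive was unzipped to c:\spark-2.3.2-bin-hadoop2.7 (adjust the path to wherever you extracted it), the commands are:

cd c:\spark-2.3.2-bin-hadoop2.7
bin\pyspark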

Unzip Spark binaries and run \bin\pyspark command
pySpark Interactive Shell with Welcome Screen

Hadoop Winutils Utility for pySpark

One of the issues the console shows is that pySpark reports an I/O exception from the underlying Java library, saying that it could not locate the winutils executable. This executable is a mock program which mimics the Hadoop distributed file system on a Windows machine.

The next thing to do is to create a hadoop folder inside the Spark path, and then a bin folder inside it. So we now have, inside our Spark path, a hadoop folder which in turn contains a bin folder. Into this bin folder we are going to download the relevant winutils executable from GitHub, and we need to download it in a version that is consistent with the Hadoop version we are using, in this case Hadoop 2.7.

Go inside your Spark folder and create <spark-folder>\hadoop\bin
Download winutils.exe and save it under <spark-folder>\hadoop\bin
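From a command prompt, both folders can be created in one step, since mkdir on Windows creates intermediate folders automatically; winutils.exe itself still has to be downloaded manually from GitHub. The path below assumes the unzip location used earlier:

mkdir c:\spark-2.3.2-bin-hadoop2.7\hadoop\bin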

SPARK_HOME & HADOOP_HOME Environment Variables

When you execute the <spark-folder>\bin\pyspark.bat file, it tries to find two environment variables in the Windows operating system: SPARK_HOME and HADOOP_HOME.

If you have admin privileges on your Windows machine, you can set those variables through the Advanced System Settings dialog, or you can open a command prompt and set them using the set command.

Environment Variables using Advanced System Settings
Environment Variables Using Command Line (CLI)
set HADOOP_HOME=c:\spark-2.3.2-bin-hadoop2.7\hadoop
set SPARK_HOME=c:\spark-2.3.2-bin-hadoop2.7

Note that HADOOP_HOME must point to the hadoop folder itself, not its bin subfolder, because winutils.exe is looked up under %HADOOP_HOME%\bin.
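The set command only lasts for the current command prompt session. To persist the variables for your user account without going through Advanced System Settings, the built-in setx command can be used (open a new command prompt afterwards so the change is picked up):

setx HADOOP_HOME c:\spark-2.3.2-bin-hadoop2.7\hadoop
setx SPARK_HOME c:\spark-2.3.2-bin-hadoop2.7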

Now start \bin\pyspark.bat again and your interactive shell appears without any errors.

pySpark Interactive Shell with Hadoop winutils.exe
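As a quick sanity check that the shell is wired up correctly, you can run a trivial job at the >>> prompt; both the spark session and the sc context come predefined in the pySpark shell (the outputs below are what these toy examples should return):

>>> spark.range(10).count()
10
>>> sc.parallelize([1, 2, 3]).sum()
6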

To exit the pySpark interactive shell, type exit() at the >>> prompt.

For a complete pySpark walkthrough, you can follow the pySpark tutorial guide.