PySpark Data Frame Rearrange Columns

Rearranging columns in a PySpark or Spark (Scala) data frame is not a difficult job. Both provide convenient APIs for arranging columns to match the expected output. In this article, we will explore the different alternatives and the best approach to rearranging columns using the PySpark API.

We will primarily use Spark SQL and the data frame API, with examples, to demonstrate how easily you can rearrange columns and get the expected output. The Spark SQL approach may be the simplest way to rearrange columns, although it can be a little verbose if the data frame has many columns. Whichever approach you follow, Spark SQL, the data frame API, or something more inventive, you can always inspect the plan with the explain function and validate the performance of your approach.

Example Data Set

We will use a simple data set for a better understanding of the code. Let us first start by creating the data frame, and then list the expectation for rearranging the columns.

# import the types needed to define the schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# create a sample list of rows and a matching schema
tips_list = [
    [1, "item 1", 2],
    [2, "item 2", 3],
    [1, "item 3", 2],
    [3, "item 4", 5]]
tips_schema = StructType(
    [StructField("Item_ID", IntegerType(), True),
     StructField("Item_Name", StringType(), True),
     StructField("Quantity", IntegerType(), True)])

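A data frame built from this list and schema has its columns in the order Item_ID, Item_Name, Quantity. The expectation is to rearrange them so that Item_Name comes first, followed by Item_ID and Quantity.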

Rearrange Column Using SparkSQL

To get the most out of Spark SQL, you can register the data frame as a table and then write a simple select statement that projects the necessary columns in the sequence that meets the expectation. It is a common myth that Spark SQL is less efficient than the data frame API. This is wrong: both go through Spark's Catalyst optimizer, which generates the final physical plan after proper optimization, and the Tungsten engine then executes that plan. In most cases you do not have to worry about the performance of your Spark job execution.

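# get a Spark session (the pyspark shell already provides one as `spark`)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# create the data frame using the Spark session
tips_df = spark.createDataFrame(tips_list, tips_schema)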

#register as table
tips_df.createOrReplaceTempView("tips_table")

tips_new_df = spark.sql("select Item_Name, Item_ID, Quantity from tips_table")

tips_new_df.printSchema()

The printed schema now lists the columns in the new order: Item_Name, Item_ID, Quantity.
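For a wide table, typing every column name into the select statement by hand gets verbose. One possible way around this is to build the statement from a Python list of column names. The sketch below is only an illustration: `build_select` is a hypothetical helper, not part of the PySpark API, and it assumes the column names contain no backticks.

```python
def build_select(table, columns):
    """Return a SELECT statement that projects `columns` in the given order.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Backticks are Spark SQL's identifier quotes. This sketch assumes the
    # column names themselves contain no backticks.
    quoted = ", ".join("`{}`".format(name) for name in columns)
    return "SELECT {} FROM {}".format(quoted, table)


# Possible usage with the temp view registered earlier:
#   new_order = ["Item_Name", "Item_ID", "Quantity"]
#   tips_new_df = spark.sql(build_select("tips_table", new_order))
print(build_select("tips_table", ["Item_Name", "Item_ID", "Quantity"]))
```

The list of names can come from tips_df.columns, reordered however you need, so the statement never has to be typed out column by column.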

Rearrange Column Using Dataframe

If your project does not allow you to use Spark SQL, you can use the data frame API instead and arrange the columns to meet the expectation.

#create dataframe using spark session
tips_df = spark.createDataFrame(tips_list, tips_schema)

#use the select function from the dataframe
tips_new_df = tips_df.select("Item_Name","Item_ID","Quantity")

#this will also generate the same schema
tips_new_df.printSchema()

Possible Analysis Exception Error

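There are cases where you end up with an exception like the one below. It usually means that two columns in the data frame share the same name, for example after a join, so Spark cannot tell which one a reference points to. Check whether your approach is correct and use the printSchema() function to validate the column names.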

pyspark.sql.utils.AnalysisException: u"Reference 'id' is ambiguous, could be: id#609, id#1224.;"
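Before selecting columns by name, it can help to check whether any name occurs more than once. The following is a minimal sketch in plain Python. `duplicate_columns` is a hypothetical helper, and the case-insensitive comparison assumes Spark's default setting, under which column names are resolved without regard to case.

```python
from collections import Counter


def duplicate_columns(columns, case_sensitive=False):
    """Return the column names that occur more than once, sorted.

    Hypothetical helper for illustration; not part of PySpark.
    Names are compared case-insensitively unless case_sensitive is True.
    """
    keys = [name if case_sensitive else name.lower() for name in columns]
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


# Possible usage on a joined data frame: duplicate_columns(joined_df.columns)
print(duplicate_columns(["id", "name", "ID", "qty"]))  # ['id']
```

If the list is not empty, rename or drop one of the clashing columns before rearranging.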

Alternative Ways to Rearrange Columns
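Instead of typing the full column list, the new order can also be computed from df.columns and passed to select. The sketch below shows two such variants. `move_to_front` and `sort_columns` are hypothetical helpers, not PySpark functions. They only rely on the `columns` attribute and the `select` method of a data frame, and they compare names exactly as written.

```python
def move_to_front(df, *first):
    """Select the `first` columns first, then the rest in their original order."""
    missing = [name for name in first if name not in df.columns]
    if missing:
        raise ValueError("unknown columns: {}".format(missing))
    rest = [name for name in df.columns if name not in first]
    return df.select(*first, *rest)


def sort_columns(df):
    """Select all columns in alphabetical order."""
    return df.select(*sorted(df.columns))


# Possible usage with the sample data frame:
#   tips_new_df = move_to_front(tips_df, "Item_Name")
#   tips_sorted_df = sort_columns(tips_df)
```

Both helpers end in a plain select, so they behave like the hand-written data frame example and differ only in how the column list is produced.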

Explain the Execution Plan for Rearranging Columns

If you execute df.explain(True) on your data frame, you will see the logical and physical plans that show how this data frame is populated.
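To compare approaches side by side, the plans of several data frames can be printed one after another. This is a small sketch: `explain_all` is a hypothetical helper that only calls the explain method on each data frame it receives.

```python
def explain_all(**frames):
    """Print the extended plan of each data frame under its label."""
    for label, df in frames.items():
        print("=== {} ===".format(label))
        # explain(True) prints the logical plans and the physical plan
        df.explain(True)


# Possible usage with the two approaches from this article:
#   explain_all(
#       sql=spark.sql("select Item_Name, Item_ID, Quantity from tips_table"),
#       api=tips_df.select("Item_Name", "Item_ID", "Quantity"),
#   )
```

Reading the printed physical plans next to each other is a quick way to check whether two approaches lead to the same work at execution time.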

Performance Impact

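There is no performance impact no matter which approach you follow: the Spark SQL query and the data frame select both go through the same optimizer and end up as a simple projection of the same columns.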

Source Code in GitHub

Get the source code from our git repository

Additional PySpark Resource & Reading Material

PySpark Frequently Asked Questions

Refer to our PySpark FAQ space, where important queries and information are clarified. It also links to important PySpark tutorial pages within the site.

PySpark Examples Code

Find our GitHub repository, which lists PySpark examples with code snippets.


03 May 2020
