PySpark Data Frame Rearrange Columns
Rearranging columns in a PySpark or Spark (Scala) data frame should not be a difficult job. Both provide plenty of convenient APIs to arrange columns to match the expected output. In this article, we will explore the different alternatives and the best approach to rearranging columns using the PySpark API.
We will use both Spark SQL and the DataFrame API, with examples, to show how easily you can rearrange columns and get the expected output. The Spark SQL approach is arguably the simplest, though it can become verbose when the data frame has many columns. Whether you follow the Spark SQL way, the DataFrame way, or something more inventive, you can always inspect the plan Spark generates with the explain function and compare the approaches.
Example Data Set
We will use a simple data set to make the code easier to follow. We will first create a data frame and then list the column order we expect after rearranging.
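The original sample data is not reproduced here, so the following is only a minimal sketch with made-up rows. The column names (`emp_id`, `first_name`, `last_name`, `city`, `salary`) and the helper `build_example_df` are assumptions for illustration. It builds a small data frame and records the column order we want to end up with.

```python
# Hypothetical sample data; the column names and rows are illustrative only.
COLUMNS = ["salary", "city", "last_name", "first_name", "emp_id"]
ROWS = [
    (52000, "Pune", "Rao", "Asha", 1),
    (61000, "Delhi", "Singh", "Vikram", 2),
    (48000, "Mumbai", "Shah", "Neha", 3),
]

# The order we would like the columns to appear in after rearranging.
EXPECTED_ORDER = ["emp_id", "first_name", "last_name", "city", "salary"]


def build_example_df(spark):
    """Create the sample data frame from an existing SparkSession."""
    return spark.createDataFrame(ROWS, COLUMNS)


# Usage (requires a running SparkSession named `spark`):
#   df = build_example_df(spark)
#   df.printSchema()
#   df.show()
```

The expected order contains exactly the same columns as the input, only in a different sequence, which is what every approach below has to produce.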
Now the result will appear like this
Rearrange Column Using SparkSQL
To get the most out of Spark SQL, you can register the data frame as a table and then write a simple SELECT statement that projects the necessary columns in the sequence you expect. It is a common myth that Spark SQL is less efficient than the DataFrame API. Both go through the Catalyst optimizer, which generates the final physical plan after optimisation, and that plan is then run by the Tungsten execution engine. In most cases you do not have to worry about the performance of your Spark job when choosing between the two.
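As a rough sketch of the Spark SQL route, the helpers below register a data frame as a temporary view and select the columns in the desired order. The function names, the view name `employees`, and the column names in the usage comment are assumptions for illustration, not the article's original code.

```python
def build_select_sql(view_name, ordered_columns):
    """Build a SELECT statement that projects columns in the given order."""
    # Backticks quote identifiers in Spark SQL, which keeps names that
    # contain spaces or clash with reserved words usable.
    # Names that themselves contain a backtick are not handled here.
    projection = ", ".join(f"`{c}`" for c in ordered_columns)
    return f"SELECT {projection} FROM `{view_name}`"


def rearrange_with_sql(spark, df, ordered_columns, view_name="employees"):
    """Register df as a temporary view and reorder its columns via SQL."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(build_select_sql(view_name, ordered_columns))


# Usage (requires a SparkSession `spark` and a data frame `df`):
#   reordered = rearrange_with_sql(spark, df, ["emp_id", "city", "salary"])
#   reordered.show()
```

Building the statement from a list avoids hand-typing a long projection when the data frame is wide, which is where plain SQL tends to get verbose.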
Now the result will appear like this
Rearrange Column Using Dataframe
If your project does not allow you to use Spark SQL, you can use the DataFrame API instead and arrange the columns to meet the expectation.
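With the DataFrame API, reordering usually comes down to a single `select` call with the columns listed in the order you want. The sketch below wraps that in a small helper; the helper name and the column names in the usage comment are assumptions for illustration.

```python
def rearrange_with_select(df, ordered_columns):
    """Return a new data frame whose columns follow ordered_columns."""
    # select() returns a new data frame; the original df is left unchanged.
    return df.select(*ordered_columns)


# Usage (requires a data frame `df` with these hypothetical columns):
#   reordered = rearrange_with_select(df, ["emp_id", "first_name", "salary"])
#   reordered.printSchema()
```

Any column left out of the list is dropped from the result, so include every column you want to keep.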
Possible Analysis Exception Error
There are cases where you end up with an AnalysisException, typically because a column name in your select cannot be resolved. Check that your approach is correct, and use the printSchema() function to validate the column names.
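One way to catch a bad column name early is to compare the requested columns against `df.columns` before calling `select`. This is only a sketch: the helper names are hypothetical, and it raises a plain `ValueError` rather than reproducing Spark's own exception.

```python
def find_missing_columns(available, requested):
    """Return the requested column names that are not in available."""
    # This comparison is case-sensitive, which is stricter than Spark's
    # default (case-insensitive) column resolution.
    available_set = set(available)
    return [c for c in requested if c not in available_set]


def safe_select(df, ordered_columns):
    """Select columns in order, failing early with a readable message."""
    missing = find_missing_columns(df.columns, ordered_columns)
    if missing:
        raise ValueError(
            f"Columns not found: {missing}; available: {df.columns}"
        )
    return df.select(*ordered_columns)


# Usage (requires a data frame `df`):
#   reordered = safe_select(df, ["emp_id", "salary"])
```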
Alternative Ways to Rearrange Columns
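Two alternatives that come up often are moving a few columns to the front while leaving the rest in place, and sorting the columns alphabetically. Both only compute a new list of names that is then passed to `select`. The helper names and column names below are assumptions for illustration.

```python
def move_to_front(all_columns, front):
    """Put `front` first and keep the remaining columns in original order."""
    front_set = set(front)
    return list(front) + [c for c in all_columns if c not in front_set]


def alphabetical(all_columns):
    """Return the column names sorted alphabetically."""
    return sorted(all_columns)


# Usage (requires a data frame `df`):
#   df.select(*move_to_front(df.columns, ["emp_id"]))
#   df.select(*alphabetical(df.columns))
```

Deriving the list from `df.columns` means you do not have to type out every column name when only one or two need to move.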
Explain the Execution Plan for Rearrange Columns for Dataframe
If you call df.explain(True) on your data frame, Spark prints the logical and physical plans, which show how the data frame is computed.
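As a small sketch, the helper below reorders the columns and prints the extended plan for the result. The helper name is hypothetical, and no plan output is shown here because it depends on your data source and Spark version.

```python
def explain_rearrangement(df, ordered_columns):
    """Reorder columns and print the plans Spark generates for the result."""
    reordered = df.select(*ordered_columns)
    # Passing True requests the extended output: the logical plans
    # followed by the physical plan.
    reordered.explain(True)
    return reordered


# Usage (requires a data frame `df`):
#   explain_rearrangement(df, ["emp_id", "city", "salary"])
```

Running the same check on a data frame produced by a Spark SQL query lets you compare the two plans side by side.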
Performance Impact
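Reordering columns is only a projection, and both Spark SQL and the DataFrame API are planned by the same optimizer, so in practice you should see no meaningful performance difference between the approaches. Compare the explain output if you want to confirm this for your own job.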
Source Code in GitHub
Get the source code from our Git repository.
Additional PySpark Resource & Reading Material
PySpark Frequently Asked Questions
Refer to our PySpark FAQ space, where important queries and information are clarified. It also links to important PySpark tutorial pages within the site.
PySpark Examples Code
Find our GitHub repository, which lists PySpark examples with code snippets.
PySpark/Spark Related Interesting Blogs
Here is a list of informative blogs and related articles that you might find interesting.