Column Pivot in PySpark

Hey, welcome back!

This post is about pivoting a column in PySpark. In simpler terms, it is about splitting a column of categorical values into multiple dummy-variable columns, much like one-hot encoding: each distinct value of the column becomes a column of its own. The best part is that it can be done in a single chained statement!

Let's say we have a classroom dataset loaded as a DataFrame classroom_data, with a Gender column whose values are Male / Female. We want to split this column into two columns, Gender_Male and Gender_Female, each holding 1 or 0. The following snippet achieves this:
from pyspark.sql import functions as F

# Build a helper column holding the future column name (e.g. 'Gender_Male'),
# then pivot on it so each distinct value becomes its own 1/0 column.
# We group by the row identifier (assumed here to be 'StudentId').
final_df = (classroom_data
    .withColumn('ccol', F.concat(F.lit('Gender_'), classroom_data['Gender']))
    .groupby('StudentId')
    .pivot('ccol')
    .agg(F.first(F.lit(1)))
    .fillna(0))
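To make that concrete, here is a minimal, self-contained sketch; the toy data and the StudentId column are assumptions for illustration, not part of the original dataset:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny stand-in for the classroom dataset.
classroom_data = spark.createDataFrame(
    [(1, 'Male'), (2, 'Female'), (3, 'Male')],
    ['StudentId', 'Gender'])

dummies = (classroom_data
    .withColumn('ccol', F.concat(F.lit('Gender_'), classroom_data['Gender']))
    .groupby('StudentId')
    .pivot('ccol')
    .agg(F.first(F.lit(1)))
    .fillna(0))

dummies.show()
# Prints something like:
# +---------+-------------+-----------+
# |StudentId|Gender_Female|Gender_Male|
# +---------+-------------+-----------+
# |        1|            0|          1|
# |        2|            1|          0|
# |        3|            0|          1|
# +---------+-------------+-----------+

Note that the pivoted columns come out in sorted order of the distinct values, and fillna(0) turns the nulls for absent combinations into zeros.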

Combining with GroupBy

Pivoting also pairs nicely with aggregation. For example, say we have a column "Age_Group" with values 1, 2 and 3, and we want to know the number of boys and girls in every age group. The lines below do just that:
# Count students per (Age_Group, Gender) pair, then pivot Gender into
# 'NumOfMale' / 'NumOfFemale' columns, leaving one row per age group.
age_gender_counts = classroom_data.groupby('Age_Group', 'Gender').count()
final_df = (age_gender_counts
    .withColumn('ccol', F.concat(F.lit('NumOf'), age_gender_counts['Gender']))
    .groupby('Age_Group')
    .pivot('ccol')
    .agg(F.first('count'))
    .fillna(0))
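Again as a sketch, here is the same pipeline run end to end on toy data (reusing the spark session and the F import from the example above; the data itself is an assumption for illustration):

classroom_data = spark.createDataFrame(
    [(1, 'Male'), (1, 'Female'), (1, 'Female'), (2, 'Male')],
    ['Age_Group', 'Gender'])

age_gender_counts = classroom_data.groupby('Age_Group', 'Gender').count()
per_group = (age_gender_counts
    .withColumn('ccol', F.concat(F.lit('NumOf'), age_gender_counts['Gender']))
    .groupby('Age_Group')
    .pivot('ccol')
    .agg(F.first('count'))
    .fillna(0))

per_group.show()
# Prints something like:
# +---------+-----------+---------+
# |Age_Group|NumOfFemale|NumOfMale|
# +---------+-----------+---------+
# |        1|          2|        1|
# |        2|          0|        1|
# +---------+-----------+---------+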
