Column Pivot in PySpark
Hey, welcome back!
This post is about pivoting a column in PySpark. Put simply, it is about splitting a column of categorical values into multiple dummy-variable columns (similar to one-hot encoding): each distinct value of the column is transposed into its own column. The best part is that this can be done in just one line!
Let's say we have a Classroom dataset, loaded as the DataFrame classroom_data, with a column Gender that contains the values Male and Female. We intend to split this column into two columns, Gender_Male and Gender_Female, each containing 1 or 0. This can be achieved by the following line:
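from pyspark.sql import functions as F
# Group by StudentId (assumed to be unique per student) so that every student keeps a row; count gives 1 where the gender matches and fillna(0) fills in the rest.
final_RDD = classroom_data.withColumn('ccol', F.concat(F.lit('Gender_'), classroom_data['Gender'])).groupby('StudentId').pivot('ccol').agg(F.count('Gender')).fillna(0)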
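To make the target shape concrete, here is a small plain-Python sketch (not PySpark) that mimics the reshaping described above on a few made-up rows. The sample data and the helper name one_hot_pivot are hypothetical, and StudentId is assumed to identify each row uniquely.

```python
# Plain-Python illustration of the intended pivot result (this is not PySpark).
# The rows below are made-up sample data.
rows = [
    {"StudentId": 1, "Gender": "Male"},
    {"StudentId": 2, "Gender": "Female"},
    {"StudentId": 3, "Gender": "Female"},
]


def one_hot_pivot(rows, key, column, prefix):
    """Turn each distinct value of `column` into its own 0/1 column."""
    # Sort the distinct values so the new columns come out in a stable order.
    new_columns = [prefix + value for value in sorted({r[column] for r in rows})]
    result = []
    for r in rows:
        out = {key: r[key]}
        for name in new_columns:
            # 1 in the column matching this row's value, 0 everywhere else.
            out[name] = 1 if name == prefix + r[column] else 0
        result.append(out)
    return result


pivoted = one_hot_pivot(rows, "StudentId", "Gender", "Gender_")
for row in pivoted:
    print(row)
```

Each input row keeps its StudentId and gains one Gender_ column per distinct value, which is the layout the pivot is meant to produce.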
Combining with GroupBy
Pivoting also comes in handy when we wish to group by certain columns and then pivot to find, say, the number of boys and girls per group. For example, let's say we have a column "Age_Group" with the values 1, 2 and 3, and we wish to know the number of boys and girls in every age group. The snippet below does just that:
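# Count the students for every (Age_Group, Gender) pair first.
user_age_group = classroom_data.groupby('Age_Group', 'Gender').count()
# Then pivot those counts so that every age group gets one row with a NumOf<Gender> column per gender.
final_RDD = user_age_group.withColumn('ccol', F.concat(F.lit('NumOf'), user_age_group['Gender'])).groupby('Age_Group').pivot('ccol').agg(F.first('count')).fillna(0)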
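As a rough illustration of the per-group counts described above, the plain-Python sketch below (again not PySpark) computes the number of boys and girls in every age group for a few made-up rows. The sample data and the helper name count_pivot are hypothetical.

```python
from collections import Counter

# Made-up sample data: one row per student.
rows = [
    {"StudentId": 1, "Age_Group": 1, "Gender": "Male"},
    {"StudentId": 2, "Age_Group": 1, "Gender": "Female"},
    {"StudentId": 3, "Age_Group": 1, "Gender": "Female"},
    {"StudentId": 4, "Age_Group": 2, "Gender": "Male"},
]


def count_pivot(rows, group_col, pivot_col, prefix):
    """Return one row per group with a count column per distinct pivot value."""
    counts = Counter((r[group_col], r[pivot_col]) for r in rows)
    values = sorted({r[pivot_col] for r in rows})
    groups = sorted({r[group_col] for r in rows})
    # A Counter returns 0 for pairs it never saw, which plays the role of fillna(0).
    return [
        {group_col: g, **{prefix + v: counts[(g, v)] for v in values}}
        for g in groups
    ]


per_group = count_pivot(rows, "Age_Group", "Gender", "NumOf")
for row in per_group:
    print(row)
```

Age group 2 has no girls in this sample, so its NumOfFemale cell is 0 rather than missing, which is what the fillna(0) step is for.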