PySpark add multiple columns based on categories from the other column

I have a dataset that looks like this:

id | category | value
---+----------+------
1  | a        | 3
2  | a        | 3
3  | a        | 3
3  | b        | 1
4  | a        | 1
4  | b        | abc

The output I want is:

id | category_a | category_b
---+------------+-----------
1  | 3          | null
2  | 3          | null
3  | 3          | 1
4  | 1          | abc

That is, I want to group by id, pivot on category, and create one dummy column per category value.
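
For reference, a minimal sketch of how the sample data could be created (the SparkSession setup is assumed; value is a string column because it mixes numbers and text like "abc"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# value is stored as a string since it contains both numeric and non-numeric entries
df = spark.createDataFrame(
    [(1, "a", "3"), (2, "a", "3"), (3, "a", "3"),
     (3, "b", "1"), (4, "a", "1"), (4, "b", "abc")],
    ["id", "category", "value"],
)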

How can I transform the input to the expected output?

My approach is:

pivoted_df = df.groupBy("id") \
        .pivot("category") \
        .agg(F.lit(F.col("value")))

But I got this error:

pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;

Update: The value column also contains non-numeric values.

For the category column, each id has at most one row per category, and only the two categories a and b occur.

Solution:

df = df.groupBy('id').pivot('category').agg(F.first('value'))
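
F.first is an aggregate expression, which is what pivot requires; since each id has at most one row per category, first() simply returns that single value. On the sample data above the result should look like this (a sketch; the row order returned by show() may differ):

df.show()
# +---+---+----+
# | id|  a|   b|
# +---+---+----+
# |  1|  3|null|
# |  2|  3|null|
# |  3|  3|   1|
# |  4|  1| abc|
# +---+---+----+

Note that pivot names the new columns after the category values themselves (a and b). To get the category_a / category_b headers from the question, you can rename them afterwards, for example:

df = df.withColumnRenamed("a", "category_a").withColumnRenamed("b", "category_b")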
