PySpark add multiple columns based on categories from the other column

I have a dataset that looks like this:

id | category | value
---+----------+------
1  | a        | 3
2  | a        | 3
3  | a        | 3
3  | b        | 1
4  | a        | 1
4  | b        | abc

The output I want is:

id | category_a | category_b
---+------------+-----------
1  | 3          | null
2  | 3          | null
3  | 3          | 1
4  | 1          | abc

That is, I want to group by id, pivot on category, and create one dummy column per category value.
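
For reference, a minimal sketch of how the sample data could be created (the SparkSession setup is assumed; value is a string column because it mixes numbers and text like "abc"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# value is stored as a string since it contains both numeric and non-numeric entries
df = spark.createDataFrame(
    [(1, "a", "3"), (2, "a", "3"), (3, "a", "3"),
     (3, "b", "1"), (4, "a", "1"), (4, "b", "abc")],
    ["id", "category", "value"],
)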

How can I transform the input to the expected output?

My approach is:

pivoted_df = df.groupBy("id") \
        .pivot("category") \
        .agg(F.lit(F.col("value")))

But I got this error:

pyspark.sql.utils.AnalysisException: Aggregate expression required for pivot, but '`value`' did not appear in any aggregate function.;

Update: The value column also contains non-numeric values.

For the category column, each id has at most one row per category, and only the two categories a and b occur.

Solution:

df = df.groupBy('id').pivot('category').agg(F.first('value'))
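
F.first is an aggregate expression, which is what pivot requires; since each id has at most one row per category, first() simply returns that single value. On the sample data above the result should look like this (a sketch; the row order returned by show() may differ):

df.show()
# +---+---+----+
# | id|  a|   b|
# +---+---+----+
# |  1|  3|null|
# |  2|  3|null|
# |  3|  3|   1|
# |  4|  1| abc|
# +---+---+----+

Note that pivot names the new columns after the category values themselves (a and b). To get the category_a / category_b headers from the question, you can rename them afterwards, for example:

df = df.withColumnRenamed("a", "category_a").withColumnRenamed("b", "category_b")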
