I want to group by the Identifiant column, count the occurrences of each state, and show every state as its own column.
| Identifiant | state |
|---|---|
| ID01 | NY |
| ID02 | NY |
| ID01 | CA |
| ID03 | CA |
| ID01 | CA |
| ID03 | NY |
| ID01 | NY |
| ID01 | CA |
| ID01 | NY |
I’d like to obtain this dataset:
| Identifiant | NY | CA |
|---|---|---|
| ID01 | 3 | 3 |
| ID02 | 1 | 0 |
| ID03 | 1 | 1 |
> Solution:
Group by `Identifiant` and pivot the `state` column, then fill the missing combinations with 0:

```python
result = (
    df.groupBy("Identifiant")
      .pivot("state")
      .count()
      .na.fill(0)
)

result.show()
# +-----------+---+---+
# |Identifiant| CA| NY|
# +-----------+---+---+
# |       ID03|  1|  1|
# |       ID01|  3|  3|
# |       ID02|  0|  1|
# +-----------+---+---+
```