Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

PySpark : How to aggregate on a column with count of the different

I want to aggregate on the Identifiant column with count of different state and represent all the state.

Identifiant state
ID01 NY
ID02 NY
ID01 CA
ID03 CA
ID01 CA
ID03 NY
ID01 NY
ID01 CA
ID01 NY

I’d like to obtain this dataset:

Identifiant NY CA
ID01 3 3
ID02 1 0
ID03 1 1

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Group by Identifiant and pivot State column:

from pyspark.sql import functions as F

result = (df.groupBy("Identifiant")
          .pivot("State")
          .count().na.fill(0)
          )

result.show()
#+-----------+---+---+
#|Identifiant| CA| NY|
#+-----------+---+---+
#|       ID03|  1|  1|
#|       ID01|  3|  3|
#|       ID02|  0|  1|
#+-----------+---+---+
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading