I want to group by the Identifiant column, count the occurrences of each state, and show every state as its own column.
| Identifiant | state |
|---|---|
| ID01 | NY |
| ID02 | NY |
| ID01 | CA |
| ID03 | CA |
| ID01 | CA |
| ID03 | NY |
| ID01 | NY |
| ID01 | CA |
| ID01 | NY |
I’d like to obtain this dataset:
| Identifiant | NY | CA |
|---|---|---|
| ID01 | 3 | 3 |
| ID02 | 1 | 0 |
| ID03 | 1 | 1 |
> Solution:
Group by `Identifiant` and pivot the `state` column, then fill the missing combinations with 0:

```python
result = (
    df.groupBy("Identifiant")
      .pivot("state")
      .count()
      .na.fill(0)
)

result.show()
# +-----------+---+---+
# |Identifiant| CA| NY|
# +-----------+---+---+
# |       ID03|  1|  1|
# |       ID01|  3|  3|
# |       ID02|  0|  1|
# +-----------+---+---+
```