How to get the required DataFrame after a PySpark pivot?

I have a Spark DataFrame:

+------+----------------+-------+------+---------+
|name  |           pDate| status|user  |userCount|
+------+----------------+-------+------+---------+
|janani|2022-03-04      |active |qwe123|1        |
|raj   |2022-03-03      |active |qwe123|1        |
|ram   |2022-03-01      |active |qwe123|1        |
|ram   |2022-03-02      |active |qwe123|1        |
+------+----------------+-------+------+---------+

When I pivot the DataFrame with the following code:

pvtcountuser = countuser.groupby('pDate','name').pivot('pDate').max('userCount').fillna(0)

I get:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |0         |0         |0         |
|2022-03-02      |ram   |0         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

but the required output is:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

Please help with this.

>Solution:

Since you want one record per name, the group by should be done on name only. The pivot then produces null for any date column with no matching row, which is why fillna(0) is applied.

# recreate the sample data from the question
sql_stmt = '''
select *
from   values    ('janani',date '2022-03-04','active','qwe123',1)
                ,('raj'   ,date '2022-03-03','active','qwe123',1)
                ,('ram'   ,date '2022-03-01','active','qwe123',1)
                ,('ram'   ,date '2022-03-02','active','qwe123',1) as t (name,pDate,status,user,userCount)
'''

countuser = spark.sql(sql_stmt)

# group by name only, pivot on pDate, and fill the missing combinations with 0
pvtcountuser = countuser.groupby('name').pivot('pDate').max('userCount').fillna(0)
pvtcountuser.show()

+------+----------+----------+----------+----------+
|  name|2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+------+----------+----------+----------+----------+
|   raj|         0|         0|         1|         0|
|   ram|         1|         1|         0|         0|
|janani|         0|         0|         0|         1|
+------+----------+----------+----------+----------+
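If the pDate column from the question's expected output is also needed, a minimal sketch (assuming the earliest date per name is the one to keep; first_date is an illustrative name) is to aggregate that date separately and join it back onto the pivoted result:

from pyspark.sql import functions as F

# earliest pDate per name (assumption: the expected output keeps the first date)
first_date = countuser.groupBy('name').agg(F.min('pDate').alias('pDate'))

# attach the date column to the pivoted counts
first_date.join(pvtcountuser, on='name').show()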