How to filter a DataFrame by group so only flagged households remain in PySpark

I have a DataFrame with the columns Household, Person, and flag, and I would like to filter it down to the households that contain at least one flagged row. I understand the logic but am not sure how to code it; can someone help? For the example below, the output would remove household 2.

logic:
df = df.filter(all rows in households where at least one row in that household contains 'flag'==1)

Example dataframe:
| Household| Person|flag|
| -------- | ----- | -- |
| 1        | Oliver|    |
| 1        | Jonny | 1  | 
| 2        | David |    |
| 2        | Mary  |    |
| 3        | Lizzie|    |
| 3        | Peter | 1  |
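Before reaching for Spark, the intended logic can be sketched in plain Python over the example rows above (the list-of-dicts representation here is purely illustrative, with `None` standing in for a blank flag):

```python
# Example rows from the table above, as plain dicts.
rows = [
    {"Household": 1, "Person": "Oliver", "flag": None},
    {"Household": 1, "Person": "Jonny",  "flag": 1},
    {"Household": 2, "Person": "David",  "flag": None},
    {"Household": 2, "Person": "Mary",   "flag": None},
    {"Household": 3, "Person": "Lizzie", "flag": None},
    {"Household": 3, "Person": "Peter",  "flag": 1},
]

# Step 1: find households where at least one member has flag == 1.
flagged = {r["Household"] for r in rows if r["flag"] == 1}

# Step 2: keep every row belonging to a flagged household.
kept = [r for r in rows if r["Household"] in flagged]
```

Household 2 has no flagged member, so both of its rows drop out, while all rows of households 1 and 3 survive.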


Solution:

Filter on flag to get the distinct flagged Households, then inner join back to the original DataFrame to get the final result.

df.join(df.filter("flag = '1'").select('Household').distinct(), ['Household'], 'inner').show()

+---------+------+----+
|Household|Person|flag|
+---------+------+----+
|        1|Oliver|null|
|        1| Jonny|   1|
|        3|Lizzie|null|
|        3| Peter|   1|
+---------+------+----+