How to filter a DataFrame by group so only flagged households remain in PySpark

I have a DataFrame with the columns Household, Person, and flag, and I would like to filter it down to the households that contain at least one flagged row. I understand the logic but am not sure how to code it; can someone help? For the example below, the output would remove household 2.

logic:
df = df.filter(all rows in households where at least one row in that household contains 'flag'==1)

Example dataframe:
| Household| Person|flag|
| -------- | ----- | -- |
| 1        | Oliver|    |
| 1        | Jonny | 1  | 
| 2        | David |    |
| 2        | Mary  |    |
| 3        | Lizzie|    |
| 3        | Peter | 1  |
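Before reaching for Spark, the intended logic can be sketched in plain Python over the example rows above (the list-of-dicts representation here is purely illustrative, with `None` standing in for a blank flag):

```python
# Example rows from the table above, as plain dicts.
rows = [
    {"Household": 1, "Person": "Oliver", "flag": None},
    {"Household": 1, "Person": "Jonny",  "flag": 1},
    {"Household": 2, "Person": "David",  "flag": None},
    {"Household": 2, "Person": "Mary",   "flag": None},
    {"Household": 3, "Person": "Lizzie", "flag": None},
    {"Household": 3, "Person": "Peter",  "flag": 1},
]

# Step 1: find households where at least one member has flag == 1.
flagged = {r["Household"] for r in rows if r["flag"] == 1}

# Step 2: keep every row belonging to a flagged household.
kept = [r for r in rows if r["Household"] in flagged]
```

Household 2 has no flagged member, so both of its rows drop out, while all rows of households 1 and 3 survive.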


Solution:

Filter on flag to get the distinct flagged Households, then inner join back to the original DataFrame to get the final result.

df.join(df.filter("flag = '1'").select('Household').distinct(), ['Household'], 'inner').show()

+---------+------+----+
|Household|Person|flag|
+---------+------+----+
|        1|Oliver|null|
|        1| Jonny|   1|
|        3|Lizzie|null|
|        3| Peter|   1|
+---------+------+----+