I have a df with the columns, ‘households, people, flag’, and would like to filter the dataframe to households that contain at least one flag. I understand the logic but am not sure how to code it, can someone help? For the example below, the output would remove household 2.
logic:
df = df.filter(all rows in households where at least one row in that household contains 'flag'==1)
Example dataframe:
| Household| Person|flag|
| -------- | ----- | -- |
| 1 | Oliver| |
| 1 | Jonny | 1 |
| 2 | David | |
| 2 | Mary | |
| 3 | Lizzie| |
| 3 | Peter | 1 |
>Solution :
Filter and groupBy to get the desired Household and do inner join to get the final reault.
df.join(df.filter("flag = '1'").select('Household').distinct(), ['Household'], 'inner').show()
+---------+------+----+
|Household|Person|flag|
+---------+------+----+
| 1|Oliver|null|
| 1| Jonny| 1|
| 3|Lizzie|null|
| 3| Peter| 1|
+---------+------+----+