I have a DataFrame:
data = [['p1', 0, 'dog'], ['p2', 0, 'dog'], ['p5', 1, 'dog'], ['p6', 1, 'cat'],
        ['p7', 1, 'dog'], ['p8', 2, 'cat'], ['p3', 2, 'cat'], ['p4', 2, 'cat'],
        ['p12', 3, 'cat'], ['p9', 3, 'cat'], ['p10', 3, 'dog'], ['p11', 3, 'dog'],
        ['p13', 4, 'cat']]
sdf = spark.createDataFrame(data, schema=['id', 'rank', 'group'])
sdf.show()
+---+----+-----+
| id|rank|group|
+---+----+-----+
| p1| 0| dog|
| p2| 0| dog|
| p5| 1| dog|
| p6| 1| cat|
| p7| 1| dog|
| p8| 2| cat|
| p3| 2| cat|
| p4| 2| cat|
|p12| 3| cat|
| p9| 3| cat|
|p10| 3| dog|
|p11| 3| dog|
|p13| 4| cat|
+---+----+-----+
There are duplicates in both the rank and the group column. For each rank that has multiple rows, I want to remove all of its rows when the group values are all identical (ranks 0 and 2 above). If a rank has more than one distinct group value, its rows should be kept, even when some of those values repeat (rank 3). A rank with only a single row (rank 4) should also be kept.
In the end, the final DataFrame should be:
+---+----+-----+
| id|rank|group|
+---+----+-----+
| p5| 1| dog|
| p6| 1| cat|
| p7| 1| dog|
|p12| 3| cat|
| p9| 3| cat|
|p10| 3| dog|
|p11| 3| dog|
|p13| 4| cat|
+---+----+-----+
>Solution:
You can create another DataFrame, apply your criteria to it, and then join it back to the first DataFrame on the rank column to keep only the desired rows:
from pyspark.sql.functions import col, count, countDistinct

# ranks to keep: more than one distinct group, or only a single row
tmpDf = sdf.groupby("rank") \
    .agg(countDistinct("group").alias("group"), count("rank").alias("count")) \
    .filter((col("group") > 1) | (col("count") == 1)) \
    .select("rank")
result = tmpDf.join(sdf, ["rank"], "left")
result.show()
+----+---+-----+
|rank| id|group|
+----+---+-----+
| 1| p7| dog|
| 1| p6| cat|
| 1| p5| dog|
| 3|p11| dog|
| 3|p10| dog|
| 3| p9| cat|
| 3|p12| cat|
| 4|p13| cat|
+----+---+-----+
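The keep/drop criterion can also be sanity-checked in plain Python, without a Spark session. A minimal sketch (variable names are illustrative) that applies the same rule per rank:

```python
from collections import defaultdict

data = [['p1', 0, 'dog'], ['p2', 0, 'dog'], ['p5', 1, 'dog'], ['p6', 1, 'cat'],
        ['p7', 1, 'dog'], ['p8', 2, 'cat'], ['p3', 2, 'cat'], ['p4', 2, 'cat'],
        ['p12', 3, 'cat'], ['p9', 3, 'cat'], ['p10', 3, 'dog'], ['p11', 3, 'dog'],
        ['p13', 4, 'cat']]

# collect the group values seen for each rank
groups_by_rank = defaultdict(list)
for _id, rank, group in data:
    groups_by_rank[rank].append(group)

# keep a rank if it has more than one distinct group, or only a single row
kept_ranks = {r for r, gs in groups_by_rank.items()
              if len(set(gs)) > 1 or len(gs) == 1}

result = [row for row in data if row[1] in kept_ranks]
print(kept_ranks)                # {1, 3, 4}
print([r[0] for r in result])    # ['p5', 'p6', 'p7', 'p12', 'p9', 'p10', 'p11', 'p13']
```

This mirrors the Spark aggregation: `len(set(gs)) > 1` plays the role of `countDistinct("group") > 1`, and `len(gs) == 1` plays the role of `count("rank") == 1`.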