Keep duplicates with a condition in pyspark

I have a dataframe:

data = [['p1', 0, 'dog'], ['p2', 0, 'dog'], ['p5', 1, 'dog'], ['p6', 1, 'cat'], ['p7', 1, 'dog'], ['p8', 2, 'cat'],['p3', 2, 'cat'], ['p4', 2, 'cat'],  ['p12', 3, 'cat'], 
       ['p9', 3, 'cat'], ['p10', 3, 'dog'], ['p11', 3, 'dog'], ['p13', 4, 'cat']]
sdf = spark.createDataFrame(data, schema = ['id', 'rank', 'group'])
sdf.show()
+---+----+-----+
| id|rank|group|
+---+----+-----+
| p1|   0|  dog|
| p2|   0|  dog|
| p5|   1|  dog|
| p6|   1|  cat|
| p7|   1|  dog|
| p8|   2|  cat|
| p3|   2|  cat|
| p4|   2|  cat|
|p12|   3|  cat|
| p9|   3|  cat|
|p10|   3|  dog|
|p11|   3|  dog|
|p13|   4|  cat|
+---+----+-----+

I have duplicates in the rank column and in the group column. I want to remove the rows where, for a given rank, all the group values are identical. If there is still any difference (even with duplicates) in the group values for a rank, those rows should be kept.

In the end, the final dataframe will be

+---+----+-----+
| id|rank|group|
+---+----+-----+
| p5|   1|  dog|
| p6|   1|  cat|
| p7|   1|  dog|
|p12|   3|  cat|
| p9|   3|  cat|
|p10|   3|  dog|
|p11|   3|  dog|
|p13|   4|  cat|
+---+----+-----+

Solution:

You can create another dataframe that applies your criteria, then join it back to the original dataframe on the rank column to keep only the desired rows:

tmpDf = sdf.groupby("rank").agg(countDistinct("group").alias("group"), count("rank").alias("count")) \
    .filter((col("group") > 1) | (col("count") == 1)).select("rank")
result = tmpDf.join(sdf, ["rank"], "left")
result.show()

+----+---+-----+
|rank| id|group|
+----+---+-----+
|   1| p7|  dog|
|   1| p6|  cat|
|   1| p5|  dog|
|   3|p11|  dog|
|   3|p10|  dog|
|   3| p9|  cat|
|   3|p12|  cat|
|   4|p13|  cat|
+----+---+-----+
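
Alternatively, here is a minimal sketch of the same idea using a window function instead of a join, assuming your Spark version supports collect_set as a window aggregate; size(collect_set(...)) stands in for a distinct count per rank:

from pyspark.sql import Window
from pyspark.sql import functions as F

# One window per rank; no ordering needed for these aggregates
w = Window.partitionBy("rank")
result = sdf \
    .withColumn("n_groups", F.size(F.collect_set("group").over(w))) \
    .withColumn("n_rows", F.count(F.lit(1)).over(w)) \
    .filter((F.col("n_groups") > 1) | (F.col("n_rows") == 1)) \
    .drop("n_groups", "n_rows")
result.show()

This variant preserves the original column order (id, rank, group) and avoids the join, at the cost of a shuffle for the window partition.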
