What would be a Pyspark equivalent of the SQL statement NOT IN

April 4, 2022

What would be the equivalent code in PySpark?

If I have table A and Table B, and I want to select certain ID from Table A which is not in Table B, I can do the following SQL command:

Select ID
from Table A
where ID not in (Select ID from Table B)

What would be the equivalent code in PySpark?

>Solution :

You could do an "anti-join" with the option "anti":

A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+


B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  1|name1|  10|
# |  2|name2|  30|
# |  3|name3|  20|
# +---+-----+----+


B_df.join(A_df, B_df.type == A_df.id, "anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+

Thsi would be equivalent to select * from B_df where type not in (select id from A_df