What would be the equivalent code in PySpark?
If I have table A and Table B, and I want to select certain ID from Table A which is not in Table B, I can do the following SQL command:
Select ID
from Table A
where ID not in (Select ID from Table B)
What would be the equivalent code in PySpark?
>Solution :
You could do an "anti-join" with the option "anti":
A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+
B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 1|name1| 10|
# | 2|name2| 30|
# | 3|name3| 20|
# +---+-----+----+
B_df.join(A_df, B_df.type == A_df.id, "anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 2|name2| 30|
# +---+-----+----+
Thsi would be equivalent to select * from B_df where type not in (select id from A_df