I have a Spark dataframe:
> numbers_df
+----+-----------+-----------+-----------+-------------------------------------+
| id | num_1| num_2| num_3| all_num|
+----+-----------+-----------+-----------+-------------------------------------+
| 1| [1, 2, 5]| [4, 7]| [8, 3]| [1, 2, 3, 4, 5, 6, 7, 8, 9]|
| 2| [12, 13]| [10, 16]| [15, 17]| [10, 11, 12, 13, 14, 15, 16, 17, 18]|
+----+-----------+-----------+-----------+-------------------------------------+
I need to remove the values of num_1, num_2 and num_3 from the all_num column.
Expected result:
| id | num_1 | num_2 | num_3 | all_num | except_num |
|---|---|---|---|---|---|
| 1 | [1, 2, 5] | [4, 7] | [8, 3] | [1, 2, 3, 4, 5, 6, 7, 8, 9] | [6, 9] |
| 2 | [12, 13] | [10, 16] | [15, 17] | [10, 11, 12, 13, 14, 15, 16, 17, 18] | [11, 14, 18] |
How can this be done in PySpark, given that the array_except function only takes two columns as input?
>Solution :
You can combine the array_except and concat functions: concat merges the three arrays into a single array, and array_except then keeps only the values of all_num that do not appear in that combined array.

import pyspark.sql.functions as F

df = df.withColumn('except_num', F.array_except('all_num', F.concat('num_1', 'num_2', 'num_3')))
df.show(truncate=False)
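Conceptually, array_except is an order-preserving set difference that also drops duplicates from the result. A minimal plain-Python sketch of what the expression computes per row (the helper name array_except here is just illustrative, not the Spark function itself):

```python
def array_except(left, right):
    # Keep the elements of `left` that are absent from `right`,
    # preserving order and dropping duplicates, which mirrors
    # Spark's array_except semantics.
    seen = set(right)
    out = []
    for x in left:
        if x not in seen:
            seen.add(x)  # also deduplicates the result
            out.append(x)
    return out

# Row 1 from the example dataframe:
all_num = [1, 2, 3, 4, 5, 6, 7, 8, 9]
combined = [1, 2, 5] + [4, 7] + [8, 3]  # what F.concat produces
print(array_except(all_num, combined))  # [6, 9]
```

This is per-row logic only; in Spark the same computation runs on the executors as a native SQL expression, so no Python UDF is needed.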