Home How to check if at least one element of a list is included in a list column?

Questions

How to check if at least one element of a list is included in a list column?

September 12, 2022

Some mock data:

test_data = [('1', '[tech, fx]'),
             ('2', '[industry, computer]'),
             ('3', '[5G, Apple]')]
test_data = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])

I wan’t to create a new column that indicates if the words [‘fx’, ‘computer’] can be found in the column called ‘text’:

Desired output:

result = [('1', '[tech, fx]', '1'),
          ('2', '[industry, computer]', '1'),
          ('3', '[5G, Apple]', '0')]
result = spark.sparkContext.parallelize(result).toDF(['id', 'text', 'indicator'])

>Solution :

You can do it using higher-order function exists

from pyspark.sql import functions as F

arr_col = F.split(F.expr("TRIM(BOTH '[]' FROM text)"), ", ?")
bln_col = F.exists(arr_col, lambda x: x.isin(['fx', 'computer']))
result = test_data.withColumn('indicator', bln_col.cast('int'))

result.show()
# +---+--------------------+---------+
# | id|                text|indicator|
# +---+--------------------+---------+
# |  1|          [tech, fx]|        1|
# |  2|[industry, computer]|        1|
# |  3|         [5G, Apple]|        0|
# +---+--------------------+---------+

Since your "text" column was in string format, the arr_col expression transforms it to an array first. You need an array, as exists only accepts array type columns.

trim removes the symbols [] from both ends of the string. split splits on the delimiter which is a regex pattern , ?, which means that it splits on every comma , which could (or could not) be followed by one space.