Some mock data:
test_data = [('1', '[tech, fx]'),
             ('2', '[industry, computer]'),
             ('3', '[5G, Apple]')]
test_data = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])
I want to create a new column that indicates whether any of the words ['fx', 'computer'] can be found in the column called 'text':
Desired output:
result = [('1', '[tech, fx]', '1'),
          ('2', '[industry, computer]', '1'),
          ('3', '[5G, Apple]', '0')]
result = spark.sparkContext.parallelize(result).toDF(['id', 'text', 'indicator'])
>Solution :
You can do it using the higher-order function exists (available in the Python API as F.exists since Spark 3.1):
from pyspark.sql import functions as F
arr_col = F.split(F.expr("TRIM(BOTH '[]' FROM text)"), ", ?")
bln_col = F.exists(arr_col, lambda x: x.isin(['fx', 'computer']))
result = test_data.withColumn('indicator', bln_col.cast('int'))
result.show()
# +---+--------------------+---------+
# | id|                text|indicator|
# +---+--------------------+---------+
# |  1|          [tech, fx]|        1|
# |  2|[industry, computer]|        1|
# |  3|         [5G, Apple]|        0|
# +---+--------------------+---------+
Since your "text" column is a string, the arr_col expression converts it to an array first. This is necessary because exists only accepts array-type columns.
TRIM(BOTH '[]' FROM text) removes the characters [ and ] from both ends of the string. split then splits on the regex pattern ", ?", that is, on every comma optionally followed by one space. Finally, exists returns true if at least one array element satisfies the lambda, and cast('int') turns that boolean into 1 or 0.
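If it helps to see the trim, split and exists steps in isolation, here is a rough plain-Python sketch of the same logic that runs without a Spark session. The helper names parse_text and indicator are hypothetical and exist only for this illustration. It assumes that Python's str.strip and re.split behave like Spark's TRIM and split for inputs shaped like the mock data.

```python
import re

def parse_text(text):
    # Mimics TRIM(BOTH '[]' FROM text): drop '[' and ']' characters from both ends
    trimmed = text.strip('[]')
    # Mimics split(..., ', ?'): split on a comma optionally followed by one space
    return re.split(r', ?', trimmed)

def indicator(text, keywords=('fx', 'computer')):
    # Mimics exists(...).cast('int'): 1 if any element is one of the keywords, else 0
    return int(any(word in keywords for word in parse_text(text)))

rows = ['[tech, fx]', '[industry, computer]', '[5G, Apple]']
indicators = [indicator(t) for t in rows]
print(indicators)
```

Note that the match is on whole elements, not substrings, so an element such as 'fxx' would not count as a hit for 'fx'. The Spark version with isin behaves the same way.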