Extract words from the text in Pyspark Dataframe

I have dataframe:

d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
    {'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20, 31]}]
s = spark.createDataFrame(d)

+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text                                                                                                                        |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31]  |Mom called dad, and when he came home, he took moms car and drove to the store                                              |
+----------+----------------------------------------------------------------------------------------------------------------------------+

I need to extract a substring from the text column using the bounds in the begin_end column, like text[111:120+1]. In pandas, this can be done via zip:

df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]

result:


    begin_end     new_col
0   [111, 120]  jumps bad
1   [20, 31]    when he came

How can I rewrite this zip approach in PySpark to get new_col? Do I need to write a UDF for this?

>Solution:

You can do this by using substring in a SQL expression. It takes the string you want to substring, a starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions does not accept a column as the starting position or length, while the SQL form does.

Note that SQL's substr uses 1-based positions, while your begin_end values are 0-based indices; hence the + 1 on the starting position.

import pyspark.sql.functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()

+----------+--------------------+------------+
| begin_end|                text|     new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...|   jumps bad|
|  [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
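The off-by-one handling is the easy part to get wrong here: Python slices are 0-based and end-exclusive, while SQL's substr is 1-based and takes a length. A plain-Python sanity check of the conversion used in the expression above (the `sql_substr` helper is only an illustration of substr's semantics, not a PySpark API):

```python
def sql_substr(s: str, pos: int, length: int) -> str:
    """Mimic SQL substr(str, pos, len): 1-based start position, fixed length."""
    return s[pos - 1:pos - 1 + length]

text = 'Mom called dad, and when he came home, he took moms car and drove to the store'
begin, end = 20, 31  # 0-based, inclusive bounds, as stored in begin_end

python_slice = text[begin:end + 1]                        # text[20:32]
sql_style = sql_substr(text, begin + 1, end - begin + 1)  # substr(text, 21, 12)

print(python_slice)               # when he came
print(sql_style == python_slice)  # True
```

The same arithmetic, begin + 1 for the position and end - begin + 1 for the length, is what the F.expr string passes to substr.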