Extract words from the text in Pyspark Dataframe

I have dataframe:

d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
    {'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20, 31]}]
s = spark.createDataFrame(d)

+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text                                                                                                                        |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31]  |Mom called dad, and when he came home, he took moms car and drove to the store                                              |
+----------+----------------------------------------------------------------------------------------------------------------------------+

I need to extract a substring from the text column using the bounds in the begin_end column, like text[111:120+1]. In pandas, this can be done via zip:

df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]

result:


    begin_end     new_col
0   [111, 120]  jumps bad
1   [20, 31]    when he came

How can I rewrite this zip approach in PySpark to get new_col? Do I need to write a UDF for this?

>Solution:

You can do this by using substring in a SQL expression. It takes the string you want to substring, a starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions does not accept a column as the starting position or length, while the SQL form does.

Note that SQL's substr uses 1-based positions, while your begin_end values are 0-based indices; hence the + 1 on the starting position.

import pyspark.sql.functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()

+----------+--------------------+------------+
| begin_end|                text|     new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...|   jumps bad|
|  [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
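The off-by-one handling is the easy part to get wrong here: Python slices are 0-based and end-exclusive, while SQL's substr is 1-based and takes a length. A plain-Python sanity check of the conversion used in the expression above (the `sql_substr` helper is only an illustration of substr's semantics, not a PySpark API):

```python
def sql_substr(s: str, pos: int, length: int) -> str:
    """Mimic SQL substr(str, pos, len): 1-based start position, fixed length."""
    return s[pos - 1:pos - 1 + length]

text = 'Mom called dad, and when he came home, he took moms car and drove to the store'
begin, end = 20, 31  # 0-based, inclusive bounds, as stored in begin_end

python_slice = text[begin:end + 1]                        # text[20:32]
sql_style = sql_substr(text, begin + 1, end - begin + 1)  # substr(text, 21, 12)

print(python_slice)               # when he came
print(sql_style == python_slice)  # True
```

The same arithmetic, begin + 1 for the position and end - begin + 1 for the length, is what the F.expr string passes to substr.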