How to merge a list of words in a PySpark DataFrame?

I have a DataFrame with a column that contains a list of words, and I need to merge each list into a single sentence.

Dataframe:

temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (2, ['Data-science', 'is','cool']),
    (3, ['Machine','learning'])
], ["id", "words"])

# +---+------------------------+
# |id |words                   |
# +---+------------------------+
# |0  |[Julia, is, awesome]    |
# |2  |[Data-science, is, cool]|
# |3  |[Machine, learning]     |
# +---+------------------------+

temp.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)

I tried mapping over the underlying RDD and joining the words in each row:

rdd_df = temp.rdd.map(lambda x: [x['id'], ' '.join(x['words'])])
spark.createDataFrame(rdd_df, temp.schema).show(10, False)

# +---+---------------------------------------------------------+
# |id |words                                                    |
# +---+---------------------------------------------------------+
# |0  |[ ' J u l i a ' ,   ' i s ' ,   ' a w e s o m e ' ]      |
# |2  |[ ' D a t a - s c i e n c e ' ,   ' i s ' , ' c o o l ' ]|
# |3  |[ ' M a c h i n e ' , ' l e a r n i n g ' ]              |
# +---+---------------------------------------------------------+

But the above code does not return the desired output. Is there another solution that does not use the RDD API?

Desired output:

+---+--------------------+
|id |words               |
+---+--------------------+
|0  |Julia is awesome    |
|2  |Data-science is cool|
|3  |Machine learning    |
+---+--------------------+

Solution:

If you have a list of words (an array of strings), you can combine them using array_join:

from pyspark.sql import functions as F
temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (1, ['Data-science', 'is','cool']),
    (2, ['Machine','learning'])
], ["id", "words"])

temp = temp.withColumn('words', F.array_join('words', ' '))

temp.show()
# +---+--------------------+
# | id|               words|
# +---+--------------------+
# |  0|    Julia is awesome|
# |  1|Data-science is cool|
# |  2|    Machine learning|
# +---+--------------------+