Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

PySpark: Get first character of each word in string

For an assignment I have been asked to shorten the names of clients to only the first letter of each name where they are separated by a space character.

I found a lot of solutions for this in Python, but I am not able to translate this to a dataframe.

The DF looks like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

| ID       | Name           | 
| -------- | -------------- |
| 1        | John Doe       |
| 2        | Roy Lee Winters|
| 3        | Mary-Kate Baron|

My desired output would be:

| ID | Name | Shortened_name| 
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |

I’ve had some result with the code below but this is not working when there are more then 2 names. I would also like to have some more ‘flexible’ code as some people have 4 or 5 names where others only have 1.

df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
  .withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
  .withColumn('Shortened_name', F.concat('col1', 'col2'))

>Solution :

You can split the Name column then use transform function on the resulting array to get first letter of each element:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])

df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)

df1.show()
# +---+---------------+--------------+
# | ID|           Name|Shortened_name|
# +---+---------------+--------------+
# |  1|       John Doe|            JD|
# |  2|Roy Lee Winters|           RLW|
# |  3|Mary-Kate Baron|            MB|
# +---+---------------+--------------+

Or by using aggregate function:

df1 = df.withColumn(
    "Shortened_name",
    F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading