Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to convert a column from string to array in PySpark

I have a dataframe converted from an inherited dataset which looks like the following:

data = [("[]","2000","M",False),
            ("[{'username':'aabb','points':'200','active':'true'}, {'username':'bbaa22','points':'0','active':'false'}]","1999","F",True),
            ("[{'username':'topuser','points':'855','active':'false'}]","1974","M",True),
            ("[]","2005","F",False),
            ("[{'username':'myprimary','points':'1050','active':'true'}, {'username':'mylurk','points':'100','active':'true'}, {'username':'closedlurk','points':'50','active':'false'}]","1992","M",True)
            ]

columns=["user","dob_year","gender","member"]
df=spark.createDataFrame(data,columns)
+--------------------+--------+------+------+
|                user|dob_year|gender|member|
+--------------------+--------+------+------+
|                  []|    2000|     M| false|
|[{'username':'aab...|    1999|     F|  true|
|[{'username':'top...|    1974|     M|  true|
|                  []|    2005|     F| false|
|[{'username':'myp...|    1992|     M|  true|
+--------------------+--------+------+------+

I need to extract some of the elements from the user column and I attempt to use the pyspark explode function.

from pyspark.sql.functions import explode

df2 = df.select(explode(df.user), df.dob_year)

When I attempt this, I’m met with the following error:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

AnalysisException: cannot resolve ‘explode(user)’ due to data type mismatch: input to function explode should be array or map type, not string;

When I run df.printSchema(), I realize that the user column is string, rather than list as desired.

I also attempted to cast the strings in the column to arrays by creating a UDF

import pyspark.sql.functions as f

df2 = df.withColumn("user", str2list_udf(f.col("user"))).withColumn("user",df.user.cast(ArrayType(StringType())))

When I do that, I’m met with the following error:

AnalysisException: cannot resolve ‘user‘ due to data type mismatch: cannot cast string to array;

How can the data in this column be cast or converted into an array so that the explode function can be leveraged and individual keys parsed out into their own columns (example: having individual columns for username, points and active)? If explode isn’t the best method, is there another route I should follow?

If it’s helpful, here is the UDF that I created:

def str2list(x):
  if x == '[]':
    return list()
  else:
    return list(x)

str2list_udf = udf(lambda x: str2list(x))

Thank you

>Solution :

Convert the stringified arrays into array of structs using from_json the explode the resulting array:

from pyspark.sql import functions as F
from pyspark.sql.types import StructField, ArrayType, StringType, StructType

user_schema = ArrayType(
    StructType([
        StructField("username", StringType(), True),
        StructField("points", StringType(), True),
        StructField("active", StringType(), True)
    ])
)

df1 = (df.withColumn("user", F.from_json("user", user_schema))
       .selectExpr("inline(user)", "dob_year", "gender", "member")
       )

df1.show()
#+----------+------+------+--------+------+------+
#|  username|points|active|dob_year|gender|member|
#+----------+------+------+--------+------+------+
#|      aabb|   200|  true|    1999|     F|  true|
#|    bbaa22|     0| false|    1999|     F|  true|
#|   topuser|   855| false|    1974|     M|  true|
#| myprimary|  1050|  true|    1992|     M|  true|
#|    mylurk|   100|  true|    1992|     M|  true|
#|closedlurk|    50| false|    1992|     M|  true|
#+----------+------+------+--------+------+------+
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading