Select values from MapType Column in UDF PySpark

I am trying to extract a value from a MapType column of a PySpark DataFrame inside a UDF.

Below is the PySpark dataframe:

+-----------+------------+-------------+
|CUSTOMER_ID|col_a       |col_b        |
+-----------+------------+-------------+
|    100    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    101    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    102    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    103    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    104    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    105    |{0.0 -> 1.0}| {0.2 -> 1.0}|
+-----------+------------+-------------+

df.printSchema()

# root
#  |-- CUSTOMER_ID: integer (nullable = true)
#  |-- col_a: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)
#  |-- col_b: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)

Below is the UDF:

@F.udf(T.FloatType())
def test(col):
    return col[1]

Below is the code:

df_temp = df_temp.withColumn('test', test(F.col('col_a')))

The UDF does not return the value stored in col_a when I pass the column to it. Can anyone explain why?

Solution:

It’s because your map has nothing at key 1. Indexing a map looks up a key, not a position, and the only key in col_a is 0.0.
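To see why, it may help to note that PySpark hands a MapType value to a Python UDF as an ordinary dict, so col[1] is a key lookup rather than a positional one. The following is a minimal sketch in plain Python, with a dict standing in for one row's col_a value; the variable names are illustrative and not from the question.

```python
# One row's col_a value as a Python UDF would receive it: a plain dict.
row_map = {0.0: 1.0}

# Indexing looks up a key. The only key is 0.0, so key 1 is missing.
try:
    row_map[1]
    found = True
except KeyError:
    found = False

# Looking up the key that exists returns the stored value.
value_at_zero = row_map[0.0]

# Positional access means going through the values explicitly.
first_value = next(iter(row_map.values()))

print(found, value_at_zero, first_value)
```

So inside the UDF, col[0.0] would reach the stored value, while col[1] asks for a key that was never there.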

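from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild a small frame with the same map as col_a: one entry, key 0.0 -> value 1.0.
df_temp = spark.createDataFrame([(100,),(101,),(102,)],['CUSTOMER_ID']) \
          .withColumn('col_a', F.create_map(F.lit(0.0), F.lit(1.0)))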
df_temp.show()
# +-----------+------------+
# |CUSTOMER_ID|       col_a|
# +-----------+------------+
# |        100|{0.0 -> 1.0}|
# |        101|{0.0 -> 1.0}|
# |        102|{0.0 -> 1.0}|
# +-----------+------------+

# Indexing a map column is a key lookup: key 0 exists, key 1 does not.
df_temp = df_temp.withColumn('col_a_0', F.col('col_a')[0])
df_temp = df_temp.withColumn('col_a_1', F.col('col_a')[1])

df_temp.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID|       col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# |        100|{0.0 -> 1.0}|    1.0|   null|
# |        101|{0.0 -> 1.0}|    1.0|   null|
# |        102|{0.0 -> 1.0}|    1.0|   null|
# +-----------+------------+-------+-------+
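
If a UDF is still needed, one possible fix is to look up a key that exists, and to use dict.get so that a missing key yields null instead of an error. The sketch below is an assumption about how that could look, not the only way to do it; lookup_key and make_lookup_udf are hypothetical names and not part of PySpark.

```python
def lookup_key(m, key=0.0):
    """Return the map's value for `key`, or None if the map or the key is missing."""
    if m is None:
        return None
    return m.get(key)


def make_lookup_udf(key=0.0):
    # pyspark is imported here so lookup_key stays usable without Spark.
    from pyspark.sql import functions as F, types as T

    # FloatType matches the schema in the question; DoubleType may fit
    # better if the map values are doubles.
    return F.udf(lambda m: lookup_key(m, key), T.FloatType())


# Hypothetical usage, with df_temp and F as in the question:
# df_temp = df_temp.withColumn('test', make_lookup_udf(0.0)(F.col('col_a')))
```

For a simple key lookup like this one, the column expression F.col('col_a')[0] shown above avoids the UDF entirely.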