I am trying to extract a value from a MapType column of a PySpark DataFrame inside a UDF.
Below is the PySpark DataFrame:
+-----------+------------+-------------+
|CUSTOMER_ID|col_a |col_b |
+-----------+------------+-------------+
| 100 |{0.0 -> 1.0}| {0.2 -> 1.0}|
| 101 |{0.0 -> 1.0}| {0.2 -> 1.0}|
| 102 |{0.0 -> 1.0}| {0.2 -> 1.0}|
| 103 |{0.0 -> 1.0}| {0.2 -> 1.0}|
| 104 |{0.0 -> 1.0}| {0.2 -> 1.0}|
| 105 |{0.0 -> 1.0}| {0.2 -> 1.0}|
+-----------+------------+-------------+
df.printSchema()
# root
#  |-- CUSTOMER_ID: integer (nullable = true)
#  |-- col_a: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)
#  |-- col_b: map (nullable = true)
#  |    |-- key: float
#  |    |-- value: float (valueContainsNull = true)
Below is the UDF:
from pyspark.sql import functions as F, types as T

@F.udf(T.FloatType())
def test(col):
    return col[1]
Below is the code that applies it:
df_temp = df_temp.withColumn('test', test(F.col('col_a')))
The test column comes back null instead of the value from col_a. Can anyone explain why?
> Solution:
It’s because a map column is indexed by key, not by position: `col[1]` looks up the key `1`, and your map’s only key is `0.0`, so nothing is found.
df_temp = spark.createDataFrame([(100,), (101,), (102,)], ['CUSTOMER_ID']) \
    .withColumn('col_a', F.create_map(F.lit(0.0), F.lit(1.0)))
df_temp.show()
# +-----------+------------+
# |CUSTOMER_ID| col_a|
# +-----------+------------+
# | 100|{0.0 -> 1.0}|
# | 101|{0.0 -> 1.0}|
# | 102|{0.0 -> 1.0}|
# +-----------+------------+
df_temp = df_temp.withColumn('col_a_0', F.col('col_a')[0])
df_temp = df_temp.withColumn('col_a_1', F.col('col_a')[1])
df_temp.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID| col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# | 100|{0.0 -> 1.0}| 1.0| null|
# | 101|{0.0 -> 1.0}| 1.0| null|
# | 102|{0.0 -> 1.0}| 1.0| null|
# +-----------+------------+-------+-------+