I have a field interest_product_id that looks like this:
a.select('cust_id', 'interest_product_id').show(1,False)
+---------------+----------------------------------------------+
|cust_id        |interest_product_id                           |
+---------------+----------------------------------------------+
|4308c3w994     |[[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72]]|
+---------------+----------------------------------------------+
The schema is as below –
root
|-- cust_id: string (nullable = true)
|-- interest_product_id: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Because interest_product_id is an array whose elements are themselves array(string), the field displays as [[...]]. How can I convert it to a plain array(string)?
Expected outcome –
+---------------+----------------------------------------------+
|cust_id        |interest_product_id                           |
+---------------+----------------------------------------------+
|4308c3w994     |[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72]  |
+---------------+----------------------------------------------+
Please suggest the best way. Thanks!!
Solution:
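Use flatten (in pyspark.sql.functions since Spark 2.4), which turns an array of arrays into a single flat array. It removes only one level of nesting, which is exactly what an array(array(string)) column needs.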
from pyspark.sql import functions as F
# `spark` is an existing SparkSession; the sample row mirrors the data in the question
df = spark.createDataFrame(
    [("4308c3w994", [["73ndy0-885bns-ysrd", "isgbf-6322-734f4-92j72"]])],
    ("cust_id", "interest_product_id"))
df.withColumn("interest_product_id", F.flatten(F.col("interest_product_id"))).show(truncate=False)
Output
+----------+--------------------------------------------+
|cust_id   |interest_product_id                         |
+----------+--------------------------------------------+
|4308c3w994|[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72]|
+----------+--------------------------------------------+
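To make the per-row behaviour concrete, here is a rough plain-Python model of what flatten does to a single value of the column. It is only an illustrative sketch, not Spark code: the helper name flatten_one_level is made up for this example, and the rule that a null inner array makes the whole result null is my understanding of Spark's behaviour, so verify it against your Spark version.

```python
from typing import List, Optional


def flatten_one_level(nested: Optional[List[Optional[list]]]) -> Optional[list]:
    """Hypothetical helper modelling flatten on one row's array-of-arrays value."""
    # A null column value stays null.
    if nested is None:
        return None
    # Assumption: like Spark, a null inner array makes the whole result null.
    if any(inner is None for inner in nested):
        return None
    # Concatenate the inner arrays in order, removing exactly one nesting level.
    return [item for inner in nested for item in inner]


row_value = [["73ndy0-885bns-ysrd", "isgbf-6322-734f4-92j72"]]
print(flatten_one_level(row_value))
```

If a row held several inner arrays, they would be concatenated in order. A three-level value would come back two levels deep, so a deeper column would need flatten applied once per extra level.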