How do I remove the double brackets that appear after `collect_set`?
Input data:
from pyspark.sql import functions as F
DF = [('1', '[132]'),
('1', '[184, 88]'),
('2', '[55]'),
('2', '[123,33]'),]
DF = spark.sparkContext.parallelize(DF).toDF(['id', 'codes'])
DF.groupBy("id").agg(F.collect_set("codes").alias("codes_concat")).show(4)
+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|[[184, 88], [132]]|
|  2|  [[123,33], [55]]|
+---+------------------+
How do I get a single flat list instead:
+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+
Solution:
You can use the `translate` function to strip the `[` and `]` characters from each string first, and then apply `collect_set` to aggregate. Note that each collected element is still a string (e.g. `"184, 88"`), so the result merely displays like a flat list.
DF.groupBy("id").agg(F.collect_set(F.translate("codes", "[]", "")).alias("codes_concat")).show(4)