I have a column of array type whose arrays vary in length, such as:
[ [1,2,3,4,6] ]
[ [0,4,5,4,6,8,9,1] ]
[ [1,2,3,4,6,2,4,5,6] ]
...
How can I split this so that the first element is separated from the rest, such as:
[ [1] ], [ [2,3,4,6] ]
[ [0] ], [ [4,5,4,6,8,9,1] ]
[ [1] ], [ [2,3,4,6,2,4,5,6] ]
In pure Python I might do something like:
first = values[0]
rest = values[1:]
>Solution :
In PySpark, you can do this with expr(): index the array to get the first element, and use the SQL slice() function to get the rest.
Code:-
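from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName("ArraySplitting").getOrCreate()
# Each row holds a single array-typed column.
data = [
[[1, 2, 3, 4, 6]],
[[0, 4, 5, 4, 6, 8, 9, 1]],
[[1, 2, 3, 4, 6, 2, 4, 5, 6]]
]
columns = ["ArrayColumn"]
df = spark.createDataFrame(data, columns)
# Bracket indexing is 0-based, so ArrayColumn[0] is the first element.
df_split = df.withColumn("FirstIndex", expr("ArrayColumn[0]"))
# slice(array, start, length) is 1-based, so start=2 skips the first element.
# A length longer than what remains simply returns the rest of the array.
df_split = df_split.withColumn("RestArray", expr("slice(ArrayColumn, 2, size(ArrayColumn))"))
df_split.show()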
OUTPUT:-
+--------------------+----------+--------------------+
|         ArrayColumn|FirstIndex|           RestArray|
+--------------------+----------+--------------------+
|     [1, 2, 3, 4, 6]|         1|        [2, 3, 4, 6]|
|[0, 4, 5, 4, 6, 8...|         0|[4, 5, 4, 6, 8, 9...|
|[1, 2, 3, 4, 6, 2...|         1|[2, 3, 4, 6, 2, 4...|
+--------------------+----------+--------------------+
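The indexing is easy to get wrong, since ArrayColumn[0] is 0-based while slice() is 1-based. Here is a rough plain-Python model of what the two expressions compute per row. Note that spark_slice is a hypothetical helper written only for illustration. It is not part of PySpark, and it only covers the positive-start case used above.

```python
def spark_slice(arr, start, length):
    """Simplified stand-in for Spark SQL's slice(arr, start, length).

    Assumption: only positive (1-based) start values are modelled here.
    """
    if start < 1 or length < 0:
        raise ValueError("this sketch only handles start >= 1 and length >= 0")
    begin = start - 1  # convert the 1-based start to a 0-based Python index
    # Python slicing stops at the end of the list, so an oversized
    # length just yields the remaining elements.
    return arr[begin:begin + length]


rows = [
    [1, 2, 3, 4, 6],
    [0, 4, 5, 4, 6, 8, 9, 1],
    [1, 2, 3, 4, 6, 2, 4, 5, 6],
]

# Mirrors FirstIndex = ArrayColumn[0] and
# RestArray = slice(ArrayColumn, 2, size(ArrayColumn)).
split = [(row[0], spark_slice(row, 2, len(row))) for row in rows]

for first, rest in split:
    print(first, rest)
# 1 [2, 3, 4, 6]
# 0 [4, 5, 4, 6, 8, 9, 1]
# 1 [2, 3, 4, 6, 2, 4, 5, 6]
```

If you want the first element wrapped in its own array (as in the [ [1] ] form from the question) rather than as a scalar, slice(ArrayColumn, 1, 1) in the expr() call should give you that instead of ArrayColumn[0].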