pyspark split a Column of variable length Array type into two smaller arrays

If I have a Column of Array type of variable lengths such as:

  [ [1,2,3,4,6] ] 
  [ [0,4,5,4,6,8,9,1] ]  
  [ [1,2,3,4,6,2,4,5,6] ]  
  ...

How can I split this such that the first index is separated from the rest, such as:

  [ [1] ], [ [2,3,4,6] ] 
  [ [0] ], [ [4,5,4,6,8,9,1] ]  
  [ [1] ], [ [2,3,4,6,2,4,5,6] ] 

In pure python I might do something like:

  first_element = lst[0]
  rest = lst[1:]

>Solution :

In PySpark, you can achieve this with expr(): `ArrayColumn[0]` extracts the first element, and `slice()` returns the rest. (Note that `ArrayColumn[0]` yields a scalar; wrap it in `array()` if you need a one-element array as shown in the question.)

Code:-

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("ArraySplitting").getOrCreate()

data = [
    [[1, 2, 3, 4, 6]],
    [[0, 4, 5, 4, 6, 8, 9, 1]],
    [[1, 2, 3, 4, 6, 2, 4, 5, 6]]
]
df = spark.createDataFrame(data, ["ArrayColumn"])

# ArrayColumn[0] picks out the first element; Spark's slice() is 1-based,
# so slice(ArrayColumn, 2, size(ArrayColumn)) returns everything after it.
df_split = (
    df.withColumn("FirstIndex", expr("ArrayColumn[0]"))
      .withColumn("RestArray", expr("slice(ArrayColumn, 2, size(ArrayColumn))"))
)
df_split.show()

OUTPUT:-

+--------------------+----------+--------------------+
|         ArrayColumn|FirstIndex|           RestArray|
+--------------------+----------+--------------------+
|     [1, 2, 3, 4, 6]|         1|        [2, 3, 4, 6]|
|[0, 4, 5, 4, 6, 8...|         0|[4, 5, 4, 6, 8, 9...|
|[1, 2, 3, 4, 6, 2...|         1|[2, 3, 4, 6, 2, 4...|
+--------------------+----------+--------------------+
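The one gotcha is that Spark SQL's `slice()` is 1-based, unlike Python slicing, which is why the start argument above is 2 rather than 1. A quick plain-Python sketch of that semantics (the `spark_slice` helper is hypothetical, just to illustrate the index convention):

```python
def spark_slice(arr, start, length):
    # Mimics Spark SQL's slice(): start is 1-based, length caps the result
    return arr[start - 1:start - 1 + length]

arr = [0, 4, 5, 4, 6, 8, 9, 1]
first = arr[0]                          # scalar first element: 0
rest = spark_slice(arr, 2, len(arr))    # everything after it: [4, 5, 4, 6, 8, 9, 1]
```

Passing `size(ArrayColumn)` as the length is safe because `slice()` simply stops at the end of the array when fewer elements remain.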
