I have a PySpark DataFrame that looks like this:
df = spark.createDataFrame(
    [(0, 'foo'),
     (0, 'bar'),
     (0, 'foo'),
     (0, np.nan),
     (1, 'bar'),
     (1, 'foo'),
    ],
    ['group', 'value'])
df.show()
group value
    0   foo
    0   bar
    0   foo
    0  None
    1   bar
    1   foo
I would like to add rows for each distinct value (variant) of the value column within each group defined by the group column, and then fill the additional rows with that variant. As @samkart mentioned, since there are 4 rows in group 0, there should be 4 foo and 4 bar values within group 0. None values should not be counted as variants, so the result looks like this:
group value
    0   foo
    0   foo
    0   foo
    0   foo
    0   bar
    0   bar
    0   bar
    0   bar
    1   bar
    1   bar
    1   foo
    1   foo
I experimented with counting the variants and then exploding the rows with

df = df.withColumn("n", func.expr("explode(array_repeat(n, int(n)))"))

but I can't figure out a way to fill in the variant values in the desired way.
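For reference, the intended transformation can be modeled in plain Python (no Spark needed). This is only a sketch of the logic: group sizes are counted over all rows, including rows whose value is None, but None itself never appears as a variant.

```python
from collections import Counter

rows = [(0, 'foo'), (0, 'bar'), (0, 'foo'), (0, None),
        (1, 'bar'), (1, 'foo')]

# size of each group, counting rows with a None value as well
group_sizes = Counter(g for g, _ in rows)

# distinct non-None values per group, in first-seen order
variants = {}
for g, v in rows:
    if v is not None:
        variants.setdefault(g, [])
        if v not in variants[g]:
            variants[g].append(v)

# each variant is repeated once per row of its group
expected = [(g, v)
            for g, vs in variants.items()
            for v in vs
            for _ in range(group_sizes[g])]

for row in expected:
    print(row)
```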
>Solution :
You’re close. Here’s a working example using your input data.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

df. \
    withColumn('group_count',
               func.count('group').over(wd.partitionBy('group')).cast('int')
               ). \
    filter(func.col('value').isNotNull()). \
    dropDuplicates(). \
    withColumn('new_val_arr', func.expr('array_repeat(value, group_count)')). \
    selectExpr('group', 'explode(new_val_arr) as value'). \
    show()
# +-----+-----+
# |group|value|
# +-----+-----+
# | 0| foo|
# | 0| foo|
# | 0| foo|
# | 0| foo|
# | 0| bar|
# | 0| bar|
# | 0| bar|
# | 0| bar|
# | 1| bar|
# | 1| bar|
# | 1| foo|
# | 1| foo|
# +-----+-----+
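The core of this answer is array_repeat followed by explode. Their combined effect can be mimicked in plain Python with hypothetical stand-in helpers (no Spark required), starting from the state the DataFrame is in after the window count, null filter, and dropDuplicates steps:

```python
def array_repeat(value, count):
    # like Spark's array_repeat(value, count): an array with `count` copies of `value`
    return [value] * count

def explode(rows_with_arrays):
    # like Spark's explode: one output row per element of the array column
    return [(g, elem) for g, arr in rows_with_arrays for elem in arr]

# after count / filter / dropDuplicates, each (group, value) pair survives
# once, tagged with the size of its group (None rows counted, then dropped)
deduped = [(0, 'foo', 4), (0, 'bar', 4), (1, 'bar', 2), (1, 'foo', 2)]

with_arrays = [(g, array_repeat(v, n)) for g, v, n in deduped]
result = explode(with_arrays)
for row in result:
    print(row)
```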