I have a PySpark DataFrame that looks like this:
df = spark.createDataFrame(
    [(0, 'foo'),
     (0, 'bar'),
     (0, 'foo'),
     (0, np.nan),
     (1, 'bar'),
     (1, 'foo'),
    ],
    ['group', 'value'])
df.show()
group value
    0   foo
    0   bar
    0   foo
    0  None
    1   bar
    1   foo
I would like to add rows for each distinct value (variant) of the value column within each group defined by the group column, and then fill the additional rows with that variant. As @samkart mentioned, since there are 4 rows in group 0, there should be 4 foo and 4 bar values within group 0. None values should not be counted as variants, so the result looks like this:
group value
    0   foo
    0   foo
    0   foo
    0   foo
    0   bar
    0   bar
    0   bar
    0   bar
    1   bar
    1   bar
    1   foo
    1   foo
I experimented with counting the variants and then exploding the rows with

df = df.withColumn("n", func.expr("explode(array_repeat(n, int(n)))"))

but I can't figure out a way to fill in the variant values in the desired way.
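For reference, the intended transformation can be modeled in plain Python (no Spark needed). This is only a sketch of the logic: group sizes are counted over all rows, including rows whose value is None, but None itself never appears as a variant.

```python
from collections import Counter

rows = [(0, 'foo'), (0, 'bar'), (0, 'foo'), (0, None),
        (1, 'bar'), (1, 'foo')]

# size of each group, counting rows with a None value as well
group_sizes = Counter(g for g, _ in rows)

# distinct non-None values per group, in first-seen order
variants = {}
for g, v in rows:
    if v is not None:
        variants.setdefault(g, [])
        if v not in variants[g]:
            variants[g].append(v)

# each variant is repeated once per row of its group
expected = [(g, v)
            for g, vs in variants.items()
            for v in vs
            for _ in range(group_sizes[g])]

for row in expected:
    print(row)
```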
>Solution :
You’re close. Here’s a working example using your input data.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

df. \
    withColumn('group_count',
               func.count('group').over(wd.partitionBy('group')).cast('int')
               ). \
    filter(func.col('value').isNotNull()). \
    dropDuplicates(). \
    withColumn('new_val_arr', func.expr('array_repeat(value, group_count)')). \
    selectExpr('group', 'explode(new_val_arr) as value'). \
    show()
# +-----+-----+
# |group|value|
# +-----+-----+
# | 0| foo|
# | 0| foo|
# | 0| foo|
# | 0| foo|
# | 0| bar|
# | 0| bar|
# | 0| bar|
# | 0| bar|
# | 1| bar|
# | 1| bar|
# | 1| foo|
# | 1| foo|
# +-----+-----+
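The core of this answer is array_repeat followed by explode. Their combined effect can be mimicked in plain Python with hypothetical stand-in helpers (no Spark required), starting from the state the DataFrame is in after the window count, null filter, and dropDuplicates steps:

```python
def array_repeat(value, count):
    # like Spark's array_repeat(value, count): an array with `count` copies of `value`
    return [value] * count

def explode(rows_with_arrays):
    # like Spark's explode: one output row per element of the array column
    return [(g, elem) for g, arr in rows_with_arrays for elem in arr]

# after count / filter / dropDuplicates, each (group, value) pair survives
# once, tagged with the size of its group (None rows counted, then dropped)
deduped = [(0, 'foo', 4), (0, 'bar', 4), (1, 'bar', 2), (1, 'foo', 2)]

with_arrays = [(g, array_repeat(v, n)) for g, v, n in deduped]
result = explode(with_arrays)
for row in result:
    print(row)
```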