I am trying to do some statistical analysis on a dataset and need to create columns of random numbers to replace the original data, so that I can test for statistical significance. How do I add, say, 100 columns of random numbers between 1 and 10,000 to a PySpark DataFrame?
>Solution :
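You can use the rand function, which generates a random decimal number in the range [0.0, 1.0). Multiply it by 10000, then round the result. Note that rounding yields values from 0 to 10000; if you need integers strictly from 1 to 10000, use func.floor(func.rand() * 10000) + 1 instead.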
from pyspark.sql import functions as func

data_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# +---+---+---+---+
# Replace every existing column with a random value: rand() * 10000, rounded to 0 decimals
replaced_data_sdf = data_sdf.select(
    *[func.round(func.rand() * 10000, 0).alias(k) for k in data_sdf.columns]
)
replaced_data_sdf.show()
# +------+------+------+------+
# |    c1|    c2|    c3|    c4|
# +------+------+------+------+
# |8069.0|9059.0|8183.0|6829.0|
# |4114.0|4313.0| 146.0| 528.0|
# |7114.0|8282.0|8032.0|8458.0|
# |8279.0|1421.0|9506.0|2448.0|
# +------+------+------+------+