Add 100 columns with random numbers to pyspark df

August 24, 2022

I am doing some statistical analysis on a dataset and need to create columns of random numbers to replace the original data, so that I can test for statistical significance. How do I add, let's say, 100 columns of random numbers between 1 and 10,000 to a PySpark DataFrame?

Solution:

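You can use the rand function, which generates a uniformly distributed random decimal in the range [0.0, 1.0). Multiply it by 10000, then round the result. Note that this produces whole numbers from 0 to 10000 stored as doubles (e.g. 8069.0), and that the snippet below overwrites the existing columns with random values rather than adding new ones.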

data_sdf.show()

# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# +---+---+---+---+

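from pyspark.sql import functions as func  # alias used by the snippet

# Overwrite each existing column with round(rand() * 10000)
replaced_data_sdf = data_sdf.select(*[func.round(func.rand() * 10000, 0).alias(k) for k in data_sdf.columns])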

replaced_data_sdf.show()

# +------+------+------+------+
# |    c1|    c2|    c3|    c4|
# +------+------+------+------+
# |8069.0|9059.0|8183.0|6829.0|
# |4114.0|4313.0| 146.0| 528.0|
# |7114.0|8282.0|8032.0|8458.0|
# |8279.0|1421.0|9506.0|2448.0|
# +------+------+------+------+