Pyspark calculate average of non-zero elements for each column

 from pyspark.sql import SparkSession
 from pyspark.sql import functions as F

 spark = SparkSession.builder.getOrCreate()
 df = spark.createDataFrame([(0.0, 1.2, -1.3), (0.0, 0.0, 0.0),
                             (-17.2, 20.3, 15.2), (23.4, 1.4, 0.0)],
                            ['col1', 'col2', 'col3'])

 df1 = df.agg(F.avg('col1'))
 df2 = df.agg(F.avg('col2'))
 df3 = df.agg(F.avg('col3'))

If I have a dataframe,

ID  COL1   COL2   COL3
1     0.0    1.2   -1.3
2     0.0    0.0    0.0
3   -17.2   20.3   15.2
4    23.4    1.4    0.0

I want to calculate the mean of the non-zero elements in each column, so the expected result is:

   avg1 avg2 avg3
1   3.1  7.6  6.9

The result of the above code is 1.55, 5.725, and 3.475, because the zero elements are included in the averages.


How can I do it?

> Solution:

Null values do not affect `avg`, so if you convert the zeros to null you get the average of the non-zero values:

(
    df
    .agg(
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()