Pyspark calculate average of non-zero elements for each column

 from pyspark.sql import SparkSession
 from pyspark.sql import functions as F

 spark = SparkSession.builder.getOrCreate()
 df = spark.createDataFrame([(0.0, 1.2, -1.3), (0.0, 0.0, 0.0),
                             (-17.2, 20.3, 15.2), (23.4, 1.4, 0.0)],
                            ['col1', 'col2', 'col3'])

 df1 = df.agg(F.avg('col1'))
 df2 = df.agg(F.avg('col2'))
 df3 = df.agg(F.avg('col3'))

If I have a dataframe,

ID  COL1   COL2   COL3
1     0.0    1.2   -1.3
2     0.0    0.0    0.0
3   -17.2   20.3   15.2
4    23.4    1.4    0.0

I want to calculate the mean of the non-zero elements in each column, so the expected result is:

   avg1 avg2 avg3
1   3.1  7.6  6.9

The result of the above code is 1.55, 5.725, and 3.475, because the zero elements are included in the averages.


How can I do it?

> Solution:

Null values do not affect `avg`, so if you convert the zeros to null you get the average of the non-zero values:

(
    df
    .agg(
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()