from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.2, -1.3), (0.0, 0.0, 0.0),
     (-17.2, 20.3, 15.2), (23.4, 1.4, 0.0)],
    ['col1', 'col2', 'col3'])
# Each agg() returns a one-row DataFrame with the column's average over all rows
df1 = df.agg(F.avg('col1'))
df2 = df.agg(F.avg('col2'))
df3 = df.agg(F.avg('col3'))
If I have a dataframe,

ID   COL1   COL2   COL3
 1    0.0    1.2   -1.3
 2    0.0    0.0    0.0
 3  -17.2   20.3   15.2
 4   23.4    1.4    0.0
I want to calculate the mean of each column, ignoring the zero values:

   avg1  avg2  avg3
1   3.1   7.6   6.9
The result of the above code is 1.55, 5.725, and 3.475, because the zero elements are included in the averaging.
How can I do it?
>Solution:
Null values do not affect the average, so if you turn the zero values into nulls you get the average of the non-zero values:
(
    df
    .agg(
        # Replace zeros with null so avg() skips them
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()
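
If there are many columns, the same when/otherwise trick can be generated for every column instead of writing it out by hand. This is only a sketch building on the answer above (the list comprehension and the f-string aliases are my own additions, and it assumes every column of df is numeric):

# Sketch: apply the zero-to-null replacement to all columns of df at once
df.agg(*[
    F.avg(F.when(F.col(c) == 0, None).otherwise(F.col(c))).alias(f'avg({c})')
    for c in df.columns
]).show()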