Missing data when ordering Pyspark Window

This is my current dataset:

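from pyspark.sql import Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Both rows share Source "1", so they fall into the same window partition
df = spark.createDataFrame([("2","1",1),
                            ("3","1",2)],
                     schema = StructType([StructField("Data",  StringType()),
                                          StructField("Source",StringType()),
                                          StructField("Date",  IntegerType())]))

# display() is the Databricks notebook helper; use .show() elsewhere
display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))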

Output:

Data Source Date Result
2 1 1 ["2"]
3 1 2 ["2","3"]

Why is the value 3 missing from the first row of the Result column when I use collect_set over an ordered Window?

I also tried collect_list, but I get the same result.

My desired output is:

Data Source Date Result
2 1 1 ["2","3"]
3 1 2 ["2","3"]

where the order of values in Result is preserved: the first value comes from the row with Date = 1 and the second from the row with Date = 2.

>Solution :

You need to give the Window an explicit frame that spans the whole partition, from Window.unboundedPreceding to Window.unboundedFollowing:

w = Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

display(df.withColumn("Result", psf.collect_set("Data").over(w)))

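By default, when a Window has an orderBy but no explicit frame, Spark uses rangeBetween(Window.unboundedPreceding, Window.currentRow). Each row therefore only sees the rows up to and including itself in the ordering (plus any rows that tie with it on Date), which is why the first row's Result contains only "2".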
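To make the difference between the two frames concrete, here is a minimal plain-Python sketch that emulates them on the question's two rows. It does not call Spark at all: `collect_over` and its `full_partition` flag are hypothetical names made up for this illustration, and the "running" branch assumes the default frame covers every row in the partition whose Date is at or before the current row's Date.

```python
from collections import defaultdict

# The two rows from the question, both in Source "1"
rows = [
    {"Data": "2", "Source": "1", "Date": 1},
    {"Data": "3", "Source": "1", "Date": 2},
]

def collect_over(rows, full_partition):
    """Emulate collecting Data over a window partitioned by Source, ordered by Date.

    full_partition=False mimics the running frame (start of partition to current row);
    full_partition=True mimics unboundedPreceding to unboundedFollowing.
    Hypothetical helper for illustration only, not part of PySpark.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["Source"]].append(row)

    out = []
    for part in partitions.values():
        part.sort(key=lambda r: r["Date"])
        for current in part:
            if full_partition:
                frame = part  # every row in the partition
            else:
                # rows at or before the current Date (ties included)
                frame = [r for r in part if r["Date"] <= current["Date"]]
            out.append({**current, "Result": [r["Data"] for r in frame]})
    return out

running = collect_over(rows, full_partition=False)
full = collect_over(rows, full_partition=True)

for label, result in (("running frame", running), ("full frame", full)):
    print(label, [r["Result"] for r in result])
```

With the running frame the first row collects only its own value, matching the output in the question; with the full frame every row collects the whole partition in Date order, matching the desired output.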
