Pyspark collect() method gives different order when executing tests in Pycharm or Console

I have some tests on my pytest suite that compare dataframes with assert df1.collect() == df2.collect().

If I execute the code inside the Pycharm IDE, the tests pass; if I execute the tests in the console, an assertion error is raised.

After some debugging, I found that when I execute the tests from the console, the collected rows come back in a different order.

For example, if my dataframe has two rows, this code will pass in Pycharm but it fails in console:

assert df1.collect()[0] == df2.collect()[0]

And this one will fail in Pycharm but it will pass in console:

assert df1.collect()[1] == df2.collect()[0]

I’ve tried invoking pytest both as python3 -m pytest and as plain pytest. Pycharm and the console are using the same venv.

Solution:

To my knowledge, .collect() does not guarantee any order. Since the data is sent to the driver from possibly multiple executors, it could be that one executor is faster than the other. Instead of comparing single elements by position, you should compare the lists as a whole in a way that ignores order, if possible.

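E.g. with assertCountEqual, which is a method of unittest.TestCase rather than a standalone function:

import unittest
unittest.TestCase().assertCountEqual(df1.collect(), df2.collect())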
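In a plain pytest function, the same order-insensitive check can be sketched with collections.Counter. This is a minimal sketch, not the answer's own method: same_rows is a hypothetical helper, and the namedtuple Row is only a stand-in for pyspark.sql.Row (also a tuple subclass) so that the example runs without a Spark session. It assumes every row is hashable, which fails if a column holds lists or dicts.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for pyspark.sql.Row, used so the sketch runs without Spark.
Row = namedtuple("Row", ["id", "name"])


def same_rows(rows_a, rows_b):
    """Return True if both lists hold the same rows, ignoring order.

    Duplicates are respected: a row that appears twice in one list
    must appear twice in the other. Assumes rows are hashable.
    """
    return Counter(rows_a) == Counter(rows_b)


# Simulated results of df1.collect() and df2.collect() arriving in different orders.
collected_1 = [Row(1, "a"), Row(2, "b")]
collected_2 = [Row(2, "b"), Row(1, "a")]

assert collected_1 != collected_2           # positional comparison fails
assert same_rows(collected_1, collected_2)  # order-insensitive comparison passes
```

If the test should also stay readable when it fails, another option is to make the order deterministic before collecting, for example by sorting both dataframes on all columns with df.orderBy(*df.columns).collect() and then comparing the lists with ==.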