I have some tests in my pytest suite that compare DataFrames with assert df1.collect() == df2.collect().
If I run the tests inside the PyCharm IDE they pass, but if I run them from the console an assertion error is raised.
After some debugging, I found that when I run the tests from the console the collected rows come back in a different order.
For example, if my DataFrame has two rows, this code passes in PyCharm but fails in the console:
assert df1.collect()[0] == df2.collect()[0]
And this one fails in PyCharm but passes in the console:
assert df1.collect()[1] == df2.collect()[0]
I’ve tried invoking pytest both with python3 -m pytest and with plain pytest. PyCharm and the console are using the same venv.
>Solution :
To my knowledge, .collect() does not guarantee any order. Since the data is sent to the driver from possibly multiple executors, it could be that one executor is faster than another. Instead of comparing rows position by position, you should compare the two lists as a whole, ignoring order, if possible.
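E.g., with assertCountEqual, which is a unittest.TestCase method (inside a plain pytest function, call it on a TestCase instance after import unittest):
unittest.TestCase().assertCountEqual(df1.collect(), df2.collect())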
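The idea can be tried without a Spark session. The following is a minimal sketch in which plain tuples stand in for the collected rows; `rows_a`, `rows_b` and the helper `same_rows` are hypothetical names used only for illustration, and the `Counter` approach assumes every row is hashable.

```python
import unittest
from collections import Counter

# Stand-ins for df1.collect() and df2.collect(): same rows, different order.
# (Assumption: plain tuples are used here instead of real Spark Row objects.)
rows_a = [("alice", 1), ("bob", 2)]
rows_b = [("bob", 2), ("alice", 1)]


def same_rows(left, right):
    """Return True if both lists hold the same rows the same number of times."""
    # Counter compares contents and multiplicities, not positions.
    return Counter(left) == Counter(right)


# The order-sensitive comparison from the question sees two different lists.
assert rows_a != rows_b

# Both order-insensitive checks accept them as equal.
assert same_rows(rows_a, rows_b)
unittest.TestCase().assertCountEqual(rows_a, rows_b)
```

If the rows are not hashable, assertCountEqual still works, whereas the Counter-based helper would raise a TypeError. Another option, if the order itself matters to the test, is to sort both DataFrames on the same columns before collecting them.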