PySpark collect() method gives different order when executing tests in PyCharm or console

July 1, 2022

I have some tests in my pytest suite that compare DataFrames with assert df1.collect() == df2.collect().

If I run the tests inside the PyCharm IDE, they pass; if I run them from the console, an AssertionError is raised.

After some debugging, I found that when I run the tests from the console, the collected rows come back in a different order.

For example, if my DataFrame has two rows, this code passes in PyCharm but fails in the console:

assert df1.collect()[0] == df2.collect()[0]

And this one fails in PyCharm but passes in the console:

assert df1.collect()[1] == df2.collect()[0]

I’ve tried invoking pytest both as python3 -m pytest and as plain pytest. PyCharm and the console are using the same venv.

>Solution :

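To my knowledge, .collect() does not guarantee any row order unless the DataFrame has been explicitly sorted. Since the data is sent to the driver from possibly multiple executors, one executor may be faster than another, so the order can differ from run to run and between environments. Instead of comparing single elements by position, you should compare the collected lists as a whole, ignoring order, if possible.

E.g., with assertCountEqual, a unittest.TestCase method that checks that two sequences contain the same elements regardless of order:

from unittest import TestCase
TestCase().assertCountEqual(df1.collect(), df2.collect())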
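If you would rather stay with plain pytest-style asserts, the same order-insensitive check can be sketched with collections.Counter. This is only a minimal sketch: the helper name assert_same_rows is hypothetical, plain tuples stand in for the Row objects that collect() returns, and it assumes every value in a row is hashable.

```python
from collections import Counter


def assert_same_rows(left, right):
    """Assert that two row collections hold the same rows, ignoring order.

    Hypothetical helper: duplicates are counted, so [a, a] != [a].
    Assumes each row (and every value inside it) is hashable.
    """
    left_counts, right_counts = Counter(left), Counter(right)
    assert left_counts == right_counts, (
        f"only in left: {list((left_counts - right_counts).elements())}, "
        f"only in right: {list((right_counts - left_counts).elements())}"
    )


# Plain tuples stand in for the rows returned by df1.collect() / df2.collect().
rows_pycharm = [(1, "a"), (2, "b")]
rows_console = [(2, "b"), (1, "a")]  # same rows, different order

assert_same_rows(rows_pycharm, rows_console)  # passes despite the reordering
```

In a real test this would be called as assert_same_rows(df1.collect(), df2.collect()). Another option, if the data has a column that uniquely identifies each row, is to sort both DataFrames with orderBy on that column before calling collect(), so that a positional comparison becomes deterministic.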

pytest

byMR

Published July 01, 2022
