What is the sort order of a Pandas DataFrame?

March 31, 2024

input1

import pandas as pd
import numpy as np
np.random.seed(0)
data = {'item': np.random.choice(['skirt', 'shirt', 'coat'], 6),
        'size': np.random.choice(['S', 'M', 'L', 'XL'], 6)}
df1 = pd.DataFrame(data)

df1:

    item    size
0   skirt   S
1   shirt   XL
2   skirt   L
3   shirt   S
4   shirt   S
5   coat    S

When I sort by size:

df1.sort_values('size')

out:

    item    size
2   skirt   L
0   skirt   S
3   shirt   S
4   shirt   S
5   coat    S
1   shirt   XL

The data is sorted by the size column, and rows with the same size keep their original relative order (a row that was higher in df1 stays higher).

input2

import pandas as pd
import numpy as np
pd.options.display.max_rows = 6
np.random.seed(0)
data1 = {'item': np.random.choice(['skirt', 'shirt', 'coat'], 1000000),
        'size': np.random.choice(['S', 'M', 'L', 'XL'], 1000000)}
df2 = pd.DataFrame(data1)

df2

         item size
0       skirt    M
1       shirt    L
2       skirt    M
...       ...  ...
999997   coat    S
999998  shirt    S
999999  skirt    L

[1000000 rows x 2 columns]

df2 has 1M rows

When I sort by size:

df2.sort_values('size')

out:

         item size
999999  skirt    L   <- why top?
645704  shirt    L
645714  shirt    L
...       ...  ...
822256   coat   XL
699230   coat   XL
400737  skirt   XL

[1000000 rows x 2 columns]

I don’t know why row 999999 is at the top of the sorted df2.

Shouldn’t the existing order be followed if size is the same?

>Solution :

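What you want is a "stable" sort. "Stable" means the sort keeps the current relative order of rows whose keys are identical. The default algorithm, quicksort, is not stable, so rows with equal size values can come out in any order. The fact that df1 kept its order is not something the default guarantees. To get a stable sort, pass kind='stable' (or kind='mergesort') to sort_values.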
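As a minimal sketch, here is a stable sort on data built the same way as df2, using the `kind` parameter of `sort_values`. The smaller row count and the names `n`, `stable_sorted` and `is_stable` are only for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 100_000  # smaller than the 1M rows in the question, just to keep it quick
df = pd.DataFrame({
    'item': np.random.choice(['skirt', 'shirt', 'coat'], n),
    'size': np.random.choice(['S', 'M', 'L', 'XL'], n),
})

# mergesort is a stable algorithm: rows with equal 'size' keep their original order
stable_sorted = df.sort_values('size', kind='mergesort')

# Check: within each size, the original index labels are still increasing
is_stable = all(
    group.index.is_monotonic_increasing
    for _, group in stable_sorted.groupby('size', sort=False)
)
print(is_stable)
```

With a stable sort, the first 'L' row in the output is the 'L' row that came first in the original frame, rather than row 999999 as in the question.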