Python – If statement nested in For Loop is very slow on large dataframes

October 18, 2023

I have a dataframe which looks like:

     A      B
0    Plus   632
1    Min    24
2    Min    99
3    Plus   0.15
...  ...    ...

Now for each row where in Column A the value = ‘Min’, the value in Column B on the same row should be multiplied by -1.
For that I am using:

for i in range(1, len(df)):
    if df.loc[i, 'A'].startswith('Min'):
        df.loc[i, 'B'] = df.loc[i, 'B'] * -1

In essence this works, however with large dataframes it is extremely slow. With 50.000 rows, the processing time is 15 seconds, with 100.000 rows 30 seconds, 200.000 rows 160 seconds, and 300.000 500 seconds (exponential increase).

However, I am working with a dataframe of more than 700.000 rows and it just never finishes. Any suggestions for an alternative to speed this up?

>Solution :

Try to use vectorized operations and avoid using loops for row-by-row processing in pandas:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['Plus', 'Min', 'Min', 'Plus'], 'B': [632, 24, 99, 0.15]})
>>> df
      A       B
0  Plus  632.00
1   Min   24.00
2   Min   99.00
3  Plus    0.15
>>> df.loc[df['A'].str.startswith('Min'), 'B'] *= -1
>>> df
      A       B
0  Plus  632.00
1   Min  -24.00
2   Min  -99.00
3  Plus    0.15

This should be significantly faster.