I want to apply a function to the columns of this DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 6], "b": [10, 100, 100, 100, 100, 1000, 1]})

def scale(values):
    scaled_values = (values - np.min(values)) / (np.max(values) - np.min(values))
    return scaled_values / np.sum(scaled_values)
This function needs to consider all values in the column at once, so I cannot apply it element-wise with pandas' "apply" method. Thus, at the moment, I just call
df.loc[:,"a"] = scale(df["a"])
df.loc[:,"b"] = scale(df["b"])
However, if I have a lot of columns I don't want to do it like this, and looping over the columns is quite ugly:
for c in df.columns:
    df.loc[:, c] = scale(df[c])
I'm wondering if there is a pandas method to "apply" a function to a whole column at once, so that I can get rid of this ugly loop. Any suggestions?
>Solution :
Yes, it is called DataFrame.apply; with the default axis=0 it passes each whole column to the function:
#default axis=0
df = df.apply(scale)
#df = df.apply(scale, axis=0)
print(df)
a b
0 0.000000 0.006410
1 0.047619 0.070513
2 0.095238 0.070513
3 0.142857 0.070513
4 0.190476 0.070513
5 0.238095 0.711538
6 0.285714 0.000000
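If the function only needs NumPy arrays (as scale does), apply's raw=True parameter hands each column to the function as an ndarray instead of a Series, which skips some per-column overhead. A minimal sketch:

```python
import numpy as np
import pandas as pd

def scale(values):
    # works on both Series and plain NumPy arrays
    scaled = (values - np.min(values)) / (np.max(values) - np.min(values))
    return scaled / np.sum(scaled)

df = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 6],
                   "b": [10, 100, 100, 100, 100, 1000, 1]})

# raw=True: scale receives each column as a NumPy array; the result is
# reassembled into a DataFrame with the original index and columns
out = df.apply(scale, raw=True)
```

The values are the same as with the plain df.apply(scale) call; whether raw=True is actually faster depends on the column count and dtype, so it is worth timing on your own data.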
It is also possible to pass all columns together, so performance is better:
df = scale(df)
#alternative
df = df.pipe(scale)
print(df)
a b
0 0.000000 0.006410
1 0.047619 0.070513
2 0.095238 0.070513
3 0.142857 0.070513
4 0.190476 0.070513
5 0.238095 0.711538
6 0.285714 0.000000
Performance:
#[70000 rows x 20 columns]
df = pd.concat([df] * 10000, ignore_index=True)
df = pd.concat([df] * 10, axis=1, ignore_index=True)
In [72]: %timeit df.apply(scale)
57.3 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit df.pipe(scale)
38.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: %timeit scale(df)
38.6 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)