I want to apply a function to the columns of this DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 6], "b": [10, 100, 100, 100, 100, 1000, 1]})

def scale(values):
    scaled_values = (values - np.min(values)) / (np.max(values) - np.min(values))
    return scaled_values / np.sum(scaled_values)
This function needs to consider all values in the column at once, so I cannot apply it element-wise with pandas' "apply" method. Thus, at the moment, I just call
df.loc[:,"a"] = scale(df["a"])
df.loc[:,"b"] = scale(df["b"])
However, if I have a lot of columns I don't want to do it like this, and looping over the columns is quite ugly:
for c in df.columns:
    df.loc[:, c] = scale(df[c])
I'm wondering if there is a pandas method to "apply" a function to a whole column at once, so that I can get rid of this ugly loop. Any suggestions?
>Solution :
Yes, it is called DataFrame.apply; with the default axis=0 it passes each whole column to the function:
#default axis=0
df = df.apply(scale)
#df = df.apply(scale, axis=0)
print(df)
a b
0 0.000000 0.006410
1 0.047619 0.070513
2 0.095238 0.070513
3 0.142857 0.070513
4 0.190476 0.070513
5 0.238095 0.711538
6 0.285714 0.000000
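If the function only needs NumPy arrays (as scale does), apply's raw=True parameter hands each column to the function as an ndarray instead of a Series, which skips some per-column overhead. A minimal sketch:

```python
import numpy as np
import pandas as pd

def scale(values):
    # works on both Series and plain NumPy arrays
    scaled = (values - np.min(values)) / (np.max(values) - np.min(values))
    return scaled / np.sum(scaled)

df = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5, 6],
                   "b": [10, 100, 100, 100, 100, 1000, 1]})

# raw=True: scale receives each column as a NumPy array; the result is
# reassembled into a DataFrame with the original index and columns
out = df.apply(scale, raw=True)
```

The values are the same as with the plain df.apply(scale) call; whether raw=True is actually faster depends on the column count and dtype, so it is worth timing on your own data.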
It is also possible to pass all columns together, so performance is better:
df = scale(df)
#alternative
df = df.pipe(scale)
print(df)
a b
0 0.000000 0.006410
1 0.047619 0.070513
2 0.095238 0.070513
3 0.142857 0.070513
4 0.190476 0.070513
5 0.238095 0.711538
6 0.285714 0.000000
Performance:
#[70000 rows x 20 columns]
df = pd.concat([df] * 10000, ignore_index=True)
df = pd.concat([df] * 10, axis=1, ignore_index=True)
In [72]: %timeit df.apply(scale)
57.3 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit df.pipe(scale)
38.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: %timeit scale(df)
38.6 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)