I’m working on optimizing some code for speed and I understand that .apply within Pandas is not the optimal method for data processing. I’m looking to optimize this code in some way.
Say you have two columns like such
| Set_One | Set_Two |
|---|---|
| ‘Foo’ | ” |
| ” | ‘Bar’ |
| ‘Foo’ | ‘Bar’ |
If I were to want the first truthy value from the two columns in an apply statement, I could just use
df.apply(lambda x : x.Set_One or x.Set_Two)
This would result in
| Result |
|---|
| ‘Foo’ |
| ‘Bar’ |
| ‘Foo’ |
But this slows down as the data set gets larger. Not by much considering that it’s a fairly simple comparison.
The nice solution would be if I could do result = df['Set_One'] or df['Set_Two'] but the truth value of a series is ambiguous. Using a & or | operator doesn’t consider truthiness (or work on strings for that matter).
What is the best practice for this comparison?
So far, I’ve tried just a vectorized function that is functionally just returning the first truthy value.
def return_truthy(val1, val2):
return val1 or val2
vec_truthy = np.vectorize(return_truthy)
vec_truthy(df['Set_One'], df['Set_Two'])
Using this function does result in about a 10x speed improvement, but it doesn’t exactly look elegant.
>Solution :
You can do this using the numpy.logical_or function.
import pandas as pd
import numpy as np
df = pd.DataFrame({"Set_One":["Foo", "", "Foo"],
"Set_Two":["", "Bar", "Bar"]})
print(np.logical_or(df["Set_One"], df["Set_Two"]))
Output:
0 Foo
1 Bar
2 Foo
dtype: object