I have a data frame containing two parameters, val1 and val2.
val1represents a point in a spectrumval2the measured level of this point
Now I want to cluster this spectrum with the following logic:
- Compare
val1with the previous value. Ifprevious_row >= 1.3 * val1 | previous_row <= 0.7 * val1it’s considered a new cluster.
I have no idea how to solve this without looping and as the real-world data is quite big (millions of rows), performance is crucial. Any help is highly appreciated.
Here is a visualized example related to MWE:
MWE
import pandas as pd
import numpy as np
np.random.seed(3)
df = pd.DataFrame({
"val1":np.arange(1,11)
, "val2":np.random.randint(20, 50, 10)
})
df["desired_output"] = [1, 2, 2, 3, 4, 5, 5, 6, 6, 6]
val1 val2 desired_output
0 1 30 1
1 2 44 2
2 3 45 2
3 4 23 3
4 5 44 4
5 6 28 5
6 7 20 6
7 8 41 7
8 9 49 7
9 10 39 7
>Solution :
Access the previous values with shift, compute a mask and use cumsum to form the groups:
prev = df['val2'].shift()
df['out'] = (prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)).cumsum()
Output:
val1 val2 desired_output out
0 1 30 1 0
1 2 44 2 1
2 3 45 2 1
3 4 23 3 2
4 5 44 4 3
5 6 28 5 4
6 7 20 5 5
7 8 41 6 6
8 9 49 6 6
9 10 39 6 6
If you want to start the count from 1:
prev = df['val2'].shift()
m = prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)
df['out'] = m.cumsum().add(1-m[:1].sum())
# or if you only have finite values in the input
prev = df['val2'].shift(fill_value=np.inf)
df['out'] = (prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)).cumsum()
Output:
val1 val2 desired_output out
0 1 30 1 1
1 2 44 2 2
2 3 45 2 2
3 4 23 3 3
4 5 44 4 4
5 6 28 5 5
6 7 20 5 6
7 8 41 6 7
8 9 49 6 7
9 10 39 6 7
