Calculate clusters within a spectrum

byMR

April 29, 2024

I have a data frame containing two parameters, val1 and val2.

val1 represents a point in a spectrum
val2 the measured level of this point

Now I want to cluster this spectrum with the following logic:

Compare val1 with the previous value. If previous_row >= 1.3 * val1 | previous_row <= 0.7 * val1 it’s considered a new cluster.

I have no idea how to solve this without looping and as the real-world data is quite big (millions of rows), performance is crucial. Any help is highly appreciated.

Here is a visualized example related to MWE:

MWE

import pandas as pd
import numpy as np

np.random.seed(3)

df = pd.DataFrame({
    "val1":np.arange(1,11)
    , "val2":np.random.randint(20, 50, 10)
})

df["desired_output"] = [1, 2, 2, 3, 4, 5, 5, 6, 6, 6]

   val1  val2  desired_output
0     1    30               1
1     2    44               2
2     3    45               2
3     4    23               3
4     5    44               4
5     6    28               5
6     7    20               6
7     8    41               7
8     9    49               7
9    10    39               7

>Solution :

Access the previous values with shift, compute a mask and use cumsum to form the groups:

prev = df['val2'].shift()
df['out'] = (prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)).cumsum()

Output:

   val1  val2  desired_output  out
0     1    30               1    0
1     2    44               2    1
2     3    45               2    1
3     4    23               3    2
4     5    44               4    3
5     6    28               5    4
6     7    20               5    5
7     8    41               6    6
8     9    49               6    6
9    10    39               6    6

If you want to start the count from 1:

prev = df['val2'].shift()
m = prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)
df['out'] = m.cumsum().add(1-m[:1].sum())

# or if you only have finite values in the input
prev = df['val2'].shift(fill_value=np.inf)
df['out'] = (prev.ge(df['val2']*1.3) | prev.le(df['val2']*0.7)).cumsum()

Output:

   val1  val2  desired_output  out
0     1    30               1    1
1     2    44               2    2
2     3    45               2    2
3     4    23               3    3
4     5    44               4    4
5     6    28               5    5
6     7    20               5    6
7     8    41               6    7
8     9    49               6    7
9    10    39               6    7