Sum n consecutive entries in a sequence

apologies if the terminology isn’t right, I’m not quite sure how to ask the question accurately.

I would like to summarise my data, which for the purposes of a MWE consists of the time (in seconds) since starting the measurement (called ‘Start scan’ here), and then the total value (called ‘Sum.’ here). I have filtered the data down to those measurements above a certain cutoff. As the measurements rise and fall with a delay, each episode above the cutoff results in a few rows of data. I would like to reduce this down, so that each separate sequence is represented only by the first row in the sequence, and, if at all possible, also to create a new variable for this row containing the original number of rows

Here’s a short example.

|       |   Start scan |        Sum. |
|------:|-------------:|------------:|
| 38191 |          247 | 7.06487e+06 |
| 38192 |          248 | 8.64094e+06 |
| 38193 |          249 | 8.82956e+06 |
| 38194 |          250 | 1.20744e+07 |
| 38195 |          251 | 1.28402e+07 |
| 38216 |          272 | 1.85569e+07 |
| 38217 |          273 | 2.83452e+07 |
| 38218 |          274 | 2.6532e+07  |
| 38219 |          275 | 2.2365e+07  |
| 38220 |          276 | 9.0689e+06  |
| 38230 |          286 | 6.92365e+06 |
| 38231 |          287 | 1.0727e+07  |
| 38232 |          288 | 7.81559e+06 |
| 38247 |          303 | 8.14354e+06 |
| 38381 |          437 | 8.4836e+06  |
| 38382 |          438 | 1.56635e+07 |
| 38383 |          439 | 1.24743e+07 |
| 38384 |          440 | 1.84585e+07 |
| 38385 |          441 | 1.82205e+07 |
| 38386 |          442 | 1.79627e+07 |
| 38387 |          443 | 8.56578e+06 |
| 38388 |          444 | 2.28438e+07 |
| 38389 |          445 | 8.19495e+06 |
| 38391 |          447 | 1.39808e+07 |
| 38392 |          448 | 8.30781e+06 |

I would like this converted to something like:

|       |   Start scan |        Sum. | scans summarised |
|------:|-------------:|------------:|-----------------:|
| 38191 |          247 | 7.06487e+06 |                 5|
| 38216 |          272 | 1.85569e+07 |                 5|
| 38230 |          286 | 6.92365e+06 |                 3|
| 38247 |          303 | 8.14354e+06 |                 1|
| 38381 |          437 | 8.4836e+06  |                 9|
| 38391 |          447 | 1.39808e+07 |                 2|

I’ve tried my best but can’t find a way to do this.

Very grateful for any help or pointers.

>Solution :

You can aggregate by consecutive values created by compare difference by Series.diff with 1 and cumulative sum by Series.cumsum, then aggregate columns in GroupBy.agg by GroupBy.first, GroupBy.sum and DataFrameGroupBy.size:

d = {'Start scan':('Start scan','first'), 
     'Sum.':('Sum.','sum'), 
     'scans summarised':('Sum.','size')}
out = df.groupby(df['Start scan'].diff().ne(1).cumsum()).agg(**d).reset_index(drop=True)
print (out)
   Start scan         Sum.  scans summarised
0         247   49449970.0                 5
1         272  104868000.0                 5
2         286   25466240.0                 3
3         303    8143540.0                 1
4         437  130867630.0                 9
5         447   22288610.0                 2

If original indices are important:

d = {'i':('index','first'), 
     'Start scan':('Start scan','first'),
     'Sum.':('Sum.','sum'), 
     'scans summarised':('Sum.','size')}

df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
          .agg(**d).set_index('i').rename_axis(None))
print (out)
       Start scan         Sum.  scans summarised
38191         247   49449970.0                 5
38216         272  104868000.0                 5
38230         286   25466240.0                 3
38247         303    8143540.0                 1
38381         437  130867630.0                 9
38391         447   22288610.0                 2

Leave a Reply