apologies if the terminology isn’t right, I’m not quite sure how to ask the question accurately.
I would like to summarise my data, which for the purposes of a MWE consists of the time (in seconds) since starting the measurement (called ‘Start scan’ here), and then the total value (called ‘Sum.’ here). I have filtered the data down to those measurements above a certain cutoff. As the measurements rise and fall with a delay, each episode above the cutoff results in a few rows of data. I would like to reduce this down, so that each separate sequence is represented only by the first row in the sequence, and, if at all possible, also to create a new variable for this row containing the original number of rows
Here’s a short example.
| | Start scan | Sum. |
|------:|-------------:|------------:|
| 38191 | 247 | 7.06487e+06 |
| 38192 | 248 | 8.64094e+06 |
| 38193 | 249 | 8.82956e+06 |
| 38194 | 250 | 1.20744e+07 |
| 38195 | 251 | 1.28402e+07 |
| 38216 | 272 | 1.85569e+07 |
| 38217 | 273 | 2.83452e+07 |
| 38218 | 274 | 2.6532e+07 |
| 38219 | 275 | 2.2365e+07 |
| 38220 | 276 | 9.0689e+06 |
| 38230 | 286 | 6.92365e+06 |
| 38231 | 287 | 1.0727e+07 |
| 38232 | 288 | 7.81559e+06 |
| 38247 | 303 | 8.14354e+06 |
| 38381 | 437 | 8.4836e+06 |
| 38382 | 438 | 1.56635e+07 |
| 38383 | 439 | 1.24743e+07 |
| 38384 | 440 | 1.84585e+07 |
| 38385 | 441 | 1.82205e+07 |
| 38386 | 442 | 1.79627e+07 |
| 38387 | 443 | 8.56578e+06 |
| 38388 | 444 | 2.28438e+07 |
| 38389 | 445 | 8.19495e+06 |
| 38391 | 447 | 1.39808e+07 |
| 38392 | 448 | 8.30781e+06 |
I would like this converted to something like:
| | Start scan | Sum. | scans summarised |
|------:|-------------:|------------:|-----------------:|
| 38191 | 247 | 7.06487e+06 | 5|
| 38216 | 272 | 1.85569e+07 | 5|
| 38230 | 286 | 6.92365e+06 | 3|
| 38247 | 303 | 8.14354e+06 | 1|
| 38381 | 437 | 8.4836e+06 | 9|
| 38391 | 447 | 1.39808e+07 | 2|
I’ve tried my best but can’t find a way to do this.
Very grateful for any help or pointers.
>Solution :
You can aggregate by consecutive values created by compare difference by Series.diff
with 1
and cumulative sum by Series.cumsum
, then aggregate columns in GroupBy.agg
by GroupBy.first
, GroupBy.sum
and DataFrameGroupBy.size
:
d = {'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
out = df.groupby(df['Start scan'].diff().ne(1).cumsum()).agg(**d).reset_index(drop=True)
print (out)
Start scan Sum. scans summarised
0 247 49449970.0 5
1 272 104868000.0 5
2 286 25466240.0 3
3 303 8143540.0 1
4 437 130867630.0 9
5 447 22288610.0 2
If original indices are important:
d = {'i':('index','first'),
'Start scan':('Start scan','first'),
'Sum.':('Sum.','sum'),
'scans summarised':('Sum.','size')}
df1 = df.reset_index()
out = (df1.groupby(df1['Start scan'].diff().ne(1).cumsum())
.agg(**d).set_index('i').rename_axis(None))
print (out)
Start scan Sum. scans summarised
38191 247 49449970.0 5
38216 272 104868000.0 5
38230 286 25466240.0 3
38247 303 8143540.0 1
38381 437 130867630.0 9
38391 447 22288610.0 2