Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to identify non-contiguous ranges in Pandas groupby?

Consider the following example, where I attempt to automatically extract contiguous ranges of data samples, based on numrun and test columns:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

# to demonstrate problem, replace lines 8,9,10 below with:
# 63182.8423443,0,ZZ,661,615,767
# 63183.1940174,0,ZZ,661,615,767
# 63183.6097631,0,ZZ,661,615,767
TESTDATA = StringIO("""tstamp,numrun,test,valA,valB,valC
63180.6208982,0,AA,661,615,767
63180.9724180,0,AA,661,615,767
63181.3240476,0,AA,661,615,767
63181.7246253,0,ZZ,661,615,767
63182.0752904,0,BB,661,615,767
63182.4280107,0,BB,661,615,767
63182.8423443,0,CC,661,615,767
63183.1940174,0,CC,661,615,767
63183.6097631,0,CC,661,615,767
63183.9612234,1,AA,661,615,767
63184.3441680,1,AA,661,615,767
63184.6971598,1,AA,661,615,767
63185.0964634,1,BB,661,615,767
63185.4480095,1,BB,660,615,767
63185.8035320,1,BB,661,615,767
63186.1987603,1,BB,660,615,767
63186.5500834,1,BB,661,615,767
63186.9661803,1,BB,661,615,767
63187.3186681,2,AA,660,615,767
63187.7182817,2,AA,660,615,767
63188.0696054,2,AA,660,615,767
63188.4689450,2,ZZ,660,615,767
63188.8204257,2,ZZ,660,615,767
63189.1719920,2,ZZ,660,615,767
63189.5240004,2,BB,660,615,767
63189.9121130,2,BB,660,615,767
63190.2593203,2,BB,660,615,767
""")

df_data = pd.read_csv(TESTDATA)
#print(df_data.head())

# use sort=False, else tstamp order gets mesed up!
grouped = df_data.groupby(['numrun', 'test'], sort=False)
for igx, (group_name, df_grouped) in enumerate(grouped):
  firstlastrow = df_grouped.iloc[[0, -1]]
  ttime_start = firstlastrow.iloc[0]["tstamp"]
  ttime_end = firstlastrow.iloc[1]["tstamp"]
  deltasec = ttime_end - ttime_start
  with pd.option_context('float_format', '{:f}'.format):
    print("{:02d}: {}: {:.2f} sec ({} - {})".format(igx+1, group_name, deltasec, ttime_start, ttime_end ))

The above example, in fact, works great – when I run it, I get the expected printout:

$ python3 /tmp/test.py
01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (0, 'CC'): 0.77 sec (63182.8423443 - 63183.6097631)
05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

However, my actual data, has entries labeled (say) numrun = 0 and test=’ZZ’, which can be seen as non-contiguous: that is, you might get a range of 0/ZZ labeled data samples, then another label might follow, and then again a range of 0/ZZ labeled data samples appear.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

To simulate that, replace the lines in TESTDATA as noted in the to demonstrate problem ... comment in the code, and run the script again. I get:

$ python3 /tmp/test.py
01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 1.89 sec (63181.7246253 - 63183.6097631)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
05: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
06: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
07: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
08: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

Note that only one (0, ‘ZZ’) range is identified, whereas the change in the data introduced one more (so two in total).

Is it somehow possible to instruct Pandas, to extract all the ranges in the above data – and where there is a repeated combination of labels, maybe add a random or counter string to the name, so that all contiguous ranges are identified regardless?

Basically the desired output of the changed data would be something like this:

01: (0, 'AA'): 0.70 sec (63180.6208982 - 63181.3240476)
02: (0, 'ZZ'): 0.00 sec (63181.7246253 - 63181.7246253)
03: (0, 'BB'): 0.35 sec (63182.0752904 - 63182.4280107)
04: (0, 'ZZ+1'): 0.77 sec (63182.8423443 - 63183.6097631)
05: (1, 'AA'): 0.74 sec (63183.9612234 - 63184.6971598)
06: (1, 'BB'): 1.87 sec (63185.0964634 - 63186.9661803)
07: (2, 'AA'): 0.75 sec (63187.3186681 - 63188.0696054)
08: (2, 'ZZ'): 0.70 sec (63188.468945 - 63189.171992)
09: (2, 'BB'): 0.74 sec (63189.5240004 - 63190.2593203)

>Solution :

You can create consecutive groups by both columns and add to groupby:

g = df_data[['numrun', 'test']].ne(df_data[['numrun', 'test']].shift()).any(axis=1).cumsum()

grouped = df_data.groupby(['numrun', 'test', g], sort=False)
for igx, (group_name, df_grouped) in enumerate(grouped):
    #your code
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading