Home How to split a DataFrame into separate groups based on a Value(Difference Value of Time)

Questions

How to split a DataFrame into separate groups based on a Value(Difference Value of Time)

byMR

January 20, 2024

I’m now using Pandas in Python to handle some data.

Simplified DataFrame is

[ID, TimeDiff] and some other not-important columns.

For example:

73    1.166667
74    1.166667
75    2.183333
76    3.466667
77    2.666667
78    Na

TimeDiff means the difference value of time in data ID. 3.466667 means the time between ID 76 and ID 77 is 3.466667 hours.
I want to split vessel data to make sure the difference value of time is within 2 hours
so I need to split DataFrame into N different groups (in this example, N = 4).

I need to get results like this: whenever TimeDiff >= 2 create another group

--------------
73    1.166667
74    1.166667
-------------
75    2.183333
--------------
76    3.466667
--------------
77         NaN

I have tried to use Groupby in pandas.

df.groupby('TimeDiff')

But obviously, this is not what I want.
I’m now trying to split DataFrame step by step like this:
From

73    1.166667
74    1.166667
--------------
75    2.183333
76    3.466667
77    2.666667
78    Na

73    1.166667
74    1.166667
--------------
75    2.183333
--------------
76    3.466667
77    2.666667
78    Na

Then To

73    1.166667
74    1.166667
--------------
75    2.183333
--------------
76    3.466667
77    2.666667
78    Na

……

Finally to what I want:

--------------
73    1.166667
74    1.166667
-------------
75    2.183333
--------------
76    3.466667
--------------
77         NaN

4 groups Data. But after searching Google and StackOverflow I didn’t find a proper way to handle it. Can somebody help me?

>Solution :

You can use cumsum to create your groups:

df['Group'] = df['TimeDiff'].fillna(np.inf).ge(2).cumsum()
print(df)

# Output
   ID  TimeDiff  Group
0  73  1.166667      0
1  74  1.166667      0
2  75  2.183333      1
3  76  3.466667      2
4  77  2.666667      3
5  78       NaN      4

Using groupby:

>>> list(df.groupby(df['TimeDiff'].fillna(np.inf).ge(2).cumsum()))
[(0,
     ID  TimeDiff  Group
  0  73  1.166667      0
  1  74  1.166667      0),

 (1,
     ID  TimeDiff  Group
  2  75  2.183333      1),

 (2,
     ID  TimeDiff  Group
  3  76  3.466667      2),

 (3,
     ID  TimeDiff  Group
  4  77  2.666667      3),

 (4,
     ID  TimeDiff  Group
  5  78       NaN      4)]