Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to add a column or change data in each group after using group by in Pandas?

I am now using Pandas to handle some data. After I used group by in pandas, the simplified DataFrame’s format is [MMSI(Vessel_ID), BaseTime, Location, Speed, Course,…].

I use

for MMSI, group in grouped_df:
    print(MMSI)
    print(group)

to print the data.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

For example, one group of data is:

             MMSI         BaseDateTime       LAT        LON  SOG  COG
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6   

I want to add a column which is the time difference of two points.

Below is the Output I want

             MMSI         BaseDateTime       LAT        LON  SOG  COG   Time-diff
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   3.0(hours)
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6   Na

So I use the code below to try to get the result:

for MMSI, group in grouped_df:
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # When Time-diff >= 2hours split them into different group

I can use print to show group[‘GROUP’] and group[‘time-diff’]. The result is not shown after I tried to visit grouped_df again. There’s a warning showing that my group in grouped_df is just a copy of a slice from a DataFrame and it recommend me using .loc[row_indexer,col_indexer] = value instead. But in this case I don’t know how to use .loc to visit the specific [row,col].

At the very beginning, I tried to use

  grouped_df['new-time'] = grouped_df.shift(-1)['BaseDateTime']
  grouped_df.dropna()

But it shows

'DataFrameGroupBy' object does not support item assignment

Now my solution is create an empty_df and then concatenate the groups in grouped_df step by step like this:

df['time-diff'] = pd.Series(dtype='float64')
df['GROUP'] = pd.Series(dtype='int')
grouped_df = df.groupby('MMSI')
for MMSI, group in grouped_df:

    # ... as the same as the code above
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # ... as the same as the code above

    frame = [empty_df, group]
    empty_df = pd.concat(frames)

I am not satisfied with this solution but I didn’t find a proper way to change the value in grouped_df.

I’m now trying to use the solution from this question to handle the DataFrame before group by.

Can someone help me?

>Solution :

Don’t use a loop, directly go with groupby.shift:

s = pd.to_datetime(df['BaseDateTime'])

df['Time-diff'] = (s.groupby(df['MMSI']).shift(-1)
                    .sub(s).dt.total_seconds().div(3600)
                  )

Or groupby.diff:

s = pd.to_datetime(df['BaseDateTime'])

df['Time-diff'] = (s.groupby(df['MMSI']).diff(-1)
                    .mul(-1).dt.total_seconds().div(3600)
                  )

Output:

             MMSI         BaseDateTime       LAT        LON  SOG  COG  Time-diff
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
1509  538007510.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1510  538007510.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
1511  538007511.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1523  538007511.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading