How can I group by a datetime column with timezone?

import pandas as pd

df = pd.DataFrame(
    {
        'StartDate':['2020-01-01 00:00:00-04:00', '2020-01-01 01:00:00-04:00', '2020-01-01 01:55:00-04:00', '2020-01-02 02:00:00-02:00', '2020-01-02 02:00:00-04:00'],
        'Weight':[100, 110, 120, 125, 155]
    }
)
df['StartDate'] = pd.to_datetime(df['StartDate'])

df

df1

I want to group the data by the hour and sum up the Weight column. So, the end result would be a df with 3 rows: current index 0, current indexes 1&2, current indexes 3&4.

I came across the Grouper function and I tried the following but it didn’t work:

df = df.groupby(pd.Grouper(key='StartDate', freq='H')).sum()

I get the following error:

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or
PeriodIndex, but got an instance of ‘Index’

Does anyone know what I’m doing wrong or can someone provide a solution?

Thanks

>Solution :

You first need to convert to datetime, taking into account the timezones:

df['StartDate'] = pd.to_datetime(df['StartDate'], utc=True)

df.groupby(pd.Grouper(key='StartDate', freq='H')).sum()

Output:

                           Weight
StartDate                        
2020-01-01 04:00:00+00:00     100
2020-01-01 05:00:00+00:00     230
2020-01-01 06:00:00+00:00       0
2020-01-01 07:00:00+00:00       0
2020-01-01 08:00:00+00:00       0
2020-01-01 09:00:00+00:00       0
2020-01-01 10:00:00+00:00       0
2020-01-01 11:00:00+00:00       0
2020-01-01 12:00:00+00:00       0
2020-01-01 13:00:00+00:00       0
2020-01-01 14:00:00+00:00       0
2020-01-01 15:00:00+00:00       0
2020-01-01 16:00:00+00:00       0
2020-01-01 17:00:00+00:00       0
2020-01-01 18:00:00+00:00       0
2020-01-01 19:00:00+00:00       0
2020-01-01 20:00:00+00:00       0
2020-01-01 21:00:00+00:00       0
2020-01-01 22:00:00+00:00       0
2020-01-01 23:00:00+00:00       0
2020-01-02 00:00:00+00:00       0
2020-01-02 01:00:00+00:00       0
2020-01-02 02:00:00+00:00       0
2020-01-02 03:00:00+00:00       0
2020-01-02 04:00:00+00:00     125
2020-01-02 05:00:00+00:00       0
2020-01-02 06:00:00+00:00     155

without "blanks"

df.groupby(pd.to_datetime(df['StartDate'], utc=True).dt.floor('h'))['Weight'].sum()

StartDate
2020-01-01 04:00:00+00:00    100
2020-01-01 05:00:00+00:00    230
2020-01-02 04:00:00+00:00    125
2020-01-02 06:00:00+00:00    155
Name: Weight, dtype: int64

Leave a Reply