Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Resampling data only if value is present

import pandas as pd

data = {'time': ['09:30:00',
                 '09:31:00',
                 '09:37:00',
                 '09:38:00',
                 '09:39:00',
                 '09:40:00',
                 '09:46:00',
                 '09:47:00',
                 '09:48:00'],
         'sum': [5, 8, 5, 10, 15, 2, 0, 0, 0]}
my_index = pd.MultiIndex.from_arrays([["A"]*6 + ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)

If I do resampling for every 3 minutes with sum(),i.e.->
data=data.set_index(‘time’).groupby([‘ID-A’,’ID-B’]).resample(‘3min’).sum()
There is a window({9:33 – 9:36},{9:42 – 9:45}) for which the sum() comes out to be 0. There are some values in my dataframe which actually evaluate to 0 even after the time windows are available for it(9:45-9:48). I do not want resampling for time windows where there is no data available.
I want to find out windows where my sum is 0 but due to resampling I am getting fake 0s in my data as there are no data available for those time.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Use min_count=1 parameter in sum and then remove missing values:

df['time'] = pd.to_timedelta(df['time'])

df = df.resample('3Min', on='time').sum(min_count=1).dropna()
print (df)
                  sum
time                 
0 days 09:30:00  13.0
0 days 09:36:00  15.0
0 days 09:39:00  17.0
0 days 09:45:00   0.0
0 days 09:48:00   0.0

Details:

print (df.resample('3Min', on='time').sum(min_count=1))
                  sum
time                 
0 days 09:30:00  13.0
0 days 09:33:00   NaN
0 days 09:36:00  15.0
0 days 09:39:00  17.0
0 days 09:42:00   NaN
0 days 09:45:00   0.0
0 days 09:48:00   0.0

EDIT: Solution per groups:

df['time'] = pd.to_timedelta(df['time'])

data = (df.reset_index()
          .set_index('time')
          .groupby(['ID-A','ID-B'])['sum']
          .resample('3min')
          .sum(min_count=1)
          .dropna())
print (data)
ID-A  ID-B  time           
A     1     0 days 09:30:00    13.0
            0 days 09:36:00     5.0
      2     0 days 09:38:00    27.0
B     1     0 days 09:46:00     0.0
Name: sum, dtype: float64
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading