Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

setting -t, +t around the event date, NaN giving a headache

I have a df

id   date   eventdate
A    2020Q1 2020Q3
A    2020Q2 2020Q3
A    2020Q3 2020Q3
A    2020Q4 2020Q3
B    2019Q1 2019Q2
B    2019Q2 2019Q2
B    2019Q3 2019Q2
B    2019Q4 2019Q2
C    2020Q1 NaN
C    2020Q2 NaN
C    2020Q3 NaN
C    2020Q4 NaN
D    2019Q2 NaN
D    2019Q3 NaN
D    2019Q4 NaN
...

I want to calculate the time index around the event date such that if both dates match, t=0, and -t and +t around the event date.

desired df

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

id   date   eventdate  t
A    2020Q1 2020Q3     -2
A    2020Q2 2020Q3     -1
A    2020Q3 2020Q3      0
A    2020Q4 2020Q3      1
B    2019Q1 2019Q2     -1
B    2019Q2 2019Q2      0
B    2019Q3 2019Q2      1
B    2019Q4 2019Q2      2
C    2020Q1 NaN         NaN
C    2020Q2 NaN         NaN
C    2020Q3 NaN         NaN
C    2020Q4 NaN         NaN
D    2019Q2 NaN         NaN
D    2019Q3 NaN         NaN
D    2019Q4 NaN         NaN

So i use the following code

df['t'] = (pd.PeriodIndex(df['date'], freq='Q').astype('int') - pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))

and i get strangely big negative numbers for which eventdates aren’t available

count    8.812475e+06
mean    -4.565033e+18
std      4.611450e+18
min     -9.223372e+18
25%     -9.223372e+18
50%     -8.000000e+00
75%      3.000000e+00
max      2.300000e+01
Name: t, dtype: float64

Any idea what’s causing this issue? thanks.

>Solution :

Problem is that pd.PeriodIndex converts NaN value to NaT. When you convert a NaT value to int with .astype('int'), it gives -9223372036854775808.

You can check the data is NaT when accessing the n attribute

df['t'] = ((pd.PeriodIndex(df['date'], freq='Q') -
            pd.PeriodIndex(df['eventdate'], freq='Q'))
           .map(lambda x: np.nan if x != x else x.n))
           # or use pd.isna
           # .map(lambda x: np.nan if pd.isna(x) else x.n)

Or mask the NaN value

df['t'] = ((pd.PeriodIndex(df['date'], freq='Q').astype('int') -
            pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))
           .to_series(range(len(df))).mask(df[['date', 'eventdate']].isna().any(axis=1)))
print(df)

   id    date eventdate    t
0   A  2020Q1    2020Q3 -2.0
1   A  2020Q2    2020Q3 -1.0
2   A  2020Q3    2020Q3  0.0
3   A  2020Q4    2020Q3  1.0
4   B  2019Q1    2019Q2 -1.0
5   B  2019Q2    2019Q2  0.0
6   B  2019Q3    2019Q2  1.0
7   B  2019Q4    2019Q2  2.0
8   C  2020Q1       NaN  NaN
9   C  2020Q2       NaN  NaN
10  C  2020Q3       NaN  NaN
11  C  2020Q4       NaN  NaN
12  D  2019Q2       NaN  NaN
13  D  2019Q3       NaN  NaN
14  D  2019Q4       NaN  NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading