Copy the row in pandas from the nearest year

Let's say that I have the following datasets.

import numpy as np
import pandas as pd

d = {'name': ['a', 'a', 'a'], 'year': [2000, 2001, 2002], 'value': [10, 17, 22]}
data_frame = pd.DataFrame(data=d)
data_frame

d1 = {'name': ['a', 'a', 'a'], 'year': [2001, 2002, 2003], 'value': [10, 17, 22]}
data_frame1 = pd.DataFrame(data=d1)
data_frame1

d2 = {'name': ['a', 'a', 'a'], 'year': [2000, 2002, 2003], 'value': [10, 17, 22]}
data_frame2 = pd.DataFrame(data=d2)
data_frame2

What I want to do is the following:

  1. data_frame is missing observations for 2003, so I want to copy the row for 2002 (the nearest available year, data_frame[data_frame['year']==2002]), append it to data_frame, and replace 2002 with 2003. The rest stays the same.

  2. data_frame1 is missing observations for 2000. I want to do the same as in 1, but in this case use the row for 2001.

  3. data_frame2 has no observations for 2001. Here I want to use the earlier year, i.e. 2000.

Is there a way to do this without inspecting each data frame separately, for example by writing a general function?

Thanks.

Solution:

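Use DataFrame.reindex over the full range of years with the parameter method='nearest'. Note that for data_frame2 the missing year 2001 is equally close to 2000 and 2002, and the tie is resolved in favor of the next year (2002), not the previous one: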

dfs = [data_frame, data_frame1, data_frame2]

for df in dfs:
    # Index by year, then fill every missing year from the nearest existing year
    df = df.set_index('year').reindex(range(2000, 2004), method='nearest')
    print(df)

         name  value
    year            
    2000    a     10
    2001    a     17
    2002    a     22
    2003    a     22
         name  value
    year            
    2000    a     10
    2001    a     10
    2002    a     17
    2003    a     22
         name  value
    year            
    2000    a     10
    2001    a     17
    2002    a     17
    2003    a     22
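
The tie in data_frame2 can be checked in isolation. The sketch below compares method='nearest' with method='ffill' for the year 2001 only; the variable names by_year, value_nearest and value_previous are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

data_frame2 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2000, 2002, 2003], "value": [10, 17, 22]}
)
by_year = data_frame2.set_index("year")

nearest = by_year.reindex(range(2000, 2004), method="nearest")
previous = by_year.reindex(range(2000, 2004), method="ffill")

# 2001 is one year away from both 2000 and 2002.
value_nearest = nearest.loc[2001, "value"]    # copied from 2002 (tie goes to the larger year)
value_previous = previous.loc[2001, "value"]  # copied from 2000 (last earlier year)
print(value_nearest, value_previous)
```

If the row for 2001 should come from 2000, as the question asks, method='nearest' alone is therefore not enough.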

If you need the previous year whenever it exists, and the next year only as a fallback, forward-fill during the reindex and then back-fill what is left:

for df in dfs:
    # ffill copies the previous year; bfill covers years before the first observation
    df = df.set_index('year').reindex(range(2000, 2004), method='ffill').bfill()
    print(df)

         name  value
    year            
    2000    a     10
    2001    a     17
    2002    a     22
    2003    a     22
         name  value
    year            
    2000    a   10.0
    2001    a   10.0
    2002    a   17.0
    2003    a   22.0
         name  value
    year            
    2000    a     10
    2001    a     10
    2002    a     17
    2003    a     22
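
Since the question asks for a general function, the forward-fill and back-fill approach could be wrapped roughly as follows. This is only a sketch: the name fill_missing_years and its parameters are hypothetical, and it assumes each year occurs at most once in the frame and that the frame is not empty.

```python
import pandas as pd


def fill_missing_years(df, first_year, last_year, year_col="year"):
    """Return a copy of df with one row per year in [first_year, last_year].

    Hypothetical helper: a missing year copies the previous available year,
    or the next one when no previous year exists.
    """
    all_years = pd.RangeIndex(first_year, last_year + 1, name=year_col)
    indexed = df.set_index(year_col).sort_index()
    filled = indexed.reindex(all_years, method="ffill").bfill()
    # Leading gaps pass through NaN, which turns integer columns into floats;
    # restore the original dtypes now that every gap is filled.
    filled = filled.astype(indexed.dtypes.to_dict())
    # Bring the year back as a column, in the original column order.
    return filled.reset_index()[list(df.columns)]


data_frame1 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2001, 2002, 2003], "value": [10, 17, 22]}
)
result = fill_missing_years(data_frame1, 2000, 2003)
print(result)
```

Casting back with astype is a design choice to keep value as integers; drop that line if float output is acceptable.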

Solution per group, for frames that contain more than one name:

for df in dfs:
    # Reindex each name separately so rows are never copied across names
    df = (df.set_index('year').groupby('name', group_keys=False)
            .apply(lambda x: x.reindex(range(2000, 2004), method='ffill').bfill()))
    print(df)

         name  value
    year            
    2000    a     10
    2001    a     17
    2002    a     22
    2003    a     22
         name  value
    year            
    2000    a   10.0
    2001    a   10.0
    2002    a   17.0
    2003    a   22.0
         name  value
    year            
    2000    a     10
    2001    a     10
    2002    a     17
    2003    a     22
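
The three sample frames contain a single name, so the per-group version gives the same result as before. The sketch below uses made-up data with two names (the frame multi and its values are an assumption for illustration) and an explicit loop over the groups, which avoids relying on how apply treats the grouping column.

```python
import pandas as pd

# Hypothetical data: 'a' lacks 2001, 'b' lacks 2000 and 2003.
multi = pd.DataFrame(
    {
        "name": ["a", "a", "a", "b", "b"],
        "year": [2000, 2002, 2003, 2001, 2002],
        "value": [10, 17, 22, 5, 7],
    }
)
all_years = pd.RangeIndex(2000, 2004, name="year")

pieces = []
for _, group in multi.groupby("name"):
    # Fill gaps within one name only: previous year first, next year as fallback.
    filled = (
        group.set_index("year")
        .sort_index()
        .reindex(all_years, method="ffill")
        .bfill()
    )
    pieces.append(filled.reset_index())

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Here 'b' gets its 2000 row from 2001 and its 2003 row from 2002, without picking up any values from 'a'.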