Create new column by selecting dataframe subset based on current row value and sum another column

Let's say we have a sample dataframe that looks like this:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2, 12],
                   'date': ['2020-03-01', '2020-01-02', '2020-01-03',
                            '2020-01-04', '2019-01-05', '2019-01-06',
                            '2020-01-07', '2020-01-08', '2020-01-09',
                            '2020-01-10', '2020-01-11']})
# Columns to aggregate
cols = ['num_posts']

Now I need to create features such as sum(num_posts) for the last week of the previous year, the ratio of the last week of the previous year to the last month of the previous year, and so on.

This is what I am doing right now:

def create_calendar_columns(df,dt):
    df['day_number']= pd.to_datetime(df[dt]).dt.dayofyear
    df['week_number']= pd.to_datetime(df[dt]).dt.isocalendar().week
    df['year']= pd.to_datetime(df[dt]).dt.year
    return df

and then

def calculate_curr_year_sum(df, cols, dt):
    df = create_calendar_columns(df, dt)
    df[dt] = pd.to_datetime(df[dt])
    for col in cols:
        # sum of 'col' for all rows having the same year as the current row
        df['curr_year_sum_' + str(col)] = df.loc[df['year'] == df[dt].dt.year, col].sum()

But this does not select the subset of the dataframe that has the same year value. Instead, it takes the sum of the whole dataframe and inserts the value 65, which is the total of "num_posts", in every row.

How can I select the subset of the dataframe whose rows have the same year as the current row?

>Solution:

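That’s because df['year']==df[dt].dt.year is True everywhere: the year column was derived from df[dt] itself, so the comparison checks each row against its own year. The mask therefore selects the whole dataframe, and .sum() returns a single number (65) that is written to every row. You never iterate over the different years, so you won’t see different results for different years. What I think you want to do is a groupby transformation: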

>>> df.groupby("year")["num_posts"].transform("sum")
0     50
1     50
2     50
3     50
4     15
5     15
6     50
7     50
8     50
9     50
10    50
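
To tie this back to the function in the question, here is a minimal sketch of how the groupby transformation could replace the boolean-mask sum. The output column name follows the question's curr_year_sum_ prefix; the df.copy() call and the mask_all_true check are assumptions added for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2, 12],
    'date': ['2020-03-01', '2020-01-02', '2020-01-03',
             '2020-01-04', '2019-01-05', '2019-01-06',
             '2020-01-07', '2020-01-08', '2020-01-09',
             '2020-01-10', '2020-01-11'],
})
cols = ['num_posts']


def calculate_curr_year_sum(df, cols, dt):
    """Add a curr_year_sum_<col> column holding the per-year total of each col."""
    df = df.copy()  # assumption: leave the caller's dataframe untouched
    df[dt] = pd.to_datetime(df[dt])
    df['year'] = df[dt].dt.year
    for col in cols:
        # transform('sum') computes one sum per year and broadcasts it
        # back to every row that belongs to that year
        df['curr_year_sum_' + str(col)] = df.groupby('year')[col].transform('sum')
    return df


result = calculate_curr_year_sum(df, cols, 'date')

# The mask used in the question compares each row's year with itself,
# so it is True for every row and selects the whole dataframe
mask_all_true = bool((result['year'] == result['date'].dt.year).all())

print(result[['date', 'year', 'curr_year_sum_num_posts']])
```

Each 2020 row gets 50 and each 2019 row gets 15, matching the output above. The same pattern should extend to other calendar groupings (for example grouping on both year and week_number), though the "previous year" features mentioned in the question would additionally need the yearly totals shifted or merged back by year.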