Create new column by selecting dataframe subset based on current row value and sum another column


Let's say we have a sample dataframe that looks like this:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2,12],
                   'date' : ['2020-03-01', '2020-01-02', '2020-01-03', 
                            '2020-01-04', '2019-01-05', '2019-01-06', 
                            '2020-01-07', '2020-01-08', '2020-01-09', 

Now I need to create features such as sum(num_posts) for the last week of the previous year, the ratio of the last week of the previous year to the last month of the previous year, etc.

What I am doing right now is:

def create_calendar_columns(df, dt):
    # Parse the date column once and reuse it for each derived column
    dates = pd.to_datetime(df[dt])
    df['day_number'] = dates.dt.dayofyear
    df['week_number'] = dates.dt.isocalendar().week
    df['year'] = dates.dt.year
    return df

and then

def calculate_curr_year_sum(df, cols, dt):
    for col in cols:
        # intended: sum of 'col' over all rows having the same year as the current row
        df['curr_year_sum_' + str(col)] = df.loc[df['year'] == df[dt].dt.year, col].sum()

But it does not select the subset of the dataframe having the same year value. Instead, it takes the sum of the whole dataframe: it inserts the value 65, which is the sum of "num_posts", in all rows.

How can I select the subset of the dataframe whose rows have the same year as the current row?

>Solution :

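That’s because df['year']==df[dt].dt.year compares each row’s year with the year of that same row’s date, so it is True everywhere. The mask therefore selects the whole dataframe, and .sum() returns a single number that is assigned to every row. You never iterate over the different years, so you won’t see different results for different years. What I think you want to do is a groupby transformation: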

>>> df.groupby("year")["num_posts"].transform("sum")
0     50
1     50
2     50
3     50
4     15
5     15
6     50
7     50
8     50
9     50
10    50
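
To connect this back to the function in the question, here is a minimal sketch of how calculate_curr_year_sum could be rewritten around groupby/transform. The sample data is hypothetical (a shortened stand-in, not the frame from the question), and the year_col parameter and the prev_year_sum_num_posts column are names introduced here only for illustration. The last step is one possible way to look up the previous year's total for each row, assuming that a plain yearly sum is the starting point for the "previous year" features.

```python
import pandas as pd

# Hypothetical sample data, not the frame from the question
df = pd.DataFrame({
    "num_posts": [4, 6, 3, 1, 14],
    "date": pd.to_datetime(["2020-03-01", "2020-01-02", "2020-01-03",
                            "2019-01-05", "2019-01-06"]),
})
df["year"] = df["date"].dt.year

# The mask from the question compares every row with itself,
# so it is True for all rows and selects the whole dataframe.
mask = df["year"] == df["date"].dt.year


def calculate_curr_year_sum(df, cols, year_col="year"):
    """Add one 'curr_year_sum_<col>' column per entry in cols."""
    for col in cols:
        # transform("sum") computes the total per year and broadcasts it
        # back to every row belonging to that year
        df["curr_year_sum_" + str(col)] = df.groupby(year_col)[col].transform("sum")
    return df


df = calculate_curr_year_sum(df, ["num_posts"])

# Previous-year total: build a year -> sum lookup, then map (year - 1) onto it.
# Rows whose previous year is not in the data get NaN.
yearly = df.groupby("year")["num_posts"].sum()
df["prev_year_sum_num_posts"] = (df["year"] - 1).map(yearly)

print(df)
```

The same lookup idea should extend to narrower windows (for example, the last week of the previous year) by filtering the rows before computing the yearly sum, though the exact filter depends on how "last week" is defined for your data.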
