I have a data frame where each row summarises the monthly activity of forum members, including the number of posts, the number of replies, the number of hateful posts and the number of hateful replies they receive.
I wish to create a lagged column (n-1) for the number of hateful replies that each user receives. My data frame looks something like this:
| post_month | user_name | thread_posts | hate_posts | replies | hateful_replies |
|---|---|---|---|---|---|
| 2018-01 | !ronman | 11 | 4 | 19 | 6 |
| 2021-06 | $$$ | 43 | 11 | 138 | 39 |
| 2021-07 | $$$ | 51 | 16 | 169 | 57 |
| 2020-06 | $toneman | 111 | 37 | 1 | 1 |
| 2020-07 | $toneman | 141 | 33 | 23 | 9 |
| 2020-08 | $toneman | 65 | 14 | 0 | 0 |
| 2021-08 | Farnham | 13 | 0 | 4 | 0 |
However, doing:
df['hateful_replies-1']=df['hateful_replies'].shift(1)
Does not result in the desired outcome as it doesn’t consider user_name. I wish to shift rows conditional on user_name so that for each username, each month, the previous months hateful_replies value is added to the new column. In instances where there is only one month (one row) per username I would populate the lagged column as a zero, and same for the first month that each username appears since there is no previous month’s value to lag.
Hoping to achieve something like this:
| post_month | user_name | thread_posts | hate_posts | replies | hateful_replies | lagged_hateful_replies |
|---|---|---|---|---|---|---|
| 2018-01 | !ronman | 11 | 4 | 19 | 6 | 0 |
| 2021-06 | $$$ | 43 | 11 | 138 | 39 | 0 |
| 2021-07 | $$$ | 51 | 16 | 169 | 57 | 39 |
| 2020-06 | $toneman | 111 | 37 | 1 | 1 | 0 |
| 2020-07 | $toneman | 141 | 33 | 23 | 9 | 1 |
| 2020-08 | $toneman | 65 | 14 | 0 | 0 | 9 |
| 2021-08 | Farnham | 13 | 0 | 4 | 0 | 0 |
>Solution :
Using pd.shift will work fine in your case. Remember to group by user_name and set fill_value to 0.
df['lagged_hateful_replies'] = df.groupby('user_name')['hateful_replies'].shift(1, fill_value=0)