I have a data frame where each row summarises the monthly activity of forum members, including the number of posts, the number of replies, the number of hateful posts and the number of hateful replies they receive. It looks like this:
| post_month | user_name | thread_posts | hate_posts | replies | hateful_replies |
|---|---|---|---|---|---|
| 2018-01 | !ronman | 11 | 4 | 19 | 6 |
| 2018-02 | !ronman | 43 | 11 | 138 | 39 |
| 2018-04 | !ronman | 51 | 16 | 169 | 57 |
| 2020-06 | $toneman | 111 | 37 | 1 | 1 |
| 2020-07 | $toneman | 141 | 33 | 23 | 9 |
| 2020-09 | $toneman | 65 | 14 | 0 | 0 |
| 2020-10 | $toneman | 13 | 0 | 4 | 0 |
However, notice in the post_month column that sometimes the rows skip a month if the forum member was inactive for that period. Hence, I wish to populate these absent months with rows that contain the relevant month in the post_month column and populate 0’s in every other column (apart from username).
Desired output is something like this:
| post_month | user_name | thread_posts | hate_posts | replies | hateful_replies |
|---|---|---|---|---|---|
| 2018-01 | !ronman | 11 | 4 | 19 | 6 |
| 2018-02 | !ronman | 43 | 11 | 138 | 39 |
| 2018-03 | !ronman | 0 | 0 | 0 | 0 |
| 2018-04 | !ronman | 51 | 16 | 169 | 57 |
| 2020-06 | $toneman | 111 | 37 | 1 | 1 |
| 2020-07 | $toneman | 141 | 33 | 23 | 9 |
| 2020-08 | $toneman | 0 | 0 | 0 | 0 |
| 2020-09 | $toneman | 65 | 14 | 0 | 0 |
| 2020-10 | $toneman | 13 | 0 | 4 | 0 |
What would be the best way to approach this?
>Solution :
I would convert the post_month to datetime, then set dataframe index by it and use .asfreq() to fill the missing parts:
df["post_month"] = pd.to_datetime(df["post_month"])
df = df.set_index("post_month")
df = df.groupby("user_name", group_keys=False).apply(
lambda x: (x := x.asfreq("MS")).assign(user_name=x["user_name"].ffill()).fillna(0)
).reset_index()
df['post_month'] = df['post_month'].dt.strftime('%Y-%m')
print(df)
Prints:
post_month user_name thread_posts hate_posts replies hateful_replies
0 2018-01 !ronman 11.0 4.0 19.0 6.0
1 2018-02 !ronman 43.0 11.0 138.0 39.0
2 2018-03 !ronman 0.0 0.0 0.0 0.0
3 2018-04 !ronman 51.0 16.0 169.0 57.0
4 2020-06 $toneman 111.0 37.0 1.0 1.0
5 2020-07 $toneman 141.0 33.0 23.0 9.0
6 2020-08 $toneman 0.0 0.0 0.0 0.0
7 2020-09 $toneman 65.0 14.0 0.0 0.0
8 2020-10 $toneman 13.0 0.0 4.0 0.0