Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to fill in missing values in a panel data set

I have a data frame where each row summarises the monthly activity of forum members, including the number of posts, the number of replies, the number of hateful posts and the number of hateful replies they receive. It looks like this:

post_month user_name thread_posts hate_posts replies hateful_replies
2018-01 !ronman 11 4 19 6
2018-02 !ronman 43 11 138 39
2018-04 !ronman 51 16 169 57
2020-06 $toneman 111 37 1 1
2020-07 $toneman 141 33 23 9
2020-09 $toneman 65 14 0 0
2020-10 $toneman 13 0 4 0

However, notice in the post_month column that sometimes the rows skip a month if the forum member was inactive for that period. Hence, I wish to populate these absent months with rows that contain the relevant month in the post_month column and populate 0’s in every other column (apart from username).

Desired output is something like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

post_month user_name thread_posts hate_posts replies hateful_replies
2018-01 !ronman 11 4 19 6
2018-02 !ronman 43 11 138 39
2018-03 !ronman 0 0 0 0
2018-04 !ronman 51 16 169 57
2020-06 $toneman 111 37 1 1
2020-07 $toneman 141 33 23 9
2020-08 $toneman 0 0 0 0
2020-09 $toneman 65 14 0 0
2020-10 $toneman 13 0 4 0

What would be the best way to approach this?

>Solution :

I would convert the post_month to datetime, then set dataframe index by it and use .asfreq() to fill the missing parts:

df["post_month"] = pd.to_datetime(df["post_month"])
df = df.set_index("post_month")

df = df.groupby("user_name", group_keys=False).apply(
    lambda x: (x := x.asfreq("MS")).assign(user_name=x["user_name"].ffill()).fillna(0)
).reset_index()
df['post_month'] = df['post_month'].dt.strftime('%Y-%m')

print(df)

Prints:

  post_month user_name  thread_posts  hate_posts  replies  hateful_replies
0    2018-01   !ronman          11.0         4.0     19.0              6.0
1    2018-02   !ronman          43.0        11.0    138.0             39.0
2    2018-03   !ronman           0.0         0.0      0.0              0.0
3    2018-04   !ronman          51.0        16.0    169.0             57.0
4    2020-06  $toneman         111.0        37.0      1.0              1.0
5    2020-07  $toneman         141.0        33.0     23.0              9.0
6    2020-08  $toneman           0.0         0.0      0.0              0.0
7    2020-09  $toneman          65.0        14.0      0.0              0.0
8    2020-10  $toneman          13.0         0.0      4.0              0.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading