Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas – add value to series indicating the first time each value in one series appears in each year

I have a dataframe that looks like the following:

Date/time File
2021-01-01 09:00:00 File A
2021-02-01 09:00:00 File A
2022-01-01 09:00:00 File A
2021-01-01 09:00:00 File B
2022-01-01 09:00:00 File B
2022-01-01 09:00:00 File C
2022-02-01 09:00:00 File C
2022-02-01 10:00:00 File C

What I am trying to do, is add a column for ‘First occurrence’ where each File is marked with a 1 for the first time it appears in each year

So the desired output would be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Date/time File First occurrence
2021-01-01 09:00:00 File A 1
2021-02-01 09:00:00 File A
2022-01-01 09:00:00 File A 1
2021-01-01 09:00:00 File B 1
2022-01-01 09:00:00 File B 1
2022-01-01 09:00:00 File C 1
2022-02-01 09:00:00 File C
2022-02-01 10:00:00 File C

What I’ve tried so far which works:

df= df.sort_values(["File", "Date/time"])
df.reset_index(drop=True, inplace=True)
df["First occurrence"] = df.groupby(["File"]).cumcount() 
df["First occurrence"] = np.where(df["First scan"] == 0, 1, 0)

This works at the first occurrence for each file, but doesn’t work for each year. Building on that, I tried groupby methods and couldn’t get it working so I went with a brute force approach which is:

# Get list of years
year_list = df["Date/time"].dt.year.unique().tolist()

year_frames = []
for year in year_list:
    yr_df = df[df["Date/time"].dt.year==year]
    yr_df = yr_df.sort_values(["File", "Date/time"])
    yr_df = yr_df.reset_index(drop=True)

    yr_df["First scan"] = yr_df.groupby(["File"]).cumcount() 
    yr_df["First scan"] = np.where(yr_df["First scan"] == 0, 1, 0)
    
    year_frames.append(yr_df) 

final_df = pd.concat(year_frames, axis=0, ignore_index=True)

This works for each year df but the concat overwrites values of 1 and 0 as the two frames are joined.

I’d like to get better with Pandas so would rather not brute force it in this way if there’s a suitable way to do it but so far hours worth of Googling hasn’t give me much to go on!

>Solution :

You can create a year column and use DataFrame.duplicated to check the duplicated item in year and File columns

df['Date/time'] = pd.to_datetime(df['Date/time'])
df['year'] = df['Date/time'].dt.year
df['First occurrence'] = (~df.duplicated(['year', 'File'])).astype(int)
print(df)

            Date/time    File  year  First occurrence
0 2021-01-01 09:00:00  File A  2021                 1
1 2021-02-01 09:00:00  File A  2021                 0
2 2022-01-01 09:00:00  File A  2022                 1
3 2021-01-01 09:00:00  File B  2021                 1
4 2022-01-01 09:00:00  File B  2022                 1
5 2022-01-01 09:00:00  File C  2022                 1
6 2022-02-01 09:00:00  File C  2022                 0
7 2022-02-01 10:00:00  File C  2022                 0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading