
Create column with year name if all following years meet condition

I have the following dataset:

import pandas as pd

df = pd.DataFrame({
    'UID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Year': [2015, 2016, 2017, 2014, 2015, 2017, 2014, 2015, 2016],
    'Good?': [0, 1, 1, 0, 0, 1, 0, 1, 0]
})

For each UID, I am trying to find the first Year whose 'Good?' value is 1 and for which every following Year of that UID also has 'Good?' equal to 1. If the condition is not met, I would like to assign the value 2017.

I seem to have a problem with the indexing, as it throws a 'KeyError: #####'. I guess there are cases where a UID has only one Year value and that is what raises the error. This is what I have so far:

# Group the DataFrame by UID
groups = df.groupby('UID')

# Initialize an empty list to store the results
results = []

# Loop over each UID group
for uid, group in groups:
    # Find the first index with a Good value of 1
    first_good_index = group[group['Good?'] == 1].index[0]
    print(first_good_index)
    
    # Check if all following years have a Good value of 1
    if (group.loc[first_good_index+1:, 'Good?'] == 1).all():
        # If so, append the UID and the year of the first good row to the results list
        results.append((uid, group.loc[first_good_index, 'Year']))
    else:
        results.append((uid, 2017))

    
# Create a DataFrame from the results
results_df = pd.DataFrame(results, columns=['UID', 'First Good Year'])

# Print the results
print(results_df)

These are the expected results:

results_df = pd.DataFrame({
    'UID': [1, 2, 3],
    'First Good Year': [2016, 2017, 2017],
})

results_df

Solution:

Use:

# flag rows where 'Good?' is 1
m = df['Good?'].eq(1)

# flag rows where a 0 appears after the first 1 within the same UID
mask = m.groupby(df['UID']).cummax() & ~m

# keep the 'Good?' == 1 rows of UIDs that have no 0 after their first 1
df1 = df[~df['UID'].isin(df.loc[mask, 'UID']) & m]
print(df1)
   UID  Year  Good?
1    1  2016      1
2    1  2017      1
5    2  2017      1
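
To see why UID 3 drops out, it can help to look at the intermediate flags row by row. The sketch below rebuilds the sample data and stores each step in a helper column. The names seen_one, zero_after_one, steps and bad_uids are made up for illustration, and the .astype(bool) call is only a defensive assumption in case the grouped cummax does not return a boolean dtype in a given pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'UID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Year': [2015, 2016, 2017, 2014, 2015, 2017, 2014, 2015, 2016],
    'Good?': [0, 1, 1, 0, 0, 1, 0, 1, 0],
})

# True where 'Good?' is 1
m = df['Good?'].eq(1)

# True from the first 1 onward within each UID (running maximum per group)
seen_one = m.groupby(df['UID']).cummax().astype(bool)

# True only where a 0 shows up after an earlier 1 in the same UID
mask = seen_one & ~m

# Put the steps side by side for inspection
steps = df.assign(is_one=m, seen_one=seen_one, zero_after_one=mask)
print(steps)

# UIDs that break the "all following years are good" condition
bad_uids = df.loc[mask, 'UID'].unique().tolist()
print(bad_uids)
```

Only the last row of UID 3 is flagged, so UID 3 is the only one excluded before the first good year is picked.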

# take the first remaining row per UID; UIDs that were filtered out get 2017
out = (df1.drop_duplicates('UID')
          .set_index('UID')['Year']
          .reindex(df['UID'].unique(), fill_value=2017)
          .reset_index())
print(out)
   UID  Year
0    1  2016
1    2  2017
2    3  2017
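
If this is needed in more than one place, the same steps could be wrapped in a small helper, roughly as sketched below. The function name first_good_year and its default parameter are hypothetical. Two things here are assumptions that go beyond the answer: the rows are sorted by UID and Year first, so that "following" means later in time, and the result column is renamed to 'First Good Year' to match the expected output in the question.

```python
import pandas as pd


def first_good_year(df, default=2017):
    """Return, per UID, the first Year from which every 'Good?' value is 1.

    UIDs that have a 0 after their first 1, or no 1 at all, get `default`.
    """
    # Assumption: "following years" means chronological order within a UID
    df = df.sort_values(['UID', 'Year'])

    m = df['Good?'].eq(1)
    # Rows where a 0 appears after the first 1 of the same UID
    mask = m.groupby(df['UID']).cummax().astype(bool) & ~m

    # Good rows of UIDs that never fall back to 0
    ok = df[~df['UID'].isin(df.loc[mask, 'UID']) & m]

    return (ok.drop_duplicates('UID')
              .set_index('UID')['Year']
              .reindex(df['UID'].unique(), fill_value=default)
              .rename('First Good Year')
              .reset_index())


df = pd.DataFrame({
    'UID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Year': [2015, 2016, 2017, 2014, 2015, 2017, 2014, 2015, 2016],
    'Good?': [0, 1, 1, 0, 0, 1, 0, 1, 0],
})

results_df = first_good_year(df)
print(results_df)
```

Note that UID 2 gets 2017 because its first good year really is 2017, while UID 3 gets 2017 as the fallback, so the two cases cannot be told apart in the output unless a different default is passed.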