Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Replace a value in a column in a Pandas dataframe if another column contains a certain string

I have a very long and complicated Pandas dataframe in Python that looks something like this:

df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2']
df_test.comment = ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3']
df_test.number = ['12', '15', '13', '16', '14', '14']

The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.

What I want to do is to check the ‘comment’ column for each file, and if it starts with the string ‘Replacing: ‘, and the value in the ‘number’ column is not the same as the ‘number’ column for the dataset found in the string after ‘Replacing: ‘, the number should be put to be the same as the original file.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

In this example, it means that the ‘number’ column should be changed to read:

[’12’, ’12’, ’13’, ’13’, ’14’, ’14’]

There are also some exception in the dataframe such as other comments which includes a colon or nan-values which must be considered as well. I can extract the files that should have the number replaced with the line below, but I’m not sure where to go from there. Any help is appreciated, thanks!

df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]

>Solution :

If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file to make it easy for you to debug the old/new files.

df_test['new_file'] = df_test.loc[df_test.comment.str.startswith('Replacing: '), 
                                  'comment']\
                             str.removeprefix('Replacing: ')
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading