Home Replace a value in a column in a Pandas dataframe if another column contains a certain string

Questions

Replace a value in a column in a Pandas dataframe if another column contains a certain string

May 10, 2023

I have a very long and complicated Pandas dataframe in Python that looks something like this:

df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2']
df_test.comment = ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3']
df_test.number = ['12', '15', '13', '16', '14', '14']

The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.

What I want to do is to check the ‘comment’ column for each file, and if it starts with the string ‘Replacing: ‘, and the value in the ‘number’ column is not the same as the ‘number’ column for the dataset found in the string after ‘Replacing: ‘, the number should be put to be the same as the original file.

In this example, it means that the ‘number’ column should be changed to read:

[’12’, ’12’, ’13’, ’13’, ’14’, ’14’]

There are also some exception in the dataframe such as other comments which includes a colon or nan-values which must be considered as well. I can extract the files that should have the number replaced with the line below, but I’m not sure where to go from there. Any help is appreciated, thanks!

df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]

>Solution :

If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file to make it easy for you to debug the old/new files.

df_test['new_file'] = df_test.loc[df_test.comment.str.startswith('Replacing: '), 
                                  'comment']\
                             str.removeprefix('Replacing: ')
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14