I have a very long and complicated Pandas dataframe in Python that looks something like this:
df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2']
df_test.comment = ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3']
df_test.number = ['12', '15', '13', '16', '14', '14']
The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.
What I want to do is to check the ‘comment’ column for each file, and if it starts with the string ‘Replacing: ‘, and the value in the ‘number’ column is not the same as the ‘number’ column for the dataset found in the string after ‘Replacing: ‘, the number should be put to be the same as the original file.
In this example, it means that the ‘number’ column should be changed to read:
[’12’, ’12’, ’13’, ’13’, ’14’, ’14’]
There are also some exception in the dataframe such as other comments which includes a colon or nan-values which must be considered as well. I can extract the files that should have the number replaced with the line below, but I’m not sure where to go from there. Any help is appreciated, thanks!
df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]
>Solution :
If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file
to make it easy for you to debug the old/new files.
df_test['new_file'] = df_test.loc[df_test.comment.str.startswith('Replacing: '),
'comment']\
str.removeprefix('Replacing: ')
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')
file comment number
0 file_1 none: 5 12
1 file_1_v2 Replacing: file_1 12
2 file_2 old 13
3 file_2_v2 Replacing: file_2 13
4 file_3 14
5 file_3_v2 Replacing: file_3 14