I’m writing a Pandas script to perform data manipulation on an excel file. First, I load two sheets into dataframes. One is the original data df, the second is a sheet detailing replacements that need to be made in the original data replace.
The script needs to do two things for each row of df.
-
Replace each instance of
'Name'indfwith'NameReplace'(working) -
For the same rows in
df, replace a slice of the columns (specified by a list) with the values in the same slice of columns inreplace
Reproducible Minimal Example of my current implementation:
import pandas
df = pandas.DataFrame([["John", None, None],["Phil", None, None],["John", None, None],["Bob", None, None]], columns=["Name", "Age", "Height"])
replace = pandas.DataFrame([["John", "Dom", 25, 175],["Phil", "Kevin", 56, 145],["Bob", "Michael", 33, 180]], columns=["Name", "NameReplace", "Age", "Height"])
detailsList = ["Age", "Height"]
for i, row in replace.iterrows():
df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList]
print(df)
Step 1) is working with this implementation, but the detailsList columns in df do not get populated.
The current output is
Name Age Height
0 Dom NaN NaN
1 Kevin NaN NaN
2 Dom NaN NaN
3 Michael NaN NaN
The desired output is
Name Age Height
0 Dom 25 175
1 Kevin 56 145
2 Dom 25 175
3 Michael 33 180
I’ve been trying for a while now, and cannot seem to make progress. I also don’t really get why this doesn’t work, so any insight there would be extra appreciated!
Note: Using detailsList to specify the slice of columns is necessary, as in the real solution I am only operating on a specific slice of the full dataframe, unlike the example I’ve given.
>Solution :
The problem is the way that pandas tries to assign a series to a whole dataframe. Anyway, here’s a simple fix that leads to the intended behavior, taking advantage that pandas does the correct thing when you assign with a numpy array rather than with a series.
for i, row in replace.iterrows():
df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList].values
Other optimizations:
- Note that you can reuse the
df['Name'] == row['Name']mask. In particular, you save some work with
for i, row in replace.iterrows():
mask = df['Name'] == row['Name']
df.loc[mask, 'Name'] = row['NameReplace']
df.loc[mask, detailsList] = row[detailsList].values
- You can avoid iterrows if you use a merge
df = (df[['Name']].merge(replace, on = 'Name')
.drop(columns='Name')
.rename(columns={'NameReplace':'Name'}))
The catch with this approach is that the rows might end up reordered.