I have a dataframe with many columns representing football match stats.
Since I’ve scraped the data, I have two rows per match and I’m trying to remove duplicates.
I tried executing:
df.drop(all_matches[df['HomeTeam'] != df['team']])
to remove the rows where the value in the HomeTeam column doesn’t match the value in the team column.
I got the following error:
KeyError: "['date' 'time' 'comp' 'round' 'day' 'venue' 'result'
'HomeTeam' 'AwayTeam'\n 'gf' 'ga' 'opponent' 'poss' 'attendance'
'captain' 'formation' 'referee'\n 'match report' 'notes' 'sh' 'sot'
'pk' 'pkatt' 'venue_vs' 'result_vs'\n 'team_shots_vs'
'team_shots_ot_vs' 'pk_vs' 'pkatt_vs' 'season' 'team'] not found in
axis"
I don’t understand why it isn’t working, or why some column names are followed by a newline character after the quotes.
>Solution :
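To keep one row per match, keep only the rows where 'HomeTeam' equals 'team'. The most direct way is boolean indexing; the drop_duplicates method in pandas is an alternative if a set of columns identifies each match.
df.drop(all_matches[df['HomeTeam'] != df['team']]) is not suitable for this purpose because drop expects row labels (or column labels with axis=1), not a DataFrame. Iterating over a DataFrame yields its column names, so pandas looks for rows labelled 'date', 'time', and so on, finds none, and raises the KeyError. The \n in the message is only a line break from the way the long array of labels is printed, shown escaped because KeyError displays its message as a quoted string; the column names themselves do not contain newlines.
Instead, you can use:
df_cleaned = df[df['HomeTeam'] == df['team']]
If you prefer drop_duplicates, choose columns that identify a match, assuming both rows of a match share the same values in them. Note that it keeps the first row of each pair, which is not necessarily the home team's row:
df_cleaned = df.drop_duplicates(subset=['date', 'HomeTeam', 'AwayTeam'])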
For more information on the drop_duplicates method, you can refer to the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
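As a minimal sketch of the row-filtering approach, here is a small example on made-up data. The team names and dates are invented for illustration, and only four of the columns from the question are included. It shows two equivalent ways to keep the home team's row for each match, and why passing a DataFrame to drop ends up looking for column names among the row labels.

```python
import pandas as pd

# Toy data (invented for illustration): each match appears twice,
# once from each team's point of view.
df = pd.DataFrame({
    "date": ["2023-08-12", "2023-08-12", "2023-08-13", "2023-08-13"],
    "HomeTeam": ["Arsenal", "Arsenal", "Chelsea", "Chelsea"],
    "AwayTeam": ["Everton", "Everton", "Fulham", "Fulham"],
    "team": ["Arsenal", "Everton", "Fulham", "Chelsea"],
})

# Iterating over a DataFrame yields its column names, which is what
# df.drop(...) receives when it is handed a DataFrame.
labels_seen_by_drop = list(df)

# Option 1: keep rows where the row's team is the home team.
home_rows = df[df["HomeTeam"] == df["team"]].reset_index(drop=True)

# Option 2: the same result with drop, passing the index labels
# of the unwanted rows rather than the rows themselves.
unwanted = df[df["HomeTeam"] != df["team"]].index
dropped = df.drop(unwanted).reset_index(drop=True)

print(home_rows)
```

Both options leave one row per match. The boolean mask is usually the shorter choice, while drop with .index is closer to the original attempt.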