Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to replace string values of a Pandas column with a string except some values?

Example DataFrame:

import pandas as pd
df = pd.DataFrame({'Age' : [70.0, 58.0, 44.0, 40.0, 21.0, 35.0, 12.0, 43.0, 45.0, 65.0, 56.0, 31.0, 30.0,
                            52.0, 59.0, 52.0, 31.0, 55.0, 42.0, 73.0],
                   'MarketSegment' : ['Travel Agent/Operator', 'Other', 'Other', 'Other', 'Other',
                                      'Direct', 'Groups', 'Other', 'Other', 'Direct', 'Other',
                                      'Other', 'Other', 'Other', 'Groups', 'Groups', 'Other', 'Other',
                                      'Groups', 'Other'],
                   'Nationality' : ['CAN', 'ESP', 'FRA', 'DEU', 'GBR', 'RUS', 'IRL', 'FRA', 'IRL',
                                    'BRA', 'LTU', 'CHE', 'FRA', 'GBR', 'FRA', 'PRT', 'DEU', 'ESP',
                                    'CHE', 'USA']})

First, I only want the top 3 most common nationalities. I used the code below:

top_nat = df.groupby('Nationality').count().sort_values \
(by='Age', ascending = False).head(3).iloc[:, 0].index.to_list()

(Is there any way to do it using only the frequency of unique values in the ‘Nationalities’ column? And by not using any other column, like ‘Age’?)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Now I want all the values in ‘Nationalities’ to be replaced by ‘OTR’ except the values == those in top_nat. I tried stuff like these:

df['Nationality'].replace(~top_nat,'OTR', inplace=True)

df["Nationality"] = df["Nationality"].apply(lambda x: x.replace(~top_nat, "OTR"))

for x in top_nat:
    df.loc[df['Nationality'] != x, 'Nationality'] = 'OTR'

Nothing’s working. Maybe I want something like:

if values in df.Nationality != values in top_nat:
    replace that value in df.Nationality with 'OTR'
else:
    continue

Shape of original dataset is (82580, 30) and I need top 15 nationalities. Please help.

>Solution :

First get the top 3:

top = df['Nationality'].value_counts().nlargest(3).index

then set the nationality

df.loc[~df['Nationality'].isin(top), 'Nationality'] = 'OTR'

This leaves the top 3 nationalities as they are and replaces everything else to ‘OTR’

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading