Compare Same and Different in Two Columns of Dataframe

August 30, 2022

I have a small dataframe, like this.

import pandas as pd
import numpy as np
 
# data's stored in dictionary
details = {
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att']
}
 
df = pd.DataFrame(details)
 

print(df)

I am trying to find out if, and when a person switched to a different cell phone service.

I tried this logic; didn’t work.

df['new'] = df.Column1.isin(df.Column1) & df[~df.Column2.isin(df.Column2)]

Basically, given index row 0 and row 1, when the address_id was the same, the business was the same, but the business changed from verizon to comcast in index row 2. Also, given index row 3 and row 4, the address_id was the same, but the business changed from sprint to att in index row 4. I’d like to add a new column to the dataframe to flag these changes. How can I do that?

>Solution :

UPDATE: Here is an even simpler way than my original answer using join() (see below) to do what your question asks:

df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business

Explanation:

Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
Use Series.map() to transform the original dataframe’s address_id column into this first business value
Add column new which is True only if this new business differs from the original business column.

Here is a simple way to do what you’ve asked using groupby() and join():

df = df.join(df.groupby('address_id').first(), on='address_id', rsuffix='_first')
df = df.assign(new=df.business != df.business_first).drop(columns='business_first')f

Output:

   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True

Explanation:

Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
Use join() to add a column business_first to df containing the corresponding first business for each address_id
Use assign() to add a column new containing a boolean indicating whether the row contains a new business with an existing address_id
Use drop() to eliminate the business_first column.