Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compare Same and Different in Two Columns of Dataframe

I have a small dataframe, like this.

import pandas as pd
import numpy as np
 
# data's stored in dictionary
details = {
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att']
}
 
df = pd.DataFrame(details)
 

print(df)

enter image description here

I am trying to find out if, and when a person switched to a different cell phone service.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I tried this logic; didn’t work.

df['new'] = df.Column1.isin(df.Column1) & df[~df.Column2.isin(df.Column2)] 

Basically, given index row 0 and row 1, when the address_id was the same, the business was the same, but the business changed from verizon to comcast in index row 2. Also, given index row 3 and row 4, the address_id was the same, but the business changed from sprint to att in index row 4. I’d like to add a new column to the dataframe to flag these changes. How can I do that?

>Solution :

UPDATE: Here is an even simpler way than my original answer using join() (see below) to do what your question asks:

df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business

Explanation:

  • Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
  • Use Series.map() to transform the original dataframe’s address_id column into this first business value
  • Add column new which is True only if this new business differs from the original business column.

Here is a simple way to do what you’ve asked using groupby() and join():

df = df.join(df.groupby('address_id').first(), on='address_id', rsuffix='_first')
df = df.assign(new=df.business != df.business_first).drop(columns='business_first')f

Output:

   address_id business    new
0           1  verizon  False
1           1  verizon  False
2           1  comcast   True
3           2   sprint  False
4           2      att   True

Explanation:

  • Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id
  • Use join() to add a column business_first to df containing the corresponding first business for each address_id
  • Use assign() to add a column new containing a boolean indicating whether the row contains a new business with an existing address_id
  • Use drop() to eliminate the business_first column.
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading