Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Join 2 columns with list values in pandas(Avoiding duplicates and NaN)

I have pandas dataframe as follows,

loc_1                               loc_2                             

[mumbai, gujarat, sri lanka]        [chennai, UP]
[Goa, telangana]                    [Kashmir, Goa, Rajkot]
NaN                                 [Bihar, Orissa]

I want to create a new column that is a combination of both the above columns,
I did search other similar questions but the issue I am facing is that,

When I do,

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data['locations'] = data['loc_1'] + data['loc_2']

Output
--------
loc_1                               loc_2                       locations

[mumbai, gujarat, sri lanka]        [chennai, UP]                [mumbai, gujarat, sri lanka,chennai, UP]   
[Goa, telangana]                    [Kashmir, Goa, Rajkot]       [Goa, telangana,Kashmir, Goa, Rajkot]
NaN                                 [Bihar, Orissa]              NaN

Issue

As you can see above, there are duplicate values as well as NaN values formed.
How to avoid them?

Remember

The original dataset contains values in list, str and NaN format.

Dataset:


loc = pd.DataFrame({
'loc_1': [['mumbai', 'gujarat', 'sri lanka'],['Goa', 'telangana'],np.nan],
'loc_2':[['chennai','UP'],['kashmir','goa','rajkot'],['bihar','orissa']],
'loc_3':['Chennai','Bangalore','Vizag']

})

>Solution :

First join values with replace NaNs (floats) to empty lists:

data['locations'] = data['loc_1'].apply(lambda x: [] if isinstance(x, float) else x) + data['loc_2']

And then remove duplicates with same order like original by converting to dictionaries by dict.fromkeys:

data['locations'] = data['locations'].apply(lambda x: list(dict.fromkeys(x)))

If the order is not important you can use a set:

data['locations'] = data['locations'].apply(lambda x: list(set(x)))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading