I have pandas dataframe as follows,
loc_1 loc_2
[mumbai, gujarat, sri lanka] [chennai, UP]
[Goa, telangana] [Kashmir, Goa, Rajkot]
NaN [Bihar, Orissa]
I want to create a new column that is a combination of both the above columns,
I did search other similar questions but the issue I am facing is that,
When I do,
data['locations'] = data['loc_1'] + data['loc_2']
Output
--------
loc_1 loc_2 locations
[mumbai, gujarat, sri lanka] [chennai, UP] [mumbai, gujarat, sri lanka,chennai, UP]
[Goa, telangana] [Kashmir, Goa, Rajkot] [Goa, telangana,Kashmir, Goa, Rajkot]
NaN [Bihar, Orissa] NaN
Issue
As you can see above, there are duplicate values as well as NaN values formed.
How to avoid them?
Remember
The original dataset contains values in list, str and NaN format.
Dataset:
loc = pd.DataFrame({
'loc_1': [['mumbai', 'gujarat', 'sri lanka'],['Goa', 'telangana'],np.nan],
'loc_2':[['chennai','UP'],['kashmir','goa','rajkot'],['bihar','orissa']],
'loc_3':['Chennai','Bangalore','Vizag']
})
>Solution :
First join values with replace NaNs (floats) to empty lists:
data['locations'] = data['loc_1'].apply(lambda x: [] if isinstance(x, float) else x) + data['loc_2']
And then remove duplicates with same order like original by converting to dictionaries by dict.fromkeys:
data['locations'] = data['locations'].apply(lambda x: list(dict.fromkeys(x)))
If the order is not important you can use a set:
data['locations'] = data['locations'].apply(lambda x: list(set(x)))