I have string data that I am matching to a list of terms and I want to create a new column that shows all words found in each row of the dataframe
df = pd.DataFrame({'String': ['Cat Dog Fish', 'Cat Dog', 'Pig Horse', 'DogFish']})
print(df)
String
0 Cat Dog Fish
1 Cat Dog
2 Pig Horse
3 DogFish
4 CatHorse
words = ['Cat', 'Dog', 'Fish']
I know using df.loc[df['String'].str.contains('|'.join(words))].copy() will return all rows that contain at least one of the terms I’m searching for but I would like output that keeps a record of which terms are found in each string like this:
String Matched
0 Cat Dog Fish [Cat, Dog, Fish]
1 Cat Dog [Cat, Dog]
2 DogFish [Dog, Fish]
3 CatHorse [Cat]
or even just
String Matched
0 Cat Dog Fish CatDogFish
1 Cat Dog CatDog
2 DogFish DogFish
3 CatHorse Cat
Not sure where to begin with making this column, any help is appreciated
>Solution :
Try:
words = ["Cat", "Dog", "Fish"]
df["Matched"] = df["String"].apply(lambda s: [w for w in words if w in s])
print(df)
Prints:
String Matched
0 Cat Dog Fish [Cat, Dog, Fish]
1 Cat Dog [Cat, Dog]
2 Pig Horse []
3 DogFish [Dog, Fish]