I am working with Pandas extract() method to feed the capture group into groupby. A small example of the result of this process is something like the one illustrated below:
import pandas as pd
import re
from io import StringIO
DATA = StringIO('''
colA;colB;colC
Foo;1;1
Bar;2;2
Foo,Bar;3;3
''')
df = pd.read_csv(DATA, sep=';')
m1 = df['colA'].str.extract('(Bar|Foo)', flags=re.IGNORECASE, expand=False)
for t, _d in df.groupby(m1):
t
_d
# 'Bar'
# colA colB colC
# 1 Bar 2 2
#
# 'Foo'
# colA colB colC
# 0 Foo 1 1
# 2 Foo,Bar 3 3
However, the row with index 2 (third row) is only captured in the Foo group, whereas I wanted to capture it in both Foo and Bar.
Playing around with the extractall() method seems to capture all matched groups, but apparently cannot be used together with groupby() because pandas complaints about the grouper not being 1-dimensional: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
m2 = df['colA'].str.extractall('(Bar|Foo)', flags=re.IGNORECASE)
# 0
# match
# 0 0 Foo
# 1 0 Bar
# 2 0 Foo
# 1 Bar
The desired output for groupby() would be something like the following:
for t, _d in df.groupby(somematch):
t
_d
# 'Bar'
# colA colB colC
# 1 Bar 2 2
# 2 Foo,Bar 3 3
#
# 'Foo'
# colA colB colC
# 0 Foo 1 1
# 2 Foo,Bar 3 3
Any suggestions are welcome.
>Solution :
groupby is designed so that each row belongs to exactly one group, which is not true in your use-case. You’d need to extract the rows manually. One way to do so:
words = ['Foo','Bar']
for word in words:
print(df.loc[df['colA'].str.contains(word)])
Output:
colA colB colC
0 Foo 1 1
2 Foo,Bar 3 3
colA colB colC
1 Bar 2 2
2 Foo,Bar 3 3
Or, if your data is given as in the example, you can use get_dummies to split the keywords within each cell:
keys = df['colA'].str.lower().str.get_dummies(',')
for key,data in keys.items():
print(df.loc[data.eq(1)])