Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas groupby with capture groups from extractall

I am working with Pandas extract() method to feed the capture group into groupby. A small example of the result of this process is something like the one illustrated below:

import pandas as pd
import re
from io import StringIO

DATA = StringIO('''
colA;colB;colC
Foo;1;1
Bar;2;2
Foo,Bar;3;3
''')
df = pd.read_csv(DATA, sep=';')

m1 = df['colA'].str.extract('(Bar|Foo)', flags=re.IGNORECASE, expand=False)

for t, _d in df.groupby(m1):
    t
    _d

# 'Bar'
#   colA  colB  colC
# 1  Bar     2     2
#
# 'Foo'
#       colA  colB  colC
# 0      Foo     1     1
# 2  Foo,Bar     3     3

However, the row with index 2 (third row) is only captured in the Foo group, whereas I wanted to capture it in both Foo and Bar.

Playing around with the extractall() method seems to capture all matched groups, but apparently cannot be used together with groupby() because pandas complaints about the grouper not being 1-dimensional: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

m2 = df['colA'].str.extractall('(Bar|Foo)', flags=re.IGNORECASE)

#            0
#   match
# 0 0      Foo
# 1 0      Bar
# 2 0      Foo
#   1      Bar

The desired output for groupby() would be something like the following:

for t, _d in df.groupby(somematch):
    t
    _d

# 'Bar'
#       colA  colB  colC
# 1      Bar     2     2
# 2  Foo,Bar     3     3
#
# 'Foo'
#       colA  colB  colC
# 0      Foo     1     1
# 2  Foo,Bar     3     3

Any suggestions are welcome.

>Solution :

groupby is designed so that each row belongs to exactly one group, which is not true in your use-case. You’d need to extract the rows manually. One way to do so:

words = ['Foo','Bar']
for word in words:
    print(df.loc[df['colA'].str.contains(word)])

Output:

      colA  colB  colC
0      Foo     1     1
2  Foo,Bar     3     3
      colA  colB  colC
1      Bar     2     2
2  Foo,Bar     3     3

Or, if your data is given as in the example, you can use get_dummies to split the keywords within each cell:

keys = df['colA'].str.lower().str.get_dummies(',')
for key,data in keys.items():
    print(df.loc[data.eq(1)])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading