Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

str.extract() with regex

I need to split a column into two using regex and str.extract() (assuming this is best)

    df = pd.DataFrame({
                        'Product': ['Truly Mix 2/12Pk Cans - 12Z',
                                    'Bud 16Z - LOOSE -  16Z',
                                    'Blue Moon (Case 12x - 22Z)',
                                    '2 for the show (6/4PK - 16Z)']
             })

I would like this result:

df_result = pd.DataFrame({
                          'Product': ['Truly Mix', 'Bud', 'Blue Moon', '2 for the show'],
                          'Packaging': ['2/12Pk Cans - 12Z',
                                        '16Z - LOOSE -  16Z',
                                        'Case 12x - 22Z',
                                        '6/4PK - 16Z' ]
                 })

I tried a lot of things, but still struggle with regex, even after lots of online learning.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Here is my final attempt at getting the product:

pattern = r'(\D+)[^\w][^(Case][^0-9]'

df['Product'] = df['Product'].str.extract(pattern)

str.replace() should work fine for getting rid of the parenthesis, just can’t get that far.

I’m just not even close after 3 hours.

>Solution :

I’m assuming your product name boundary ends with either a number or a parenthesis. In that case, you can do the following to get the product names:

pattern = r'([^0-9(]+).*'
df['Product Name'] = df['Product'].str.extract(pattern)
df['Product Name'] = df['Product Name'].str.strip()   # Remove spurious paces

# The packaging is the complementary pattern:
pattern = r'[^0-9(]+(.*)'
df['Packaging'] = df['Product'].str.extract(pattern)
df['Packaging'] = df['Packaging'].str.strip(' ()')

You can merge the two operations in one, if you want:

pattern = r'([^0-9(]+)(.*)'
df[['Product', 'Packaging']] = df['Product'].str.extract(pattern)

df['Product'] = df['Product'].str.strip()
df['Packaging'] = df['Packaging'].str.strip(' ()')
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading