I need to split a column into two using regex and str.extract() (assuming this is best)
df = pd.DataFrame({
'Product': ['Truly Mix 2/12Pk Cans - 12Z',
'Bud 16Z - LOOSE - 16Z',
'Blue Moon (Case 12x - 22Z)',
'2 for the show (6/4PK - 16Z)']
})
I would like this result:
df_result = pd.DataFrame({
'Product': ['Truly Mix', 'Bud', 'Blue Moon', '2 for the show'],
'Packaging': ['2/12Pk Cans - 12Z',
'16Z - LOOSE - 16Z',
'Case 12x - 22Z',
'6/4PK - 16Z' ]
})
I tried a lot of things, but still struggle with regex, even after lots of online learning.
Here is my final attempt at getting the product:
pattern = r'(\D+)[^\w][^(Case][^0-9]'
df['Product'] = df['Product'].str.extract(pattern)
str.replace() should work fine for getting rid of the parenthesis, just can’t get that far.
I’m just not even close after 3 hours.
>Solution :
I’m assuming your product name boundary ends with either a number or a parenthesis. In that case, you can do the following to get the product names:
pattern = r'([^0-9(]+).*'
df['Product Name'] = df['Product'].str.extract(pattern)
df['Product Name'] = df['Product Name'].str.strip() # Remove spurious paces
# The packaging is the complementary pattern:
pattern = r'[^0-9(]+(.*)'
df['Packaging'] = df['Product'].str.extract(pattern)
df['Packaging'] = df['Packaging'].str.strip(' ()')
You can merge the two operations in one, if you want:
pattern = r'([^0-9(]+)(.*)'
df[['Product', 'Packaging']] = df['Product'].str.extract(pattern)
df['Product'] = df['Product'].str.strip()
df['Packaging'] = df['Packaging'].str.strip(' ()')