I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
| Movie Titles/Category/Rating |
|---|
| Wolf of Wall Street A-13 x 9 |
| Django Unchained IMDB x 8 |
| The EXPL Haunted House FEAR x 7 |
| Silver Lining DC-23 x 8 |
This is what I want the results to look like:
| Title | Category | Rating |
|---|---|---|
| Wolf of Wall Street | A-13 | 9 |
| Django Unchained | IMDB | 8 |
| The EXPL Haunted House | FEAR | 7 |
| Silver Lining | DC-23 | 8 |
Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:
data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn’t work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern ‘\s[A-Z]{4}$’ is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?
>Solution :
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^– start of string(?P<Movie>.*?)– Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible\s+– one or more whitespaces(?P<Category>\S+)– Group "Category": one or more non-whitespace chars\s+x\s+–xenclosed with one or more whitespaces(?P<Rating>\d+)– Group "Rating": one or more digits$– end of string.