Home What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

Questions

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

November 17, 2021

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:

Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8

This is what I want the results to look like:

Title	Category	Rating
Wolf of Wall Street	A-13	9
Django Unchained	IMDB	8
The EXPL Haunted House	FEAR	7
Silver Lining	DC-23	8

Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:

data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)

However, to separate Category from movie titles, this RegEx doesn’t work:

data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)

Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern ‘\s[A-Z]{4}$’ is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?

>Solution :

You can use

import pandas as pd

df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})

df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)

See the regex demo.

Details:

^ – start of string
(?P<Movie>.*?) – Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ – one or more whitespaces
(?P<Category>\S+) – Group "Category": one or more non-whitespace chars
\s+x\s+ – x enclosed with one or more whitespaces
(?P<Rating>\d+) – Group "Rating": one or more digits
$ – end of string.