Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:

Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8

This is what I want the results to look like:

Title Category Rating
Wolf of Wall Street A-13 9
Django Unchained IMDB 8
The EXPL Haunted House FEAR 7
Silver Lining DC-23 8

Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)

However, to separate Category from movie titles, this RegEx doesn’t work:

data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)

Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern ‘\s[A-Z]{4}$’ is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?

>Solution :

You can use

import pandas as pd

df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})

df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)

See the regex demo.

Details:

  • ^ – start of string
  • (?P<Movie>.*?) – Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
  • \s+ – one or more whitespaces
  • (?P<Category>\S+) – Group "Category": one or more non-whitespace chars
  • \s+x\s+x enclosed with one or more whitespaces
  • (?P<Rating>\d+) – Group "Rating": one or more digits
  • $ – end of string.
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading