Regex for matching only capitalized words stuck together (i.e. not separated by whitespace)

I have a long list of strings which are all random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.

What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.

As expected, I’m very new to this, and I’ve only managed to write r"(?<!\s)([A-Z][a-z]*)", but that still matches 'Yellow' and 'Pomegranate'. How do I do this?

Solution:

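If every word starts with an uppercase letter followed by optional lowercase letters, you can use lookarounds and an alternation to match both cases: a word directly preceded by a lowercase letter, or a word directly followed by an uppercase letter.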

(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])

The pattern matches:

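  • (?<=[a-z]) Assert a lowercase letter a-z directly to the left
  • [A-Z][a-z]* Match one uppercase letter A-Z followed by optional lowercase letters a-z
  • | Or
  • [A-Z][a-z]* Match one uppercase letter A-Z followed by optional lowercase letters a-z
  • (?=[A-Z]) Assert an uppercase letter A-Z directly to the right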
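
To see what each side of the alternation contributes, the two branches can be run on their own. This is only an illustrative sketch: the sample string and the variable names `after_lower` and `before_upper` are made up for the demonstration and are not part of the question.

```python
import re

# Hypothetical sample: three words stuck together, then normally spaced words
s = "AppleOrangeBanana Pomegranate Yellow Banana"

# Branch 1: a capitalized word directly preceded by a lowercase letter
after_lower = re.findall(r"(?<=[a-z])[A-Z][a-z]*", s)

# Branch 2: a capitalized word directly followed by an uppercase letter
before_upper = re.findall(r"[A-Z][a-z]*(?=[A-Z])", s)

print(after_lower)   # ['Orange', 'Banana']
print(before_upper)  # ['Apple', 'Orange']
```

Neither branch alone finds all three stuck words: the first misses the leading 'Apple' and the second misses the trailing 'Banana', which is why the pattern combines them with |.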

Example

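import re

# Left branch: word preceded by a lowercase letter; right branch: word followed by an uppercase letter
pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = "AppleOrange\nPomegranate Yellow Banana"

# findall returns every non-overlapping match, scanning left to right
print(re.findall(pattern, s))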

Output

['Apple', 'Orange']
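
Since the question mentions a long list of strings, one possible way to apply the pattern item by item is sketched below. The names `STUCK`, `stuck_words`, and `items` are hypothetical, and the sketch assumes, as the question states, that the strings contain only letters and spaces.

```python
import re

# Same pattern as above, compiled once because it is reused for every string
STUCK = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

def stuck_words(strings):
    """Map each string that contains stuck-together words to those words."""
    result = {}
    for text in strings:
        words = STUCK.findall(text)
        if words:  # skip strings in which nothing is stuck together
            result[text] = words
    return result

# Hypothetical sample list built from the examples in the question
items = ["Pomegranate", "Yellow Banana", "AppleOrange"]
print(stuck_words(items))  # {'AppleOrange': ['Apple', 'Orange']}
```

Properly separated entries such as 'Yellow Banana' produce no matches, so only the stuck-together strings end up in the result.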