I have a long list of strings which are all random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.
What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.
As you may have guessed, I’m very new to this, and I’ve only managed to write r"(?<!\s)([A-Z][a-z]*)", but that still matches 'Yellow' and 'Pomegranate'. How do I do this?
>Solution :
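If every word starts with an uppercase character followed by optional lowercase characters, you can use lookarounds and an alternation to match both variations: a word that directly follows another word, or a word that is directly followed by one.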
(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])
The pattern matches:
(?<=[a-z])    Assert a-z to the left
[A-Z][a-z]*   Match A-Z and optional chars a-z
|             Or
[A-Z][a-z]*   Match A-Z and optional chars a-z
(?=[A-Z])     Assert A-Z to the right
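To see what each side of the alternation contributes, the two branches can be run on their own. This is only a small sketch: the names left, right and the sample string are my own choices for illustration, not part of the question.

```python
import re

# Branch 1: a capitalized word directly preceded by a lowercase letter
left = r"(?<=[a-z])[A-Z][a-z]*"
# Branch 2: a capitalized word directly followed by an uppercase letter
right = r"[A-Z][a-z]*(?=[A-Z])"

s = "AppleOrangeBanana Yellow"  # illustrative sample, three words stuck together

left_matches = re.findall(left, s)
right_matches = re.findall(right, s)
both = re.findall(left + "|" + right, s)

print(left_matches)   # ['Orange', 'Banana']
print(right_matches)  # ['Apple', 'Orange']
print(both)           # ['Apple', 'Orange', 'Banana']
```

Neither branch alone finds every stuck word: the first misses the leading word, the second misses the trailing one. The alternation covers both cases, and 'Yellow' is skipped because it has whitespace on its left and nothing on its right.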
Example
import re
pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = "AppleOrange\nPomegranate Yellow Banana"
print(re.findall(pattern, s))
Output
['Apple', 'Orange']
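Since the question mentions a long list of strings, here is one possible way to apply the same pattern per item. It is a sketch under the assumption that the data is a plain Python list; the names words and stuck are hypothetical.

```python
import re

# Compile once, since the pattern is reused for every string in the list
pattern = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

# Hypothetical sample data in the shape described in the question
words = ["Pomegranate", "Yellow Banana", "AppleOrange"]

# Keep only the strings that contain stuck-together words,
# mapped to the individual words found in them
stuck = {w: pattern.findall(w) for w in words if pattern.search(w)}

print(stuck)  # {'AppleOrange': ['Apple', 'Orange']}
```

Strings without stuck-together words produce no match, so they are simply left out of the result.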