I have the following type of strings:
"CanadaUnited States",
"GermanyEnglandSpain"
I want to split them into the countries’ names, i.e.:
['Canada', 'United States']
['Germany', 'England', 'Spain']
I have tried using the following regex:
import re
text = "GermanyEnglandSpain"
re.split('[a-z](?=[A-Z])', text)
and I’m getting:
['German', 'Englan', 'Spain']
How can I avoid losing the last character of every word?
Thanks!
>Solution :
I would use re.findall here, matching the country names directly instead of splitting:
import re
inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries)  # ['Canada', 'United States']
The regex pattern used here says to match:
[A-Z][a-z]+ : a leading capitalized word of a country name
(?: [A-Z][a-z]+)* : followed by a space and another capitalized word, zero or more times
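As for why the original attempt drops letters: in '[a-z](?=[A-Z])' the [a-z] is part of the match, so re.split treats that lowercase letter as the separator and discards it. If you would rather keep re.split, a minimal sketch is to make the whole pattern zero-width by turning the [a-z] into a lookbehind. This assumes Python 3.7 or later (earlier versions do not split on empty matches), and split_countries is just a hypothetical helper name for illustration:

```python
import re

def split_countries(text):
    # Split at the zero-width position between a lowercase letter and an
    # uppercase letter; nothing is consumed, so no characters are lost.
    # Requires Python 3.7+, where re.split accepts empty matches.
    return re.split(r'(?<=[a-z])(?=[A-Z])', text)

print(split_countries("GermanyEnglandSpain"))  # ['Germany', 'England', 'Spain']
print(split_countries("CanadaUnited States"))  # ['Canada', 'United States']
```

Both approaches rely on every name part being one capital letter followed by lowercase letters, so inputs such as "USA" or "Isle of Man" would need a different pattern.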