Splitting a text at a capital letter after a small letter, without losing the small letter

I have the following type of strings:
"CanadaUnited States",
"GermanyEnglandSpain"

I want to split them into the countries’ names, i.e.:

['Canada', 'United States']
['Germany', 'England', 'Spain']

I have tried using the following regex:

import re

text = "GermanyEnglandSpain"
re.split('[a-z](?=[A-Z])', text)

and I’m getting:
['German', 'Englan', 'Spain']

How can I not lose the last char in every word?
Thanks!

>Solution :

I would use re.findall here instead of re.split, matching each country name directly:

import re

inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries)  # ['Canada', 'United States']

The regex pattern used here says to match:

  • [A-Z][a-z]+ matches a capitalized word: one uppercase letter followed by one or more lowercase letters
  • (?: [A-Z][a-z]+)* then matches a space and another capitalized word, zero or more times
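
For comparison, here is a small sketch that runs the findall pattern on both sample strings and also shows a split-based alternative. The original attempt loses a letter because [a-z] is part of the match that re.split consumes as the separator; moving it into a lookbehind makes the separator zero-width. The function names below are made up for illustration, and the split variant assumes Python 3.7 or later, where re.split accepts empty matches.

```python
import re

# A capitalized word, optionally followed by more space-separated capitalized words.
COUNTRY_PATTERN = re.compile(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*')


def split_with_findall(text):
    """Return every run of capitalized words found in text."""
    return COUNTRY_PATTERN.findall(text)


def split_with_lookbehind(text):
    """Split at each position between a lowercase and an uppercase letter.

    Both assertions are zero-width, so no character is consumed.
    Splitting on an empty match needs Python 3.7+.
    """
    return re.split(r'(?<=[a-z])(?=[A-Z])', text)


for sample in ("CanadaUnited States", "GermanyEnglandSpain"):
    print(split_with_findall(sample))
    print(split_with_lookbehind(sample))
# Both functions print:
# ['Canada', 'United States']
# ['Germany', 'England', 'Spain']
```

Both variants only handle ASCII letters. The findall pattern also expects every word in a name to be capitalized, so a name containing a lowercase word (for example "of") would be cut into pieces, while the lookbehind split would leave it intact.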
