How to extract last name while avoiding roman numerals

How can I extract only the last name (including hyphenated double last names), without Roman numerals, extra spaces, or other characters?

A string in a Pandas dataframe representing a person's full name can take the following forms:

Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton

Is regex a good solution? I’m obviously a novice, but would like an efficient solution.

Thank you for your help!

Solution:

Try this to remove the Roman numerals and the comma (when there is no comma, the space before the numeral is left behind as trailing whitespace):

import re

x = """Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton""".split('\n')


for s in x:
  print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s))

[out]:

Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe 
Jon Doe
Jon A. Doe
Jon A. Doe 
Jon Anderson Doe
Jon Anderson Doe 
Jon Anderson Doe-Stapleton 
Jon Anderson Doe-Stapleton
Jon Anderson Doe-Stapleton

Regex explanation: https://regex101.com/r/xeZpBD/1

Why do you need a complex regex for the roman numerals?

Because not every combination of the letters I, V, X, L, C, D and M is a valid Roman numeral; see https://regexr.com/3a406.
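
To make that point concrete, here is a small sketch comparing the strict pattern from the answer with a naive character class. The names `STRICT`, `NAIVE` and `is_strict_roman` are made up for this illustration; note that the strict pattern also matches the empty string, so the helper checks for a non-empty token first.

```python
import re

# Strict pattern copied from the answer above; on its own it also matches "".
STRICT = re.compile(r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
# Naive alternative: any run of Roman-numeral letters, valid or not.
NAIVE = re.compile(r"[IVXLCDM]+")

def is_strict_roman(token):
    # Reject the empty string, then require the whole token to match.
    return bool(token) and STRICT.fullmatch(token) is not None

for token in ["IV", "VII", "XIV", "IIII", "VX", "IC"]:
    # Prints the token, the strict verdict, and the naive verdict.
    print(token, is_strict_roman(token), NAIVE.fullmatch(token) is not None)
```

The naive class accepts strings such as IIII or VX, which the strict pattern rejects.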

But how do we extract the last name?

Depends on how it’s defined. If it’s just the last token from the names, then you can just do this:

for s in x:
  print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s).strip().split(' ')[-1])

[out]:

Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe-Stapleton
Doe-Stapleton
Doe-Stapleton
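
The same idea can be wrapped in a reusable function, which is handy for a dataframe column. This is only a sketch under a few assumptions: the names `SUFFIX` and `last_name` are invented here, and the pattern is a variant of the one above that requires whitespace before the numeral, so no trailing space is left and letters at the end of a surname are not stripped by accident.

```python
import re

# Optional comma, required whitespace, then a non-empty Roman numeral at the end.
# The lookahead forces at least one numeral letter, since the strict pattern
# would otherwise also match the empty string.
SUFFIX = re.compile(
    r",?\s+(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def last_name(full_name):
    """Drop a trailing Roman-numeral suffix and return the last token."""
    cleaned = SUFFIX.sub("", full_name.strip())
    tokens = cleaned.split()
    return tokens[-1] if tokens else ""

names = ["Jon Doe", "Jon A. Doe, V", "Jon A. Doe X",
         "Jon Anderson Doe-Stapleton, VII"]
for name in names:
    print(last_name(name))
```

On a dataframe, something like df["full_name"].map(last_name) should apply it row by row, where full_name is a placeholder for your own column name. One caveat: a genuine one-letter surname such as X would be treated as a numeral and removed.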

What if last name isn’t a single token/word?

E.g. https://en.wikipedia.org/wiki/Double-barrelled_name

The rugby player Rohan Janse van Rensburg's surname is Janse van Rensburg, not only van Rensburg (which is itself an existing surname).

or

Andrew Lloyd Webber, Baron Lloyd-Webber Kt (born 22 March 1948), is an English composer and impresario of musical theatre.

Shrugs, you need something more than regex for this, maybe a last name list?
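
A rough sketch of what the last-name-list idea could look like, assuming the Roman-numeral suffix has already been stripped. `KNOWN_SURNAMES` and `last_name_with_lookup` are hypothetical names, and the three entries are just the examples from this answer; a real list would have to come from your own data.

```python
# Hypothetical lookup of multi-word surnames (illustrative entries only).
KNOWN_SURNAMES = ["Janse van Rensburg", "van Rensburg", "Lloyd Webber"]

def last_name_with_lookup(cleaned_name, known=KNOWN_SURNAMES):
    """Prefer a known multi-word surname, else fall back to the last token."""
    # Try the longest entries first so "Janse van Rensburg" wins over
    # the shorter "van Rensburg".
    for surname in sorted(known, key=len, reverse=True):
        if cleaned_name == surname or cleaned_name.endswith(" " + surname):
            return surname
    tokens = cleaned_name.split()
    return tokens[-1] if tokens else ""

print(last_name_with_lookup("Rohan Janse van Rensburg"))
print(last_name_with_lookup("Andrew Lloyd Webber"))
print(last_name_with_lookup("Jon Doe"))
```

This only works as well as the list behind it, and it cannot tell a middle name from the first word of a surname that is not on the list.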
