LastName = re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", Tbl['Full Name'][0]).strip().split(' ')[-1]
In Python, this regex extracts last names from full names correctly, except when the last name is followed by a suffix of "Jr." or "Sr."
An example name that triggers the problem is Ronald N. McDonald, Jr.
How can the regex be fixed so that last names are extracted even when they are followed by a Jr. or Sr. suffix?
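A minimal reproduction of the behaviour described above, assuming the name is a plain string rather than a value read from Tbl (the wrapper name original_last_name is hypothetical). Because "Jr." is not a Roman numeral, the pattern only finds an empty match at the end of the string, nothing is removed, and the last space-separated token is the suffix itself:

```python
import re

# Pattern from the question: optional ",<any char>" then a Roman numeral,
# anchored at the end of the string.
ORIGINAL_PATTERN = r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$"


def original_last_name(full_name: str) -> str:
    # Hypothetical wrapper around the one-liner from the question.
    return re.sub(ORIGINAL_PATTERN, "", full_name).strip().split(' ')[-1]


print(original_last_name("John P. Kennedy, IV"))      # Kennedy
print(original_last_name("Ronald N. McDonald, Jr."))  # Jr.
```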
>Solution :
This answer assumes you want only the bare last name, whatever else the full name contains, so suffixes such as ‘Jr.’ and ‘Sr.’ are discarded.
Code based on Assumption
Function
import re
def extract_last_name(full_name: str) -> str:
    pattern = r"(?:,\s?(?:Jr\.|Sr\.|M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:\s(?:I{1,3}|I?V|VI{0,3})$))?$"
    last_name = re.sub(pattern, "", full_name).strip().split(' ')[-1]
    return last_name
Test cases
import unittest

# Assumes the function above is saved as name_extractor.py
from name_extractor import extract_last_name


class TestNameExtractor(unittest.TestCase):
    def test_extract_last_name(self):
        test_cases = [
            ("Ronald N. McDonald, Jr.", "McDonald"),
            ("John Smith, Sr.", "Smith"),
            ("Jane Doe", "Doe"),
            ("Alice Johnson III", "Johnson"),
            ("John P. Kennedy, IV", "Kennedy"),
        ]
        for full_name, expected_last_name in test_cases:
            result = extract_last_name(full_name)
            self.assertEqual(result, expected_last_name, f"Failed for '{full_name}': Expected '{expected_last_name}', got '{result}'")


if __name__ == '__main__':
    unittest.main()
Regex Explanation
- (?:...)? : An optional non-capturing group (the trailing ? makes it optional). It groups the suffix alternatives without capturing the matched text, and the final $ anchors the whole match to the end of the string.
- ,\s? : A literal comma followed by an optional whitespace character. The comma is required in this branch, so it covers "John Smith, Jr." and "John Smith,Jr." but not a comma-less "John Smith Jr.".
- Jr\.|Sr\. : The first two alternatives after the comma match the literal suffixes "Jr." and "Sr." (the period is escaped so that it matches only a dot).
- M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) : The third alternative after the comma matches a Roman numeral, so a suffix such as ", IV" is removed. Every part of this sub-pattern is optional, so it can also match the empty string, which means a bare trailing comma is stripped as well.
- |(?:\s(?:I{1,3}|I?V|VI{0,3})$) : The second top-level alternative, introduced by |, matches one required whitespace character followed by a Roman numeral from I to VIII at the end of the string. It handles a numeral suffix that has no comma before it, as in "Alice Johnson III".
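Since the comma branch requires a comma, a name written as "John Smith Jr." is left untouched by the pattern above. If comma-less "Jr"/"Sr" suffixes (with or without the trailing period) also need to be stripped, one possible variant is sketched below. It is not part of the original answer: the names SUFFIX_PATTERN and extract_last_name_relaxed are hypothetical, and the looser separator rules are an assumption about the input data.

```python
import re

# Sketch of a more permissive suffix pattern (an assumption, not the
# original answer's regex). re.VERBOSE lets each branch carry a comment.
SUFFIX_PATTERN = re.compile(
    r"""
    (?:
        (?:,\s*|\s+)        # separator: a comma plus optional spaces, or spaces only
        (?:Jr|Sr)\.?        # generational suffix, trailing period optional
      |
        ,\s*                # a comma is required before a full Roman numeral
        M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})
      |
        \s+(?:I{1,3}|I?V|VI{0,3})   # comma-less numerals, limited to I..VIII
    )$
    """,
    re.VERBOSE,
)


def extract_last_name_relaxed(full_name: str) -> str:
    # Remove a trailing suffix, then take the last whitespace-separated token.
    parts = SUFFIX_PATTERN.sub("", full_name.strip()).split()
    return parts[-1] if parts else ""


print(extract_last_name_relaxed("John Smith Jr."))           # Smith
print(extract_last_name_relaxed("Ronald N. McDonald, Jr."))  # McDonald
```

Like the original pattern, this variant still treats a trailing single-letter token such as "V" as a numeral, so a name that genuinely ends in such a token would need special handling.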