Python regex to extract last names with suffixes Jr. and Sr.

LastName = re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", Tbl['Full Name'][0]).strip().split(' ')[-1]

In Python, this regex extracts last names from full names correctly, except when the last name is followed by a suffix of "Jr." or "Sr."

An example name that triggers the problem is "Ronald N. McDonald, Jr."

How can the regex be fixed so that last names are extracted even when they are followed by the suffixes "Jr." and "Sr."?

>Solution :

Assuming you only want the bare last name, regardless of what else the full name contains, suffixes such as ‘Jr.’ and ‘Sr.’ are discarded.
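For context, here is a quick check of why the pattern from the question leaves "Jr." in place. The Roman-numeral group can match the empty string, so the pattern succeeds at the very end of the string without consuming anything, and the last space-separated token is still the suffix. The helper name original_last_name is hypothetical and only wraps the expression from the question.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL_PATTERN = r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$"

def original_last_name(full_name: str) -> str:
    # Same expression as in the question, wrapped in a hypothetical helper.
    return re.sub(ORIGINAL_PATTERN, "", full_name).strip().split(' ')[-1]

# A Roman-numeral suffix is stripped, but "Jr." is not.
for name in ["John P. Kennedy, IV", "Ronald N. McDonald, Jr."]:
    print(name, "->", original_last_name(name))
```

The first name yields "Kennedy", while the second yields "Jr.", which is the behaviour described in the question.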

Code based on Assumption

Function

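import re

def extract_last_name(full_name: str) -> str:
    # Trailing suffix: a comma followed by "Jr.", "Sr." or a Roman numeral,
    # or a whitespace-separated Roman numeral from I to VIII.
    pattern = r"(?:,\s?(?:Jr\.|Sr\.|M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:\s(?:I{1,3}|I?V|VI{0,3})$))?$"

    # Remove the suffix, then take the last space-separated token.
    last_name = re.sub(pattern, "", full_name).strip().split(' ')[-1]
    return last_name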

Test cases

import unittest
from name_extractor import extract_last_name

class TestNameExtractor(unittest.TestCase):
    def test_extract_last_name(self):
        test_cases = [
            ("Ronald N. McDonald, Jr.", "McDonald"),
            ("John Smith, Sr.", "Smith"),
            ("Jane Doe", "Doe"),
            ("Alice Johnson III", "Johnson"),
            ("John P. Kennedy, IV", "Kennedy"),
        ]

        for full_name, expected_last_name in test_cases:
            result = extract_last_name(full_name)
            self.assertEqual(result, expected_last_name, f"Failed for '{full_name}': Expected '{expected_last_name}', got '{result}'")

if __name__ == '__main__':
    unittest.main()

Regex Explanation

  1. (?:...)?: This is a non-capturing group that is optional (due to the ? at the end). It is used to group elements without capturing the matched text.

  2. ,\s?: This matches a comma followed by an optional whitespace character. It accounts for cases where the suffix is separated from the last name by a comma, with or without a space, e.g. "John Smith, Jr." or "John Smith,Jr.". Because the comma is required, "John Smith Jr." (no comma) is not matched by this branch.

  3. (?:Jr\.|Sr\.): This is a non-capturing group that matches either "Jr." or "Sr.". It is used to handle the cases where the suffix is "Jr." or "Sr.".

  4. M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}): This part of the pattern matches Roman numerals and is the same sub-pattern used in the question. It handles cases where the suffix after the comma is a Roman numeral, e.g. "John P. Kennedy, IV".

  5. |(?:\s(?:I{1,3}|I?V|VI{0,3})$): This is the alternative branch of the pattern, separated by |, which matches a single whitespace character followed by a Roman numeral from I to VIII at the end of the string. It handles cases where the suffix is a Roman numeral without a comma separating it from the last name, e.g. "Alice Johnson III".
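The pieces explained above can also be written as a commented re.VERBOSE pattern. The sketch below is a variant, not the answer's exact pattern: it assumes you also want to accept "Jr."/"Sr." without a comma or without the trailing period, and it limits Roman numerals to I through VIII in both cases. The names SUFFIX_PATTERN and extract_last_name_loose are hypothetical.

```python
import re

# Hypothetical variant: the comma is optional here, which goes beyond
# what the answer's pattern accepts.
SUFFIX_PATTERN = re.compile(
    r"""
    (?:,\s*|\s+)              # a comma with optional spaces, or just whitespace
    (?:
        Jr\.? | Sr\.?         # "Jr." or "Sr.", trailing period optional
      | I{1,3} | I?V | VI{0,3}   # Roman numerals I through VIII
    )
    $                         # the suffix must end the string
    """,
    re.VERBOSE,
)

def extract_last_name_loose(full_name: str) -> str:
    # Drop a trailing suffix, then return the last whitespace-separated token.
    parts = SUFFIX_PATTERN.sub("", full_name.strip()).split()
    return parts[-1] if parts else ""

for name in ["Ronald N. McDonald, Jr.", "John Smith Jr.", "Alice Johnson III"]:
    print(name, "->", extract_last_name_loose(name))
```

Requiring a comma or whitespace before the suffix keeps the pattern from trimming the end of an ordinary last name. One trade-off of this assumption is that a final token consisting only of I, V, or a short numeral (for example a single-letter surname "V") would be treated as a suffix.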
