Python Regex to extract last name with suffixes Jr. and Sr.

LastName = re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", Tbl['Full Name'][0]).strip().split(' ')[-1]

In Python, this regex works perfectly at extracting last names from full names except where the last name is followed by a suffix of "Jr." or "Sr."

An example name that is triggering a problem is Ronald N. McDonald, Jr.

How can the regex be fixed so that last names are extracted even when they are followed by a suffix of Jr. or Sr.?
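For reference, the problem can be reproduced on plain strings. This is a minimal sketch: the helper name original_last_name is hypothetical, and the pandas lookup Tbl['Full Name'][0] is replaced by a string argument. The pattern itself is copied unchanged from the expression above.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL_PATTERN = r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$"

def original_last_name(full_name: str) -> str:
    """Apply the question's expression to a plain string (hypothetical helper)."""
    return re.sub(ORIGINAL_PATTERN, "", full_name).strip().split(' ')[-1]

# A Roman-numeral suffix is stripped before the split.
print(original_last_name("John P. Kennedy, IV"))      # Kennedy

# "Jr." is not a Roman numeral, so nothing is stripped and the
# last whitespace-separated token is the suffix itself.
print(original_last_name("Ronald N. McDonald, Jr."))  # Jr.
```

The pattern only knows about Roman numerals, so with "Jr." it matches just the empty string at the end of the input and the suffix survives into the split.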

>Solution :

This assumes you want only the bare last name, no matter how the full name is composed, i.e., suffixes such as 'Jr.' and 'Sr.' are discarded.

Code based on Assumption

Function

import re

def extract_last_name(full_name: str) -> str:
    pattern = r"(?:,\s?(?:Jr\.|Sr\.|M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:\s(?:I{1,3}|I?V|VI{0,3})$))?$"

    last_name = re.sub(pattern, "", full_name).strip().split(' ')[-1]
    return last_name

Test cases

import unittest
# Assumes the function above is saved as name_extractor.py
from name_extractor import extract_last_name

class TestNameExtractor(unittest.TestCase):
    def test_extract_last_name(self):
        test_cases = [
            ("Ronald N. McDonald, Jr.", "McDonald"),
            ("John Smith, Sr.", "Smith"),
            ("Jane Doe", "Doe"),
            ("Alice Johnson III", "Johnson"),
            ("John P. Kennedy, IV", "Kennedy"),
        ]

        for full_name, expected_last_name in test_cases:
            result = extract_last_name(full_name)
            self.assertEqual(result, expected_last_name, f"Failed for '{full_name}': Expected '{expected_last_name}', got '{result}'")

if __name__ == '__main__':
    unittest.main()

Regex Explanation

  1. (?:...)?: This is a non-capturing group that is optional (due to the ? at the end). It is used to group elements without capturing the matched text.

  2. ,\s?: This matches a comma followed by an optional whitespace character. The comma itself is required, so this branch handles suffixes separated from the last name by a comma, e.g. "John Smith, Jr." or "John Smith,Jr."; a comma-less form such as "John Smith Jr." is not matched by it.

  3. (?:Jr\.|Sr\.): This is a non-capturing group that matches either "Jr." or "Sr.". It is used to handle the cases where the suffix is "Jr." or "Sr.".

  4. M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}): This part of the pattern matches Roman numerals. It is used to handle cases where the suffix is a Roman numeral.

  5. |(?:\s(?:I{1,3}|I?V|VI{0,3})$): This is an alternative part of the pattern, separated by |, which matches a required whitespace character followed by a Roman numeral from I to VIII at the end of the string. This handles cases where the suffix is a Roman numeral without a comma separating it from the last name, e.g. "Alice Johnson III".
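Since the comma in the pattern above is required, a name written as "John Smith Jr." still returns "Jr.". The following is a sketch of one possible variant, not part of the original answer: it makes the comma optional, accepts "Jr"/"Sr" with or without the period, and limits Roman numerals to I through VIII. The names SUFFIX_PATTERN and extract_last_name_loose are hypothetical, and re.VERBOSE is used only so that each part can be commented.

```python
import re

# Hypothetical variant: the comma before the suffix is optional.
SUFFIX_PATTERN = re.compile(
    r"""
    ,?\s+                            # optional comma, then required whitespace
    (?:
        Jr\.? | Sr\.?                # generational suffix, period optional
      | (?:I{1,3} | IV | VI{0,3})    # Roman numerals I through VIII
    )
    \s*$                             # the suffix must end the string
    """,
    re.VERBOSE,
)

def extract_last_name_loose(full_name: str) -> str:
    """Strip one trailing suffix, then return the last remaining token."""
    stripped = SUFFIX_PATTERN.sub("", full_name.strip())
    parts = stripped.split()
    return parts[-1] if parts else ""

print(extract_last_name_loose("John Smith Jr."))           # Smith
print(extract_last_name_loose("Ronald N. McDonald, Jr."))  # McDonald
print(extract_last_name_loose("Alice Johnson III"))        # Johnson
```

Requiring whitespace before the suffix keeps the pattern from eating the tail of a surname, but a genuine last name that is itself "V" or "VI" in capitals would still be treated as a suffix, so this remains a heuristic.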
