Is there a neat way to use re.findall() while removing the first 4 characters of all matches? (python)

I’m extracting papers from arXiv by their arXiv IDs, using regex to find the IDs in a bib file. My current function is the following:

import re

def get_arxiv_ids(bib_file_path):
    list_of_ids = []
    with open(bib_file_path, "r") as f:
        bib_string = f.read()

    # Match "arXiv:" plus the ID, then slice off the 6-character prefix.
    arxiv_digits_list = re.findall(r"arXiv:\d{4}\.\d{4,5}", bib_string)
    for arxiv_id in arxiv_digits_list:
        list_of_ids.append(arxiv_id[6:])

    # Match "abs/" plus the ID, then slice off the 4-character prefix.
    abs_digits_list = re.findall(r"abs/\d{4}\.\d{4,5}", bib_string)
    for abs_id in abs_digits_list:
        list_of_ids.append(abs_id[4:])

    print("Found {} arxiv ids in {}".format(len(list_of_ids), bib_file_path))
    return list_of_ids

I need to include the arXiv: or abs/ prefix in the pattern; otherwise I will extract some false positives. However, I was wondering if there is a neater way to remove those characters from each match than looping over every element. Performance is not an issue; I am just curious.

Solution:

You can use a regular expression capture group to capture only the desired part of the pattern. Groups are created using parentheses:

arxiv_digits_list = re.findall(r"arXiv:(\d{4}\.\d{4,5})", bib_string)
abs_digits_list = re.findall(r"abs/(\d{4}\.\d{4,5})", bib_string)

When used with a single capture group, re.findall() "return[s] a list of strings matching that group."
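
To make the difference concrete, here is a minimal sketch comparing the pattern with and without the capture group. The `sample` string is a made-up stand-in for the contents of a bib file, not data from the question.

```python
import re

# Hypothetical sample text standing in for the contents of a .bib file.
sample = (
    "eprint = {arXiv:1706.03762}, "
    "url = {https://arxiv.org/abs/1810.04805}, "
    "pages = {1234.5678}"
)

# No group: each match is the whole pattern, prefix included.
with_prefix = re.findall(r"arXiv:\d{4}\.\d{4,5}", sample)

# One capture group: each match is only the text inside the parentheses.
ids_only = re.findall(r"arXiv:(\d{4}\.\d{4,5})", sample)

print(with_prefix)  # ['arXiv:1706.03762']
print(ids_only)     # ['1706.03762']
```

The prefix still has to be present for a match, so the bare `1234.5678` is ignored, but the prefix no longer appears in the results and no slicing is needed.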

BONUS (EDIT)

You can also combine your two regexes into one:

re.findall(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})", bib_string)

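This adds a non-capturing group, (?: ... ), that matches "arXiv:" OR "abs/". Because it does not capture, re.findall() still sees a single capture group and returns only the IDs.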
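
Put together, the extraction could look something like the sketch below. The names `ARXIV_ID_PATTERN` and `extract_arxiv_ids` are illustrative only, and the function takes a string rather than a file path so it is easy to try out.

```python
import re

# The non-capturing prefix guards against false positives;
# the single capture group keeps only the ID itself.
ARXIV_ID_PATTERN = re.compile(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})")


def extract_arxiv_ids(text):
    """Return the arXiv IDs found in text, in order of appearance."""
    return ARXIV_ID_PATTERN.findall(text)


# Hypothetical sample input.
sample = "see abs/1810.04805 and arXiv:1706.03762; unrelated number 1234.5678"
ids = extract_arxiv_ids(sample)
print(ids)  # ['1810.04805', '1706.03762']
```

One behavioral difference from the two-pass version in the question: the IDs come back in the order they appear in the text, not with all "arXiv:" matches first and all "abs/" matches after them.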
