I have a text file with entries like:
2: Adcock R, Cuzick J, Hunt WC, McDonald RM, Wheeler CM; New Mexico HPV Pap
Registry Steering Committee. Role of HPV Genotype, Multiple Infections, and
Viral Load on the Risk of High-Grade Cervical Neoplasia. Cancer Epidemiol
Biomarkers Prev. 2019 Nov;28(11):1816-1824. doi: 10.1158/1055-9965.EPI-19-0239.
Epub 2019 Sep 5. PMID: 31488417; PMCID: PMC8394698.
3: Castle PE, Adcock R, Cuzick J, Wentzensen N, Torrez-Martinez NE, Torres SM,
Stoler MH, Ronnett BM, Joste NE, Darragh TM, Gravitt PE, Schiffman M, Hunt WC,
Kinney WK, Wheeler CM; New Mexico HPV Pap Registry Steering Committee; p16 IHC
Study Panel. Relationships of p16 Immunohistochemistry and Other Biomarkers With
Diagnoses of Cervical Abnormalities: Implications for LAST Terminology. Arch
Pathol Lab Med. 2020 Jun;144(6):725-734. doi: 10.5858/arpa.2019-0241-OA. Epub
2019 Nov 13. PMID: 31718233; PMCID: PMC8575174.
I want to write a Python program that extracts all the numbers that follow "PMID:".
I tried:
import re
path = 'summaryCosetteWheset.txt'
pmidsFile = open(path, 'r')
info = pmidsFile.read()
print(info)
pmidsList = re.findall(r'PMID: (\d)+;', info)
print(pmidsList)
But I am only getting single digits, not full numbers like 31718233. Is there a way to do this? Thanks
PS: I just started with Python 3.
>Solution :
You need to move the + inside the capturing group to capture all the digits in each match.
pmidsList = re.findall(r'PMID: (\d+);', info)
With the + outside the capturing group, the group is repeated once per digit and each repetition overwrites the previous capture, so only the last digit of each PMID is retained.
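To make the difference concrete, here is a minimal sketch that runs both patterns on a short string. The `sample` text is an assumption: two lines copied from the entries in the question stand in for the real file. The names `last_digits` and `pmids` are illustrative only.

```python
import re

# Two lines taken from the entries in the question, used in place of the file.
sample = (
    "Epub 2019 Sep 5. PMID: 31488417; PMCID: PMC8394698.\n"
    "2019 Nov 13. PMID: 31718233; PMCID: PMC8575174.\n"
)

# Quantifier outside the group: the one-digit group is repeated, and each
# repetition overwrites the previous capture, so only the last digit is kept.
last_digits = re.findall(r'PMID: (\d)+;', sample)

# Quantifier inside the group: the group spans the whole run of digits.
pmids = re.findall(r'PMID: (\d+);', sample)

print(last_digits)  # ['7', '3']
print(pmids)        # ['31488417', '31718233']
```

If some entries end with the PMID (so no ; follows it) or a line wrap falls right after the colon, a looser pattern such as r'PMID:\s*(\d+)' would still match. Whether either case occurs in this file is an assumption.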