Python regex to extract most recent digit preceding a keyword

’10. T. BESLEY, POLITICAL SELECTION. J. ECON. PERSPECT. 19, 43–60 (2005). 11. J. D. FEARON, CAMBRIDGE STUDIES IN THE THEORY OF DEMOCRACY, IN DEMOCRACY, ACCOUNTABILITY, AND REPRESENTATION, A. PRZEWORSKI, B. MANIN, S. C. STOKES, EDS. (CAMBRIDGE UNIV. PRESS, 1999), PP. 55–97. 12. B. B. DE MESQUITA, A. SMITH, THE DICTATOR’S HANDBOOK: WHY BAD BEHAVIOR IS ALMOST ALWAYS GOOD POLITICS (HACHETTE UK, 2011). 13. S. WONG, S. E. GUGGENHEIM, “COMMUNITY-DRIVEN DEVELOPMENT: MYTHS AND REALITIES” (WPS8435, THE WORLD BANK, 2018), PP. 1–36. 14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE ALLOCATION: EXPERIMENTAL EVIDENCE FROM AFGHANISTAN. J. DEV. ECON. 124, 199–213 (2017). 15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS: EVIDENCE FROM A FIELD EXPERIMENT IN INDONESIA. AM. POLIT. SCI. REV. 104, 243–267 (2010). 16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL CAPITAL: EVIDENCE FROM A FIELD EXPERIMENT IN SUDAN. AM. POLIT. SCI. REV. 109, 427–449 (2015)’

I have a list of references in text as shown above, where the text in bold is what I want to extract using re.findall(). Essentially, I would like to grab the reference number (here, 16) followed by the citation of interest up to the citation’s publication year (here, 2015). Because I have the first author’s last name in a list, I can use ‘BLAKE’ as a keyword, but everything else needs to be matched using regex.

So far I’ve tried:
re.findall(r'\d+?.*?BLAKE.*?\d{4}', refer, re.DOTALL)

But this grabs everything above, since \d+ matches ’10.’, not ’16.’. I thought .*? would minimize the string matched between the digit and BLAKE, but it does not. An alternative is to give a range instead of .*, like re.findall(r'\d+?.{0,5}BLAKE.*?\d{4}', refer, re.DOTALL), but I’m doing this for many other texts and I cannot know in advance how much text there will be between the reference number and the first author’s last name.

Is there a way to grab the most recent digit (here, 16) preceding a keyword (BLAKE) here? Or a way to minimize the search between digit and a keyword?

>Solution :

If you’re guaranteed not to have any other digits in between the reference number and the "keyword" you’re searching for, the below should do the trick:

re.findall(r'\d+?[A-Z\.\s,]+BLAKE.*?\d{4}', refer, re.DOTALL)

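As for why this works: the expression [A-Z\.\s,]+ is a character class that matches only upper-case letters, the literal ., whitespace, and commas. Because digits are not in the class, the text between the reference number and BLAKE cannot run across the page numbers or year of an earlier reference, so the match can only start at the number directly before the keyword. The lazy .*? in your attempt does not help because the regex engine always returns the leftmost match; laziness only shortens where a match ends, not where it starts. Note that the class also excludes lower-case letters, hyphens, and apostrophes, so an author name containing any of those between the number and the keyword would prevent a match.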
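Here is a minimal sketch of the pattern above applied to a shortened version of the sample text. The helper name find_reference and the trimmed sample string are made up for illustration; re.escape is only there so that a last name taken from your list is treated as literal text.

```python
import re

# Shortened sample based on the references in the question (titles trimmed).
refer = (
    "14. A. BEATH, F. CHRISTIA, R. ENIKOLOPOV, DIRECT DEMOCRACY AND RESOURCE "
    "ALLOCATION. J. DEV. ECON. 124, 199–213 (2017). "
    "15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS. "
    "AM. POLIT. SCI. REV. 104, 243–267 (2010). "
    "16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS TO BUILD SOCIAL "
    "CAPITAL. AM. POLIT. SCI. REV. 109, 427–449 (2015)"
)


def find_reference(text, last_name):
    """Hypothetical helper: build the pattern around one author's last name."""
    # The character class contains no digits, so the span between the
    # reference number and the name cannot cross an earlier reference.
    pattern = r'\d+?[A-Z\.\s,]+' + re.escape(last_name) + r'.*?\d{4}'
    return re.findall(pattern, text, re.DOTALL)


# The original lazy attempt still starts at the first number in the text.
lazy = re.findall(r'\d+?.*?BLAKE.*?\d{4}', refer, re.DOTALL)
print(lazy[0][:12])   # begins with reference 14, not 16

# The restricted version starts at the number right before the keyword.
for name in ["BLAKE", "OLKEN"]:
    print(find_reference(refer, name))
```

The match stops at the year itself, so the closing parenthesis is not included; extend the end of the pattern if you need it.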

UPDATE: I just now reread your question, and you said you wanted to extract the number only, not the entire reference. For that, Nick’s answer suffices. I’ll keep my answer here, though, in case it helps answer any other questions…
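If only the reference number is needed, one possible approach (a sketch of my own, not necessarily what the other answer does) is to wrap the digits in a capturing group, since re.findall returns just the group contents when the pattern contains exactly one group. The sample string here is shortened for illustration.

```python
import re

# Shortened sample; only the part around the keyword matters here.
refer = (
    "15. B. A. OLKEN, DIRECT DEMOCRACY AND LOCAL PUBLIC GOODS. "
    "AM. POLIT. SCI. REV. 104, 243–267 (2010). "
    "16. A. BLAKE, M. J. GILLIGAN, INTERNATIONAL INTERVENTIONS. "
    "AM. POLIT. SCI. REV. 109, 427–449 (2015)"
)

# With a single capturing group, findall returns the captured digits only.
numbers = re.findall(r'(\d+)[A-Z\.\s,]+BLAKE', refer)
print(numbers)
```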
