Regex: Match string with OR without string delimiter

April 25, 2024

I’m extracting fields from BibTeX and have a small problem: a value may or may not be wrapped in curly brackets.
Please find the example text below:

@article{Roxas_2011, title={Social Desirability Bias in Survey Research on Sustainable Development in Small Firms: an Exploratory Analysis of Survey Mode Effect}, volume={21}, ISSN={1099-0836}, url={http://dx.doi.org/10.1002/bse.730}, DOI={10.1002/bse.730}, number={4}, journal={Business Strategy and the Environment}, publisher={Wiley}, author={Roxas, Banjo and Lindsay, Val}, year={2011}, month=sep, pages={223\xe2\x80\x93235} }

As you can see, every field except month has the form x={y}, so a simple pattern (PHP preg_match with mUg flags):

[\s,]+(.*)={(.*[^}])}

Does the trick for everything except month=sep.

If I try using ", " as the delimiter, it apparently also splits the authors.
Can you please help me? 🙂

Thanks 🙂

>Solution :

You can use

[\s,]+(.*?)=(?|{([^{}]*)}|(\w+))

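Note that you should not use any of the flags from your attempt: in PHP, U inverts the greediness of quantifiers, so the lazy .*? would become greedy, and g is not a valid modifier (call preg_match_all to get every match). If you need them, you may add the s flag to make . match line break chars and the u flag to make \w and \s match all Unicode word/whitespace chars.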

Details

[\s,]+ – one or more whitespaces or/and commas
(.*?) – Group 1: any zero or more chars other than line break chars as few as possible
= – a = char
(?|{([^{}]*)}|(\w+)) – a branch reset group, in which the capturing groups of every alternative share the same number (here, Group 2), matching:
- {([^{}]*)} – a { char, then zero or more chars other than { and } captured into Group 2, then a } char
- | – or
- (\w+) – Group 2: one or more word chars.
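
As a rough illustration of how the pattern could be used from PHP, here is a minimal sketch. The helper name parseBibtexFields and the shortened sample entry are made up for this example, and it assumes that values never contain nested braces (the [^{}]* part would not match those).

```php
<?php
// Hypothetical helper (not part of the original answer): collect the
// fields of one BibTeX entry into an associative array.
function parseBibtexFields(string $entry): array
{
    // Same pattern as above, with no flags. In particular no U flag,
    // which would invert greediness and break the lazy (.*?).
    $pattern = '/[\s,]+(.*?)=(?|{([^{}]*)}|(\w+))/';

    $fields = [];
    if (preg_match_all($pattern, $entry, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[1] is the field name; $m[2] is the value, whether it was
            // braced or bare, because the branch reset group reuses Group 2.
            $fields[trim($m[1])] = $m[2];
        }
    }
    return $fields;
}

// Shortened sample entry (assumption: no nested braces inside values).
$entry = '@article{Roxas_2011, volume={21}, author={Roxas, Banjo and Lindsay, Val}, year={2011}, month=sep }';

$fields = parseBibtexFields($entry);
print_r($fields);
```

Because each match consumes the whole braced value, the comma inside the author field is never treated as a separator, and month=sep is picked up by the (\w+) alternative.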