How to RegEx between values that can be multiline or singleline

I have a long text from which I need to extract data. I am trying to use RegEx but with little success. I did my research, tried a lot of things, but it is not working.

The pattern should:

  • Find the string: "Adónem számlaszáma: "
  • Return the account number after that
  • Go backwards until the first word made of 3 digits
  • Return that 3-digit code
  • Return the text between this code and the first string

Part of the text:

Időszak: 2021.01.01-2021.11.24
/
101 Társasági adó   Adónem számlaszáma: 10032000-01076019
2021.01.01.
Nyitóegyenleg

Pattern used:

*Flags used: global, single line*

(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n

Match is good:

match1

Another part of the text:

-13 000


    101 adónemen többlet:   5 000 Ft
104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868

Same pattern used.

Match is not good:

match2
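
The mismatch can be reproduced outside the regex tester. Below is a minimal Python sketch, assuming that `re.DOTALL` corresponds to the tester's single line flag and that the excerpt ends with a line break (the pattern requires a final `\n`):

```python
import re

# Pattern from the question; re.DOTALL stands in for the "single line" flag.
pattern = re.compile(
    r"(\b\d\d\d\b)( .*?)Adónem számlaszáma: (.*?)\n", re.DOTALL
)

# Second excerpt from the question, with a trailing line break (assumed).
text = (
    "-13 000\n"
    "\n"
    "\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

match = pattern.search(text)
print(repr(match.group(1)))  # the code that was captured
print(repr(match.group(2)))  # the text captured between code and fixed string
print(repr(match.group(3)))  # the account number
```

Here the first group comes back as 101 rather than 104, and the second group runs across the line break into the next line.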

This is the full file I am working with: samplefile.txt

What am I missing? I have the lazy quantifier, dot matches newline etc… Thank you in advance.

Solution:

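Your pattern matches too much because the regex engine takes the leftmost position at which a match can succeed. The lazy quantifier only keeps the end of the match as early as possible; with dot-matches-newline on, nothing stops the match from starting at an earlier 3-digit number (101 in your second excerpt).

If the text between the code and the fixed string never spans lines, turn the single line flag off and use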

\b\d{3}\b\s*(.*?)\s*Adónem számlaszáma: (\S*)

See this regex in action.
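
As a rough check, the pattern can be run over the two excerpts from the question. This is only a sketch in Python: joining the two excerpts into one string is an assumption made for illustration, and the variable names are hypothetical.

```python
import re

# First pattern from the answer. re.DOTALL is NOT set, so "." stops at line breaks.
pattern = re.compile(r"\b\d{3}\b\s*(.*?)\s*Adónem számlaszáma: (\S*)")

# Both excerpts from the question, concatenated (assumption for illustration).
text = (
    "Időszak: 2021.01.01-2021.11.24\n"
    "/\n"
    "101 Társasági adó   Adónem számlaszáma: 10032000-01076019\n"
    "2021.01.01.\n"
    "Nyitóegyenleg\n"
    "-13 000\n"
    "\n"
    "\n"
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

# findall returns one (Group 1, Group 2) tuple per match.
results = pattern.findall(text)
for name, account in results:
    print(name, "->", account)
```

Because `.` cannot cross a line break here, an attempt that starts at 101 on the "többlet" line fails, and the engine moves on to 104.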

Otherwise, you would need to make sure there are no other 3-digit numbers between a 3-digit number and your fixed string:

\b\d{3}\b\s*((?:(?!\b\d{3}\b)[^])*?)\s*Adónem számlaszáma: (\S*)

See this demo. Let me explain the second pattern as it is more specific:

  • \b\d{3}\b – three digits enclosed with word boundaries
  • \s* – zero or more whitespaces
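  • ((?:(?!\b\d{3}\b)[^])*?) – Group 1: any char, line breaks included ([^], a JavaScript-specific construct; [\s\S] is the portable equivalent), zero or more repetitions but as few as possible (*?), where no char starts a 3-digit number enclosed with word boundaries
  • \s*Adónem számlaszáma: – zero or more whitespaces, then the fixed string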
  • (\S*) – Group 2: zero or more non-whitespace chars.
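
The breakdown above can be tried in Python with two changes that are not part of the answer's pattern: `[^]` is replaced by `[\s\S]`, because Python's `re` module does not accept `[^]`, and an extra capturing group is put around the 3-digit code so that it is returned as well. The sample text is hypothetical: it is the second excerpt with the tax name split across two lines to exercise the multiline case.

```python
import re

pattern = re.compile(
    r"(\b\d{3}\b)\s*"                # added group: the 3-digit code
    r"((?:(?!\b\d{3}\b)[\s\S])*?)"   # any chars that do not start a 3-digit number
    r"\s*Adónem számlaszáma: (\S*)"  # fixed string, then the account number
)

# Hypothetical input: the tax name is wrapped onto a second line.
text = (
    "    101 adónemen többlet:   5 000 Ft\n"
    "104 Általános\n"
    "forgalmi adó  Adónem számlaszáma: 10032000-01076868\n"
)

rows = []
for code, name, account in pattern.findall(text):
    # Collapse line breaks and repeated spaces inside the captured name.
    rows.append((code, " ".join(name.split()), account))

print(rows)
```

The attempts that start at 101 and at 000 are abandoned as soon as the next 3-digit number is reached, so only the 104 entry is reported.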