Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)

I need to normalize some texts (product descriptions) with regard to the correct usage of the . , : symbols (no space before and one space after).

The regex I’ve come up with is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

The problem is that this matches three cases it shouldn’t touch:

  • Any decimal number, like 5.5
  • Any thousand separator, like 4,500
  • A "fixed" phrase in Greek, ό,τι
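
To make the problem concrete, here is a minimal sketch that runs the pattern from the question over the three cases listed above. The function name normalizeNaive and the sample strings are made up for illustration; only the pattern and the '$1 ' replacement come from the question.

```php
<?php
// Hypothetical helper wrapping the question's original pattern.
function normalizeNaive(string $text): string
{
    // Any ':', ',' or '.' with optional surrounding whitespace is rewritten
    // as the symbol followed by exactly one space.
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

// Illustrative inputs, one per exception.
foreach (['Weight 5.5 kg', 'Price 4,500', 'ό,τι'] as $sample) {
    echo normalizeNaive($sample), PHP_EOL;
}
// Prints:
// Weight 5. 5 kg
// Price 4, 500
// ό, τι
```

In each case the symbol sits between characters that should stay attached to it, but the pattern has no way to know that.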

Especially for the numeric exception, I know it can be achieved with some negative lookahead/lookbehind but unfortunately I can’t combine them in my current pattern.

This is a fiddle for you to check (the cases that shouldn’t be matched are in lines 2, 3, 4).

Any help will be very much appreciated! TIA

>Solution :

You can add two lookaheads containing lookbehinds:

\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*

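See the regex demo. Note that I also added \s* to the last lookahead and moved that lookahead in front of the consuming \s*. This way the match fails whenever <br/> follows the matched : , or . symbol, even with zero or more whitespace characters in between. With the original order, the consuming \s* could backtrack and give up its whitespace so that the lookahead no longer saw <br /> directly ahead.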
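
The effect of that reordering can be seen in isolation with the sketch below. Both helper names and the sample string are hypothetical, and the second pattern is a simplified version of the answer's regex with the numeric and Greek lookaheads left out, so that only the <br /> handling differs.

```php
<?php
// Question's order: consuming \s* first, then the lookahead.
function withTrailingLookahead(string $text): string
{
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

// Answer's order: lookahead (with its own \s*) before the consuming \s*.
function withLeadingLookahead(string $text): string
{
    return preg_replace('#\s*([:,.])(?!\s*<br\s*/>)\s*#', '$1 ', $text);
}

$input = 'First line. <br />Second line';

// \s* backtracks to zero whitespace, the lookahead then sees " <br />",
// so the "." still matches and gains an extra space.
echo withTrailingLookahead($input), PHP_EOL; // First line.  <br />Second line

// The lookahead skips the whitespace itself, finds <br />, and the match fails.
echo withLeadingLookahead($input), PHP_EOL;  // First line. <br />Second line
```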

Details

  • \s* – zero or more whitespaces
  • ([:,.]) – Group 1: a :, , or .
  • (?!(?<=ό,)τι) – fail the match if the next two chars are τι and they are preceded by ό,
  • (?!(?<=\d.)\d) – fail the match if the next char is a digit and the two chars before it are a digit followed by any char (a . is enough in the lookbehind, since [:,.] has already matched the required char; here we just need to "jump" over that matched char)
  • (?!\s*<br\s*/>) – a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, /> immediately to the right of the current location.
  • \s* – zero or more whitespaces.
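
Putting the pieces together, the pattern could be plugged into preg_replace roughly as follows. This is a sketch rather than a drop-in implementation: the function name normalizePunctuation is invented, and the u modifier is an addition not present in the answer, assumed here because the descriptions contain UTF-8 Greek text.

```php
<?php
// Hypothetical wrapper around the answer's pattern.
function normalizePunctuation(string $text): string
{
    // Assumption: the "u" modifier is added so the pattern and the subject
    // are treated as UTF-8.
    $pattern = '#\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*#u';

    // preg_replace returns null on failure (for example, invalid UTF-8 input
    // with the "u" modifier); fall back to the untouched text in that case.
    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

echo normalizePunctuation('Size :large ,color:red'), PHP_EOL; // Size: large, color: red
echo normalizePunctuation('Weight 5.5 kg'), PHP_EOL;          // Weight 5.5 kg
echo normalizePunctuation('Price 4,500'), PHP_EOL;            // Price 4,500
echo normalizePunctuation('ό,τι'), PHP_EOL;                   // ό,τι
```

Two side effects are worth keeping in mind. A symbol at the very end of the string still receives a trailing space, and a symbol directly before <br /> is skipped entirely, so any whitespace in front of it is left as it was.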