Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)

November 14, 2021

I need to normalize some texts (product descriptions) with regard to the correct usage of the . , : symbols (no space before and one space after).

The regex I’ve come up with is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

The problem is that this matches three cases it shouldn’t touch:

Any decimal number, like 5.5
Any thousand separator, like 4,500
A "fixed" phrase in Greek, ό,τι
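
To make the three cases concrete, here is a minimal sketch that runs the pattern above against them. The sample strings and the variable name $originalPattern are made up for illustration and are not taken from the fiddle:

```php
<?php
// The pattern from the question, unchanged.
$originalPattern = '#\s*([:,.])\s*(?!<br />)#';

// Illustrative inputs, one per unwanted case.
foreach (['5.5', '4,500', 'ό,τι'] as $sample) {
    echo $sample, ' => ', preg_replace($originalPattern, '$1 ', $sample), PHP_EOL;
}
// 5.5 => 5. 5
// 4,500 => 4, 500
// ό,τι => ό, τι
```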

For the numeric exceptions in particular, I know this can be achieved with negative lookahead/lookbehind, but unfortunately I can't work out how to combine them with my current pattern.

This is a fiddle for you to check (the cases that shouldn’t be matched are in lines 2, 3, 4).

Any help will be very much appreciated! TIA

>Solution :

You can add two lookaheads containing lookbehinds:

\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*

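See the regex demo. Note that I also added \s* to the last lookahead and moved it before the consuming \s*, so the match fails if a <br/> tag follows the :, , or . after zero or more whitespaces. In the original order, the consuming \s* could backtrack and give up whitespace until the lookahead no longer saw <br /> directly ahead, so the exclusion had no effect whenever whitespace preceded the tag.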

Details

\s* – zero or more whitespaces
([:,.]) – Group 1: a :, , or .
(?!(?<=ό,)τι) – fail the match if the next two chars are τι and they are preceded by ό,
(?!(?<=\d.)\d) – fail the match if the next char is a digit and the two chars before the current position are a digit followed by any char (a . is enough here because [:,.] has already matched the punctuation char, so the lookbehind only needs to "jump" over that matched char)
(?!\s*<br\s*/>) – a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, /> immediately to the right of the current location.
\s* – zero or more whitespaces.
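
Putting the pieces together, the pattern could be used in PHP roughly as follows. This is a sketch, not the asker's actual code: the function name normalizePunctuationSpacing and the sample strings are hypothetical, and the u modifier is added on the assumption that the descriptions are UTF-8 encoded.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer.
// Assumption: input is UTF-8, hence the `u` modifier.
function normalizePunctuationSpacing(string $text): string
{
    $pattern = '#\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*#u';

    // Drop whitespace around the punctuation char and put one space after it.
    return preg_replace($pattern, '$1 ', $text);
}

// Ordinary punctuation is normalized.
echo normalizePunctuationSpacing('Size :large ,color:red'), PHP_EOL;
// Size: large, color: red

// Decimals and thousand separators are left alone.
echo normalizePunctuationSpacing('Weighs 5.5 kg , costs 4,500'), PHP_EOL;
// Weighs 5.5 kg, costs 4,500

// The fixed Greek phrase is left alone.
echo normalizePunctuationSpacing('ό,τι'), PHP_EOL;
// ό,τι

// Punctuation directly before a <br /> tag is not touched.
echo normalizePunctuationSpacing('One.<br />Two'), PHP_EOL;
// One.<br />Two
```

Because the replacement always appends a space, a description that ends in one of these characters comes back with a trailing space, so wrapping the result in rtrim() may be worth considering.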