Regex for whitespace delimiter except for [ and ] characters

I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.

I want to replace every run of whitespace with a delimiter, except the whitespace between "" and [] characters.

I used the regex ("[^"]*"|\S+)\s+ but it split the [06/Jan/2021:17:50:09 +0300] part of my log into two fields.

Here is my entire log line:

[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""

The result I get from my regex with the sed command (replacing whitespace with a comma):

[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

And the result that I want to have:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

>Solution :

You can match strings between square brackets by adding \[[^][]*] as an alternative to the Group 1 pattern:

sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g'

Now, the ERE pattern (extended syntax is enabled with the -E option; note that \s and \S are GNU extensions rather than POSIX) matches

  • (\[[^][]*]|"[^"]*"|\S+) – Group 1: either
    • \[[^][]*] – a [ char, then zero or more chars other than [ and ] and then a ] char
    • | – or
    • "[^"]*" – a " char, zero or more chars other than " and then a " char
    • | – or
    • \S+ – one or more non-whitespace chars
  • \s+ – one or more whitespaces
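
Since \s and \S are not part of POSIX ERE, a sed without those extensions may not accept the pattern as written. A minimal sketch of a more portable variant, assuming a sed that supports -E and a shell with here-strings (bash), swaps in the POSIX character classes [[:space:]] and [^[:space:]]. The variable name portable is only illustrative:

```shell
#!/bin/bash
# Same log line as in the question
s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

# [^[:space:]]+ replaces \S+ and [[:space:]]+ replaces \s+;
# the bracketed and quoted alternatives are unchanged
portable=$(sed -E 's/(\[[^][]*]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g' <<< "$s")
echo "$portable"
```

For this log line, the variant should produce the same comma-separated result as the \s/\S version.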

Demo script:

#!/bin/bash
s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'
sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g' <<< "$s"

Output:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
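
One way to sanity-check the tokenization is to hand only the Group 1 alternatives to grep -o, which prints each match on its own line, and count the tokens with and without the bracket alternative. This is a rough sketch, assuming a grep that supports -o and -E; the variable names old, new, old_count and new_count are made up for illustration, and POSIX character classes stand in for \s and \S:

```shell
#!/bin/bash
s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

# Token patterns only (no trailing whitespace part)
old='"[^"]*"|[^[:space:]]+'                 # alternatives from the question
new='\[[^][]*]|"[^"]*"|[^[:space:]]+'       # with the bracket alternative added

# grep -o emits one token per line; tr strips padding some wc versions add
old_count=$(grep -oE "$old" <<< "$s" | wc -l | tr -d ' ')
new_count=$(grep -oE "$new" <<< "$s" | wc -l | tr -d ' ')

echo "without bracket alternative: $old_count tokens"
echo "with bracket alternative:    $new_count tokens"
```

The count without the bracket alternative comes out one higher, because the timestamp is split at its inner space into [06/Jan/2021:17:50:09 and +0300].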