I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.
I want to replace every run of whitespace with a comma, except whitespace inside "..." and [...].
I used the regex ("[^"]*"|\S+)\s+ but it split the [06/Jan/2021:17:50:09 +0300] part of my log into two blocks.
Here is my entire log line :
[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""
The result I get from my regex with the sed command (replacing whitespace with a comma):
[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
The result I want:
[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
>Solution :
You can match strings between square brackets by adding \[[^][]*] as another alternative in the Group 1 pattern:
sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g'
Now, the POSIX ERE pattern (syntax enabled with the -E option) matches:
(\[[^][]*]|"[^"]*"|\S+) - Group 1, one of:
    \[[^][]*] - a [ char, then zero or more chars other than [ and ], and then a ] char
    "[^"]*" - a " char, zero or more chars other than ", and then a " char
    \S+ - one or more non-whitespace chars
\s+ - one or more whitespace chars
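To see what the extra alternative changes, here is a small side-by-side sketch. The shortened sample line and the variable names are made up for illustration, and it assumes GNU sed (for \s and \S):

```shell
#!/bin/bash
# Hypothetical shortened sample, not the full log line from the question.
sample='[06/Jan/2021:17:50:09 +0300] "" 407 "GET / HTTP/1.1" ""'

# Without the bracket alternative: \S+ stops at the space inside [...],
# so that space is replaced with a comma as well.
without_brackets=$(printf '%s\n' "$sample" | sed -E 's/("[^"]*"|\S+)\s+/\1,/g')

# With the bracket alternative: the whole [...] block is captured as one token.
with_brackets=$(printf '%s\n' "$sample" | sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g')

printf '%s\n' "$without_brackets" "$with_brackets"
# [06/Jan/2021:17:50:09,+0300],"",407,"GET / HTTP/1.1",""
# [06/Jan/2021:17:50:09 +0300],"",407,"GET / HTTP/1.1",""
```

The last field keeps no trailing comma because the pattern only matches a token that is followed by whitespace.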
See the online demo:
#!/bin/bash
s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'
sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g' <<< "$s"
Output:
[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
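One caveat: \s and \S are GNU sed extensions rather than part of POSIX ERE itself. If your sed does not support them (an assumption about your environment), a variant with POSIX bracket classes should behave the same way. A minimal sketch, where the variable names are only illustrative:

```shell
#!/bin/bash
line='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

# [[:space:]] stands in for \s and [^[:space:]] for \S.
# printf with a pipe is used instead of the bash-only <<< here-string.
portable=$(printf '%s\n' "$line" |
  sed -E 's/(\[[^][]*]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g')

printf '%s\n' "$portable"
```

This should print the same comma-separated line as the output above.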