My understanding of Regex’s in AWK is that in order to match a Regex metacharacter literally (For example: +,$,^,*,etc) You must escape them, like so:
awk -F '\\+' 'program here'
However I’ve noticed that you don’t actually need to do this with certain metacharacters, such as the "+"
Input file:
this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line
AWK program:
#!/usr/bin/awk -f
BEGIN { FS = "+|^"}
{print $1,$2,$3,$4 }
Expected output (Due to not escaping the +):
this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line
Actual output:
this|| is|| a|| line
his|is|a|line
this is a line
his?is?a?line
his*is*a*line
his$is$a$line
his.is.a.line
I don’t understand how this is working. I’m giving AWK blatantly bad code by not escaping the metacharacter (to make it literal) however AWK is matching successfully anyway?
I own a copy of "The AWK programming language" so I went through the section on Regex just to make sure I’m not going mad, and it states the following:
In a matching expression, a quoted string like "^[0-9]+$" can normally be used interchangeably with a regular expression enclosed in slashes, such as /^[0-9]+$/. There is one exception, however. If the string in quotes is to match a literal occurrence of a regular expression metacharacter, one extra backslash is needed to protect the protecting backslash itself. That is,
$0 ~ /(+|-)[0-9]+/
and
$0 ~ "(\+|-)[0-9]+"
are equivalent.
This behavior may seem arcane, but it arises because one level of protecting backslashes is removed when a quoted string is parsed by awk. If a backslash is needed in front of a metacharacter to turn off its special meaning in a regular expression, then that backslash needs a preceding backslash to protect it in a string.
Can someone explain what I’m missing here?
>Solution :
The +
is at the start of the pattern: it can’t modify anything before that (i.e., allowing 1 or more of the non-existing character in front of it), thus awk interprets it as a literal +
character, not a modifier.
From the gawk manual, on regex operator details
In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.