Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Matching a regex metacharacter literally

My understanding of Regex’s in AWK is that in order to match a Regex metacharacter literally (For example: +,$,^,*,etc) You must escape them, like so:

awk -F '\\+' 'program here'

However I’ve noticed that you don’t actually need to do this with certain metacharacters, such as the "+"

Input file:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line

AWK program:


#!/usr/bin/awk -f
BEGIN { FS = "+|^"}

{print $1,$2,$3,$4 }

Expected output (Due to not escaping the +):

this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line

Actual output:


this|| is|| a|| line
his|is|a|line
this is a line
his?is?a?line
his*is*a*line
his$is$a$line
his.is.a.line

I don’t understand how this is working. I’m giving AWK blatantly bad code by not escaping the metacharacter (to make it literal) however AWK is matching successfully anyway?

I own a copy of "The AWK programming language" so I went through the section on Regex just to make sure I’m not going mad, and it states the following:

In a matching expression, a quoted string like "^[0-9]+$" can normally be used interchangeably with a regular expression enclosed in slashes, such as /^[0-9]+$/. There is one exception, however. If the string in quotes is to match a literal occurrence of a regular expression metacharacter, one extra backslash is needed to protect the protecting backslash itself. That is,

$0 ~ /(+|-)[0-9]+/

and

$0 ~ "(\+|-)[0-9]+"

are equivalent.

This behavior may seem arcane, but it arises because one level of protecting backslashes is removed when a quoted string is parsed by awk. If a backslash is needed in front of a metacharacter to turn off its special meaning in a regular expression, then that backslash needs a preceding backslash to protect it in a string.

Can someone explain what I’m missing here?

>Solution :

The + is at the start of the pattern: it can’t modify anything before that (i.e., allowing 1 or more of the non-existing character in front of it), thus awk interprets it as a literal + character, not a modifier.

From the gawk manual, on regex operator details

In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading