Grep: exclude matches between comments <!-- --> when counting occurrences in a curl body

May 13, 2022

I am very new to Linux and Bash scripting. I’m trying to read an XML file using the curl command and count the number of occurrences of the closing tag </entity> in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l

This works, but the XML file contains comment blocks like the one below. Matches inside them are counted too, so the count is wrong.

Sample XML file

.........
........
 <entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>

The expected output is 2, since one of the matches is inside the comment block.

>Solution :

Since you’re using GNU grep, here is a PCRE solution to your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l

2
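
To try the pattern without contacting the server, the same pipeline can be run against a local file. The sketch below is only an illustration: the file contents are made-up sample data (not the real samplefile.xml), the variable names are hypothetical, and it assumes GNU grep built with PCRE support (-P).

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the curl response body.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<root>
<entity>
  <field name="a"/>
</entity>
<!--
<entity>
  <field name="b"/>
</entity>
-->
<entity>
  <field name="c"/>
</entity>
</root>
EOF

# Naive count: also counts the closing tag inside the comment block.
naive=$(grep -o '</entity>' "$sample" | wc -l)

# Comment-aware count:
#   -z       treat the whole input as one record so the pattern can span lines
#   (?s)     let . match newlines
#   <!--.*?-->(*SKIP)(*F)  match a comment, then discard it and resume after it
#   |</entity>             otherwise match the closing tag
filtered=$(grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' "$sample" | tr '\0' '\n' | wc -l)

echo "naive=$naive filtered=$filtered"
rm -f "$sample"
```

With this sample the naive count should come out as 3 and the comment-aware count as 2, which mirrors the situation described in the question.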


RegEx Details: