Grep: exclude matches inside <!-- --> comments when counting occurrences in a curl response body

I am very new to Linux and Bash scripting. I'm trying to fetch an XML file with curl and count the number of occurrences of the closing tag </entity> in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l

This runs, but the XML file contains comment blocks like the one below, and the tags inside them are counted too, which gives the wrong result.

Sample XML file

.........
........
 <entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>

The expected output is 2, since one of the matches is inside the comment block.

>Solution :

Since you're using GNU grep, here is a PCRE solution for your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l

2


RegEx Details:

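  • -z: Treat the input as NUL-separated records, so the whole response is read as one record and a match can span lines
  • -o, -P: Print only the matched parts, using a Perl-compatible regex (PCRE)
  • (?s): Enable DOTALL mode so that the dot also matches line breaks
  • <!--.*?-->: Match a comment block (lazily, up to the first closing -->)
  • (*SKIP)(*F): Force this alternative to fail and skip past the comment block, so nothing inside it can match
  • |: OR
  • </entity>: Match </entity> outside a comment block
  • tr '\0' '\n': Convert the NUL bytes that separate the matches into line breaks
  • wc -l: Count the number of lines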
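
To try the pattern without hitting the server, the pipeline can be run against a local sample. This is only a sketch: the sample_xml variable and the count_entities function are hypothetical names introduced here for illustration, the XML is made-up data that mirrors the structure in the question, and GNU grep built with -P support is assumed.

```shell
#!/usr/bin/env bash

# Hypothetical sample standing in for the curl response body:
# two real <entity> blocks and one inside a comment.
sample_xml='<root>
<entity>
  <field/>
</entity>
<!--
<entity>
  <field/>
</entity>
-->
<entity>
  <field/>
</entity>
</root>'

# Hypothetical helper: reads XML on stdin and prints the number of
# </entity> tags that are not inside a <!-- ... --> comment.
count_entities() {
  grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |  # one NUL-terminated match per hit
    tr '\0' '\n' |                                   # NUL -> newline so wc can count
    wc -l
}

count=$(printf '%s\n' "$sample_xml" | count_entities)
echo "$count"   # prints 2
```

In the real script, the printf would be replaced by the curl -s call from above. Note that this is a regex heuristic, not an XML parser: a comment that is opened but never closed is not skipped, so tags after it would still be counted.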
