I am very new to Linux and Bash scripting. I’m trying to read an XML file using the curl command and count the number of occurrences of the closing tag </entity> in it.
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l
This works, but the XML file contains comments like the one below, and matches inside them are counted too, which gives the wrong count.
Sample XML file
.........
........
<entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>
The expected output is 2, since one of the matches is inside the comment block.
>Solution :
Since you’re using GNU grep, here is a PCRE regex solution for your problem:
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l
2
RegEx Details:
(?s) : Enable DOTALL mode so that the dot also matches line breaks
<!--.*?--> : Match a commented block
(*SKIP)(*F) : Skip and fail this commented block
| : OR
</entity> : Match </entity> outside a commented block
tr '\0' '\n' : Converts NUL bytes to line breaks
wc -l : Counts the number of lines
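To try the pipeline without the server, here is a minimal sketch that feeds a small inline sample through the same grep pattern. The function name count_entities and the sample document are made up for illustration. It assumes GNU grep built with PCRE support (the -P option).

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads XML on stdin and prints the number of
# </entity> closing tags that are NOT inside an XML comment.
# Assumes GNU grep with PCRE support (-P).
count_entities() {
  # -z treats the whole input as one record, so the comment can span lines;
  # -o prints each match, terminated by a NUL byte because of -z.
  grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
    tr '\0' '\n' |   # one match per line
    wc -l            # count the matches
}

# Sample mirroring the question: three </entity> tags, one inside a comment.
sample='<root>
<entity>
  <field/>
</entity>
<!--
<entity>
  <field/>
</entity>
-->
<entity>
  <field/>
</entity>
</root>'

count=$(printf '%s\n' "$sample" | count_entities)
echo "$count"
```

With the real file, the same function can sit at the end of the curl pipeline: curl -s "..." | count_entities. The -Z flag from the answer is left out here because it only affects how file names are printed, and no file names are printed when grep reads from stdin.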