I am very new to Linux and bash scripting. I'm trying to read an XML file with curl and count the number of occurrences of the closing tag </entity> in it.
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l
This works, but the XML file contains comment blocks like the one below, and matches inside them are counted too, which gives the wrong total.
Sample XML file
.........
........
<entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>
The expected output is 2, since one of the three matches is inside the comment block.
Solution:
Since you're using GNU grep, here is a PCRE regex solution for your problem:
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l
2
RegEx Details:
(?s): Enable DOTALL mode so that dot also matches line breaks
<!--.*?-->: Match a commented block
(*SKIP)(*F): Skip and fail this commented block
|: OR
</entity>: Match </entity> outside a commented block
tr '\0' '\n': Converts NUL bytes to line breaks
wc -l: Counts the number of lines
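To try the pipeline without hitting the server, here is a minimal sketch that runs the same regex on an inline sample. The variable names `sample` and `count` and the XML content are made up for illustration and stand in for the curl output. It assumes GNU grep built with PCRE support (-P). The -z option makes grep treat the whole input as a single NUL-terminated record, which is what lets the comment pattern span several lines; -Z is left out here because it only affects how file names are printed.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for the XML returned by curl.
sample='<root>
<entity>
</entity>
<!--
<entity>
</entity>
-->
<entity>
</entity>
</root>'

# -z: read the whole input as one record so a match can cross line breaks
# -o: print only the matched text (each match is followed by a NUL byte)
# -P: use PCRE, needed for (?s) and the (*SKIP)(*F) verbs
count=$(printf '%s\n' "$sample" |
  grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
  tr '\0' '\n' |
  wc -l)

echo "$count"
```

For comparison, running the original grep -oP '</entity>' | wc -l on the same sample counts all three closing tags, including the one inside the comment.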