I am trying to select the first occurrence of a specific element within each repeated block of the following structure:
<div class="jobs-list">
<div class="job-listing">
<h3>Title1</h3>
<span class="organization">
<a href="https://www.domain1.org/" target="_blank">Org1</a>
</span>
<span class="location">Loc1</span>
<div class="description">
desc1
<a href="https://www.domain1-1.org/" target="_blank">https://www.domain1-1.org/</a>
<span class="list-date">Posted on: 01/19/2022</span>
</div>
</div>
<div class="job-listing">
<h3>Title2</h3>
<span class="organization">
<a href="https://www.domain2.org/" target="_blank">Org2</a>
</span>
<span class="location">Loc2</span>
<div class="description">
desc2
<a href="https://www.domain2.org/" target="_blank">https://www.domain2.org/</a>
<span class="list-date">Posted on: 01/18/2022</span>
</div>
</div>
<div class="job-listing">
<h3>Title3</h3>
<span class="organization">
<a href="https://www.domain3.org/" target="_blank">Org3</a>
</span>
<span class="location">Loc3</span>
<div class="description">
desc3
<a href="mailto:user@domain3.org">user@domain3.org</a>
<span class="list-date">Posted on: 01/19/2022</span>
</div>
</div>
<div class="job-listing">
<h3>TItle4</h3>
<span class="organization">Org4</span>
<span class="location">Loc4</span>
<div class="description">
desc4
<a href="mailto:user@domain4.org">user@domain4.org</a>
<a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
<a href="https://www.domain4-1.org/" target="_blank">https://www.domain4-1.org/</a>
<span class="list-date">Posted on: 01/06/2022</span>
</div>
</div>
</div>
Specifically, I need the result to be the following:
https://www.domain1.org/
https://www.domain2.org/
https://www.domain3.org/
https://www.domain4.org/
Which should be the first a/@href under each div[@class='job-listing'], but I’m not sure how to express that. Some things to note:
- The <a> is always two nodes under the root (job-listing).
- The first <a> isn’t always correct (only looking for http), but I can filter those out easily enough; I’m caught up on how to select the node, not filtering for the content or anything like that.
- I need the value of a/@href, not the contents of <a>.
Thanks!
>Solution :
//div[@class='job-listing']/descendant::a[1] gives you the first a descendant of each of those divs. The explicit descendant axis matters here: //div[@class='job-listing']//a[1] would instead select every a that is the first a child of its own parent. If you want to add the http check, use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1].
If you need the href attribute node, use //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href. Note that the default serialization in some XSLT or XQuery processors doesn’t allow you to serialize a sequence of standalone attribute nodes, but in XPath 2 or 3 you can use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href/string() to get a sequence of attribute values instead.
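To check the selection logic outside an XPath processor, here is a minimal Python sketch using only the standard library. Note the assumptions: xml.etree.ElementTree supports only a subset of XPath (no descendant axis with positional predicates, no starts-with()), so the loop below emulates descendant::a[starts-with(@href, 'http')][1] by walking each job-listing in document order and stopping at the first match. The names SAMPLE and first_http_hrefs are made up for this illustration, and the markup is a shortened, well-formed copy of listings 3 and 4 from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of two listings from the question (must be well-formed XML).
SAMPLE = """<div class="jobs-list">
  <div class="job-listing">
    <h3>Title3</h3>
    <span class="organization">
      <a href="https://www.domain3.org/" target="_blank">Org3</a>
    </span>
    <div class="description">desc3
      <a href="mailto:user@domain3.org">user@domain3.org</a>
    </div>
  </div>
  <div class="job-listing">
    <h3>Title4</h3>
    <span class="organization">Org4</span>
    <div class="description">desc4
      <a href="mailto:user@domain4.org">user@domain4.org</a>
      <a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
      <a href="https://www.domain4-1.org/" target="_blank">https://www.domain4-1.org/</a>
    </div>
  </div>
</div>"""


def first_http_hrefs(markup):
    """Return the href of the first http(s) link inside each job-listing div."""
    root = ET.fromstring(markup)
    hrefs = []
    # ElementTree does support simple attribute predicates like this one.
    for listing in root.findall(".//div[@class='job-listing']"):
        # iter() yields descendants in document order, like the descendant axis.
        for anchor in listing.iter("a"):
            href = anchor.get("href", "")
            if href.startswith("http"):
                hrefs.append(href)
                break  # keep only the first match per listing, i.e. the [1]
    return hrefs


print(first_http_hrefs(SAMPLE))
# ['https://www.domain3.org/', 'https://www.domain4.org/']
```

With a library that implements full XPath 1.0, the expression from the answer can be passed in directly and the loop is unnecessary.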