Skipping HTML tag within Scrapy

I am scraping data from a website using Scrapy (Python 3), and I would like to skip an <a> tag within the source code. There are two such tags, and both have the same classes, as the HTML below shows.

I am trying to select the second of the two <a> tags (the right-arrow link to the next page).

I’m using response.xpath("//nav[@class='mp-PaginationControls-pagination']/a/@href").get(), but that only selects the first <a> tag, so the spider breaks once I’m on page two.

Here is the raw HTML:

<div class="mp-PaginationControls mp-PaginationControls--new">
  <nav class="mp-PaginationControls-pagination">
    <a class="mp-TextLink mp-Button mp-Button--primary" href="/l/muziek-en-instrumenten/microfoons/">
      <span aria-hidden="true" class="mp-Button-icon mp-Button-icon--center mp-svg-arrow-left--inverse"></span>
    </a>
    <span class="mp-PaginationControls-pagination-pageList">
      <a class="mp-TextLink" href="/l/muziek-en-instrumenten/microfoons/">1</a>
      <span>2</span>
      <a class="mp-TextLink" href="/l/muziek-en-instrumenten/microfoons/p/3/">3</a>
      <span>...</span>
      <span>142</span>
    </span>
    <span class="mp-PaginationControls-pagination-amountOfPages">Pagina 2 van 142</span>
    <a class="mp-TextLink mp-Button mp-Button--primary" href="/l/muziek-en-instrumenten/microfoons/p/3/">
      <span aria-hidden="true" class="mp-Button-icon mp-Button-icon--center mp-svg-arrow-right--inverse"></span>
    </a>
  </nav>
</div>

Thanks in advance

Solution:

In the HTML you shared, the two <a> elements have different href values.
But since the href is the value you want to extract, you can’t build your XPath around it.
Each <a> does contain a child span, though, and the span’s class differs (arrow-left for the first, arrow-right for the second), so you can select the <a> by its child span.
For example:

response.xpath("//nav[@class='mp-PaginationControls-pagination']//a[./span[contains(@class,'mp-svg-arrow-right--inverse')]]/@href").get()