Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Scrapy file, only running the initial start_urls instead of running though the whole list

As the title states, I am trying to run my scrapy program, the issue I am running into is that it seems to be only returning the yield from the initial url (https://www.antaira.com/products/10-100Mbps).

I am unsure on where my program is not working, in my code I have also left some commented code on what I have attempted.

import scrapy
from ..items import AntairaItem


class ProductJumperFix(scrapy.Spider):  # classes should be TitleCase

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps',
        'https://www.antaira.com/products/unmanaged-gigabit'
        'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'
        'https://www.antaira.com/products/Unmanaged-Gigabit-PoE'
        'https://www.antaira.com/products/Unmanaged-10-gigabit'
        'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE'
    ]
    
    #def start_requests(self):
    #    yield scrappy.Request(start_urls, self.parse)

    def parse(self, response):
        # iterate through each of the relative urls
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items

Thank you everyone!

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Follow up question, for some reason when I run "scrapy crawl productJumperFix" im not getting any output from the terminal,not sure how to debug since I can’t even see the output errors.

>Solution :

Try using the start_requests method:

For example:

import scrapy
from ..items import AntairaItem

class ProductJumperFix(scrapy.Spider):

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']

    def start_requests(self):
        urls = [
            'https://www.antaira.com/products/10-100Mbps',
            'https://www.antaira.com/products/unmanaged-gigabit',
            'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE',
            'https://www.antaira.com/products/Unmanaged-Gigabit-PoE',
            'https://www.antaira.com/products/Unmanaged-10-gigabit',
            'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() 
            items['product_link'] = response.url
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading