Scrapy does not capture all of the markup that comes back in the response

I'm trying to capture the markup that comes back in HTTP responses, but I only get part of it. For some reason the output is cut off in the middle. What could be causing this? Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging


class StackOverflowSpider(scrapy.Spider):
    
    name = 'stackoverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/tagged/python?tab=newest&page=1&pagesize=15']
    first_request_done = False
    
    def start_requests(self):
        if not self.first_request_done:
            self.first_request_done = True
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
            
    def parse(self, response):
        if response.status == 200 and response.headers.get('Content-Type', b'').startswith(b'text/html'):
            html = response.body.decode('utf-8')
            print(html)
        
        yield
    

configure_logging()
process = CrawlerProcess(settings={
    'LOG_ENABLED': False,
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1
})
process.crawl(StackOverflowSpider)
process.start(stop_after_crawl=False)

Solution:

This is just Python's print function not properly flushing the output. You can demonstrate this by splitting the page content into lines and printing them one at a time, or by writing the contents to a file and viewing the full output there.

For example, you can try this to print it out line by line:

def parse(self, response):
    for line in response.text.splitlines():
        print(line)
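
If buffering is the suspect, another way to rule it out is to write the text in chunks and flush explicitly after each one. The following is only a minimal sketch and not part of the original answer: `write_flushed` and its `chunk_size` parameter are hypothetical names, and it assumes that the text passed in is `response.text` from the `parse` method above.

```python
import sys


def write_flushed(text, stream=None, chunk_size=8192):
    """Write text to stream in fixed-size chunks, flushing after each chunk.

    Hypothetical helper for illustration. Returns the number of characters
    written so the caller can compare it with len(text).
    """
    stream = sys.stdout if stream is None else stream
    written = 0
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        stream.write(chunk)
        stream.flush()  # push this chunk out before writing the next one
        written += len(chunk)
    return written
```

Inside `parse`, this could be called as `write_flushed(response.text)`. If the returned count equals `len(response.text)`, the whole page was handed to standard output, and any remaining truncation is probably happening in the terminal or console that displays it.
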

Alternatively, if you want to write the contents to a file:

def parse(self, response):
    with open('response.html', "wt", encoding="utf8") as htmlfile:
        htmlfile.write(response.text)
    ...
    ...
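
To confirm that the response itself is complete regardless of what the console shows, it may help to print a few facts about the markup instead of the markup itself. This is a rough sketch under stated assumptions: `summarize_markup` is a hypothetical helper, and checking for a trailing `</html>` tag is only a heuristic, since some pages omit the tag or append content after it.

```python
def summarize_markup(html):
    """Return basic facts that help tell a truncated page from a full one.

    Hypothetical helper for illustration; the closing-tag check is a
    heuristic, not a guarantee that the page is complete.
    """
    stripped = html.rstrip()
    return {
        "chars": len(html),
        "lines": len(html.splitlines()),
        "ends_with_html_close": stripped.lower().endswith("</html>"),
    }
```

Calling `print(summarize_markup(response.text))` inside `parse` produces a single short line of output. If it reports a plausible size and a closing `</html>` tag, the download was most likely complete and only the display of the printed text was cut off.
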