
How to scrape a varying number of &lt;p&gt; tags inside a &lt;div&gt; class

I’m trying to scrape a webpage that has an unknown number of &lt;p&gt; tags inside a known div class. Some pages have only one &lt;p&gt; tag, while others have 10 or even more. How can I extract them all? Preferably into one variable, so I can store it in a CSV like all the other data I’m scraping 🙂

The HTML structure is as in the following example:

<div class="div_name">
    <h2 class="h5">title text</h2>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>text text text...</p>
    <p>text text text...</p>
</div>

I’m using Python and the Scrapy framework to achieve this.


Currently I have:

divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
story = p

yield {
    'story': story
}

It does print the text values for all of the various &lt;p&gt; tags, but when the data is written to the CSV file, only the last &lt;p&gt; ends up in the *.csv.
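That behavior is plain Python rather than anything Scrapy-specific: `story = p` sits outside the loop, so it only keeps whatever the loop variable held on the final iteration. A minimal stand-in sketch (no Scrapy involved) shows the same effect:

```python
# `story = p` runs once, after the loop has finished, so it captures
# only the loop variable's final value -- not every item.
paragraphs = ['<p>one</p>', '<p>two</p>', '<p>three</p>']
for p in paragraphs:
    print(p)   # each paragraph prints fine...
story = p      # ...but only the last one is kept
print(story)   # <p>three</p>
```

Collecting into a list (or joining into a single string, as in the solution below the question) inside or after the loop fixes this.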

To store the scraped data into *.csv, I have the following inside my settings.py:

# Depth of crawler
DEPTH_LIMIT = 0  # 0 = infinite depth

# Feed Export Settings
FEED_FORMAT="csv"
FEED_URI="output_%(name)s.csv"

The yield shown above supplies the fields that go into the *.csv.

Kindest regards,

> Solution:

You could do it in one line, really:

story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])

If you could confirm the page URL, I could probably improve that long, fragile XPath. Nonetheless, the above should work.

Scrapy documentation can be found at https://docs.scrapy.org/en/latest/
