
How to scrape a varying number of &lt;p&gt; tags inside a &lt;div&gt; class

I’m trying to scrape a webpage that has an unknown number of &lt;p&gt; tags inside a known div class. Some pages have only one &lt;p&gt; tag, while others have 10 or even more. How can I extract them all? Preferably into one variable, so I can store it in a CSV like all the other data I’m scraping 🙂

The HTML structure is as in the following example:

<div class="div_name">
    <h2 class="h5">title text</h2>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>text text text...</p>
    <p>text text text...</p>
</div>

I’m using Python and the Scrapy framework to achieve this.


Currently I have:

divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
story = p

yield {
    'story': story
}

It does print the text values for all of the various &lt;p&gt; tags, but when the data is written to the CSV file, only the last &lt;p&gt; ends up in the *.csv.
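That behavior is plain Python rather than anything Scrapy-specific: `story = p` sits outside the loop, so it only keeps whatever the loop variable held on the final iteration. A minimal stand-in sketch (no Scrapy involved) shows the same effect:

```python
# `story = p` runs once, after the loop has finished, so it captures
# only the loop variable's final value -- not every item.
paragraphs = ['<p>one</p>', '<p>two</p>', '<p>three</p>']
for p in paragraphs:
    print(p)   # each paragraph prints fine...
story = p      # ...but only the last one is kept
print(story)   # <p>three</p>
```

Collecting into a list (or joining into a single string, as in the solution below the question) inside or after the loop fixes this.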

To store the scraped data into *.csv, I have the following inside my settings.py:

# Depth of crawler
DEPTH_LIMIT = 0  # 0 = infinite depth

# Feed Export Settings
FEED_FORMAT="csv"
FEED_URI="output_%(name)s.csv"

The yield shown above supplies the fields that go into the *.csv.

Kindest regards,

> Solution:

You could do it in one line, really:

story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])

If you could confirm the page URL, I could probably improve that long, fragile XPath. Nonetheless, the above should work.

Scrapy documentation can be found at https://docs.scrapy.org/en/latest/
