Select every 1st <li> that has the same class name in a group of <div> with the same set of <li>

I'm learning Python and BeautifulSoup. As a practice project, I am scraping a recipe website and displaying certain items in a template.
The website shows the meal prep time, the calories, and the number of servings for each recipe as a row of li elements inside a div.
There are 35 such div elements in a grid on the page. I want to select only the meal prep time from each div and store it in a list. All of the li elements have the same class and no other attributes. How do I select only the li I need?

Below is the HTML for one recipe card. There are 35 of these div elements, each with a different recipe.

 <div class="column xxlarge-4 large-6 small-12 ">
    <a role="link" aria-label="Recept: 'Tiramisu' met advocaat" data-testhook="recipe-card" title="Recept: 'Tiramisu' met advocaat" href="/allerhande/recept/R-R1196417/tiramisu-met-advocaat" class="display-card_root__o17AY card_root__VNG0M card_roundCorners__dYaFu display-card_anchor__cTFon" data-analytics="LINK_CLICK" data-analytics-meta="%7B%22component%22%3A%22recipe-search%22%2C%22href%22%3A%22%2Fallerhande%2Frecept%2FR-R1196417%2Ftiramisu-met-advocaat%22%2C%22title%22%3A%22R-R1196417%22%7D">
    <div class="display-card-section_section__42C0n display-card-body_body__r2mt4 card-body_root__E16CU">
    <div class="ratio-box_root__YH5Fe ratio-box_ratio-21-10__thBP0">
    <div class="ratio-box_content__k-Jz7">
    <img class="card-image-set_imageSet__Su7xI lazyautosizes ls-is-cached lazyloaded" alt="'Tiramisu' met advocaat" data-srcset=", https://static.ah.nl/static/recepten/img_RAM_PRD163172_220x162_JPG.jpg 220w 162h, >
    </div>
    </div>
    </div>
    <footer class="display-card-section_section__42C0n display-card-section_padded__lHvvK display-card-footer_footer__cxMve card-footer_root__0dl7R">
    <ul class="recipe-card-properties_root__rFiwt recipe-card-properties_allerhande__0gSBC" data-testhook="recipe-card-properties">
<li class="recipe-card-properties_property__87cH1">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_time" viewBox="0 0 24 24" width="24" height="16">
    <use xlink:href="#svg_time">
    </use>
    </svg>
    20 min
    </li>
    <li class="recipe-card-properties_property__87cH1">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_calories" viewBox="0 0 24 24" width="24" height="16">
    <use xlink:href="#svg_calories">
    </use>
    </svg>
    545 kcal
    </li>
    <li class="recipe-card-properties_property__87cH1">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="allerhande-icon recipe-card-properties_icon__wBmG9 svg svg--svg_person" viewBox="0 0 24 24" width="24" height="16">
    <use xlink:href="#svg_person">
    </use>
    </svg>
    8</li>
    </ul>
    <p class="typography_root__Om3Wh typography_variant-paragraph__T5ZAU typography_hasMargin__4EaQi card-text_title__REC-7">
    <span class="line-clamp_root__7DevG line-clamp_active__5Qc2L card-text_titleText__7T9sY card-text_boldTitle__SVYw2" data-testhook="recipe-card-title" style="-webkit-line-clamp: 2; line-height: 1.2em; max-height: 2.4em;">
    'Tiramisu' met advocaat
    </span>
    </p>
    </footer>
    </a>
    </div>

Here is the code I am using to extract the information I need:

import re

import requests
from bs4 import BeautifulSoup

# Create soup
webpage_response = requests.get("https://www.ah.nl/allerhande/recepten-zoeken?sortBy=TRENDING")
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

# Match elements whose class starts with the stable part of the generated class name
recipe_links = soup.find_all('a', attrs={'class' : re.compile('^display-card_root__.*')})
recipe_pictures = soup.find_all('img', attrs={'class' : re.compile('^card-image-set_imageSet__.*')})
recipe_prep_time = soup.find_all('li', attrs={'class' : re.compile('^recipe-card-properties_property__.*')})

However, this selects all the li items, including calories and servings, which makes it hard to pick the correct time out of the list. How can I select only the first li in each group?

>Solution :

Simple and straightforward solution:

recipe_prep_time = [ul.find('li').text 
                   for ul in soup.find_all('ul',
                        attrs={'class': re.compile('^recipe-card-properties_root')})]

yields

['15 min',
 '15 min',
 '20 min',
 '20 min',
 '35 min',
 '20 min',
 '20 min',
 '10 min',
 ...]
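
As a possible variation, here is a minimal sketch of two other ways to pick the prep time: by position, using a CSS selector, and by the time icon inside the li. It runs against a trimmed-down sample rather than the live page. The sample markup, and in particular the second card where the time is not the first li, is a hypothetical stand-in used only to show how the two approaches differ. It also uses get_text(strip=True) to drop surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample. The first card mirrors the structure in
# the question; the second card is invented to show a case where the prep
# time is NOT the first <li>.
SAMPLE_HTML = """
<ul class="recipe-card-properties_root__rFiwt recipe-card-properties_allerhande__0gSBC">
  <li class="recipe-card-properties_property__87cH1">
    <svg class="allerhande-icon svg svg--svg_time"></svg>
    20 min
  </li>
  <li class="recipe-card-properties_property__87cH1">
    <svg class="allerhande-icon svg svg--svg_calories"></svg>
    545 kcal
  </li>
  <li class="recipe-card-properties_property__87cH1">
    <svg class="allerhande-icon svg svg--svg_person"></svg>
    8</li>
</ul>
<ul class="recipe-card-properties_root__rFiwt">
  <li class="recipe-card-properties_property__87cH1">
    <svg class="allerhande-icon svg svg--svg_calories"></svg>
    300 kcal
  </li>
  <li class="recipe-card-properties_property__87cH1">
    <svg class="allerhande-icon svg svg--svg_time"></svg>
    35 min
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: by position. Take the first <li> child of every properties <ul>.
# [class^="..."] matches a class attribute that starts with the given prefix.
by_position = [
    li.get_text(strip=True)
    for li in soup.select('ul[class^="recipe-card-properties_root"] > li:first-child')
]

# Option 2: by icon. Find each time icon and walk up to its enclosing <li>.
# This does not depend on the order of the <li> elements.
by_icon = [
    svg.find_parent("li").get_text(strip=True)
    for svg in soup.select("svg.svg--svg_time")
]

print(by_position)  # ['20 min', '300 kcal']
print(by_icon)      # ['20 min', '35 min']
```

If the prep time is always the first li, both options give the same result. Matching on the icon class is likely more robust when some cards omit a property or list them in a different order, assuming the site keeps using the svg--svg_time class for the time icon.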