Issue with BeautifulSoup: Some Image URLs Returning as None Despite `src` Attribute Presence

I am using BeautifulSoup in Python to extract image URLs from an HTML document. The document contains several <img> tags with the src attribute. I’ve implemented the _get_images function, which uses BeautifulSoup’s find_all("img") method to retrieve the image URLs. However, some image URLs come back as None even though the src attribute is present in the HTML.

Here’s my _get_images function:

def _get_images(self, soup):
    article_images = []
    images = soup.find_all("img")

    for img in images:
        src = img.get('src')
        article_images.append(src)

    return article_images

The output I get shows that some URLs are None, while others are correctly retrieved. I have checked the HTML structure, and the <img> tags do contain the src attribute. What could be causing this problem, and how can I resolve it to fetch all the image URLs correctly?

What could be causing this problem, and how can I resolve it to fetch all the image URLs and titles correctly? My goal is to have a list of URLs, where each URL is the src of an image, and to ensure that no None values are present in the list.

>Solution :

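Possibly the img elements are dynamic elements, i.e. they are added or completed by JavaScript after the page loads, so the static HTML that BeautifulSoup parses does not yet carry a src for every one of them.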
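One way to check this before reaching for a browser is to look at the tags that actually come back without a src. The sketch below is only illustrative: the HTML snippet, the data-src attribute (a common lazy-loading convention, not something the question confirms), and the helper name split_images_by_src are all assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, standing in for what the scraper downloaded.
STATIC_HTML = """
<img src="https://example.com/a.png">
<img data-src="https://example.com/b.png">
<img src="">
"""


def split_images_by_src(soup):
    """Return (urls, tags_without_src) for every <img> in the parsed document."""
    urls, missing = [], []
    for img in soup.find_all("img"):
        src = img.get("src")  # Tag.get returns None when the attribute is absent
        if src:
            urls.append(src)
        else:
            missing.append(img)  # no src, or an empty one
    return urls, missing


soup = BeautifulSoup(STATIC_HTML, "html.parser")
urls, missing = split_images_by_src(soup)
print(urls)
for img in missing:
    # Inspecting the attributes shows where the URL lives, if it is there at all.
    print(img.attrs)
```

If the tags in the missing list carry the URL under another attribute, it can be read from there; if they carry no URL at all, the page is most likely filling it in with JavaScript and a browser-driven approach is needed.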


Solution

To extract the value of the src attribute from the <img> elements with Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), using a locator strategy such as By.TAG_NAME:

Code block:

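def _get_images(self):
    # Assumes `driver` is an already-initialised Selenium WebDriver with the page loaded.
    # Wait up to 20 seconds for every <img> element to become visible.
    images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.TAG_NAME, "img")))
    return [img.get_attribute("src") for img in images]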

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
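
If you would rather keep the BeautifulSoup parsing from the question, one option is to let Selenium render the page and then hand driver.page_source to BeautifulSoup once the wait has returned. This is a minimal sketch, not the answer's own method: the function name images_from_rendered_page is hypothetical, and a SimpleNamespace stands in for the real driver so the example runs without a browser.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def images_from_rendered_page(driver):
    """Parse the browser's current DOM and return every non-empty img src."""
    # driver.page_source is Selenium's serialised view of the live DOM,
    # so it includes elements that JavaScript added after the initial load.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [img["src"] for img in soup.find_all("img") if img.get("src")]


# Stand-in for a real WebDriver (assumption for illustration only);
# with Selenium you would pass the actual driver after the wait has finished.
fake_driver = SimpleNamespace(
    page_source='<img src="https://example.com/a.png"><img alt="no source yet">'
)
print(images_from_rendered_page(fake_driver))
```

The list comprehension skips any tag whose src is missing or empty, so the result never contains None.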