Retrieving URL from within <content:encoded> using BeautifulSoup

February 6, 2022

I’m struggling to retrieve a link to an image from inside an RSS feed. I’m basically trying to get the URL from the ‘src=’ attribute, but none of the methods I’ve tried have been able to extract it.

<content:encoded><h4>Using sklearn’s GridSearchCV on random forest model</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M-LcJEuYvBjUFh1DhSOicA.jpeg" /><figcaption>Image by Annie Spratt via Unsplash</figcaption></figure><p>Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter <strong>overfitting,</strong> which means our machine learning model trains too specifically on our training dataset and causes higher levels of error when applied to our test/holdout datasets. Or, we may run into <strong>underfitting,</strong> which means our model doesn’t train specifically enough to our training dataset. </content:encoded>

Below is the code I’ve been trying so far.

from bs4 import BeautifulSoup
import requests

resp = requests.get("https://towardsdatascience.com/feed")
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
content_item = {}
content_item['title'] = items[0].title.text
content_item['link'] = items[0].link.text
content_item['Twitter'] = '@TDataScience'
content_item['Media'] = items[0].encoded['src']

As ever, any help you can offer would be very gratefully received.

Thanks in advance.

Solution:

The first problem is that some items do not have a <content:encoded> tag, which is why you get a NoneType error when trying to access its contents. Even if every item had that tag, you still could not read the URL directly, because the tag's content is escaped markup stored as text (as its name indicates), so items[0].encoded['src'] looks for a src attribute on the <content:encoded> tag itself rather than on the <img> inside it. You therefore need to decode the text with html.unescape() (or any other decoder that fits your needs) and parse the result as HTML before applying further operations:

import requests
import html
from bs4 import BeautifulSoup

resp = requests.get("https://towardsdatascience.com/feed")
soup = BeautifulSoup(resp.content, features='xml')
items = soup.find_all('item')

for each_item in items[:5]: # using first 5 elements just to test
    content_item = {}  # fresh dict per item, so stored results never share one object
    content_item['title'] = each_item.title.text
    content_item['link'] = each_item.link.text
    content_item['Twitter'] = '@TDataScience'

    encoded = each_item.find('content:encoded')
    if encoded is not None:
        # decode and form the new soup
        decoded_html = BeautifulSoup(html.unescape(encoded.text), 'lxml')
        img = decoded_html.find('img')
        # an article may contain no image at all
        content_item['Media'] = img.get('src') if img is not None else None
    else:
        content_item['Media'] = None

    print(content_item)

The output would look like:

{'title': '3 Steps to Getting a Job in Data with Zero Experience', 'link': 'https://towardsdatascience.com/3-steps-to-getting-a-job-in-data-with-zero-experience-ccaad96d6477?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': None}
{'title': 'AI for painting: Unraveling Neural Style Transfer', 'link': 'https://towardsdatascience.com/ai-for-painting-unraveling-neural-style-transfer-5ac08a20a580?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*DQt1CKiJSDMrzaWA'}
{'title': 'A Novel Way to Use Batch Normalization', 'link': 'https://towardsdatascience.com/a-novel-way-to-use-batch-normalization-837176d53525?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*rRQ5moh4bTSCY1zR'}
{'title': 'How to Build A Pooled OLS Regression Model For Panel Data Sets', 'link': 'https://towardsdatascience.com/how-to-build-a-pooled-ols-regression-model-for-panel-data-sets-a78358f9c2a?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*nv5gPBul4YsKGctA7b4OZg.png'}
{'title': 'Understanding the native R pipe |>', 'link': 'https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*pnjfZFqrY1opjEVNOSjcSg.png'}
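
To see what the decoding step does in isolation, here is a minimal sketch using only the standard library. The escaped string and the example.com URL are made up for illustration and are not taken from the real feed:

```python
import html

# Hypothetical entity-escaped fragment, similar in shape to what a feed may carry
escaped = '&lt;figure&gt;&lt;img alt="" src="https://example.com/a.jpeg" /&gt;&lt;/figure&gt;'

# html.unescape turns &lt; and &gt; back into real angle brackets,
# so the result can be handed to an HTML parser
decoded = html.unescape(escaped)
print(decoded)
# prints: <figure><img alt="" src="https://example.com/a.jpeg" /></figure>
```

If the text is already plain markup, html.unescape() leaves the tags as they are, so calling it is harmless in that case.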

Note that there can be multiple <img> tags inside a <content:encoded> tag; I fetched only the first one as an example.
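
If you want every image rather than only the first, something along these lines should work. This is a sketch, not part of the original answer: extract_image_urls is a hypothetical helper name, the sample markup and example.com URLs are invented, and it uses the built-in 'html.parser' backend so that lxml is not required:

```python
import html
from bs4 import BeautifulSoup


def extract_image_urls(encoded_text):
    """Return the src of every <img> found in an escaped or raw HTML fragment."""
    soup = BeautifulSoup(html.unescape(encoded_text), "html.parser")
    # skip <img> tags that have no src attribute instead of raising KeyError
    return [img["src"] for img in soup.find_all("img") if img.has_attr("src")]


# Hypothetical fragment standing in for the text of one <content:encoded> tag
sample = (
    '<h4>Title</h4><figure><img alt="" src="https://example.com/one.jpeg" /></figure>'
    '<p>Text</p><img src="https://example.com/two.png" /><img alt="no source" />'
)
urls = extract_image_urls(sample)
print(urls)
# prints: ['https://example.com/one.jpeg', 'https://example.com/two.png']
```

Inside the loop above you could then store extract_image_urls(each_item.encoded.text) under 'Media' in place of the single URL.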