As the title suggests, I’m playing around with a Twitter bot that scrapes RSS feeds and tweets the title of each article along with a link.
For some reason, when I run the code below, it runs without errors but doesn’t retrieve the URL. Any suggestions would be gratefully received.
from bs4 import BeautifulSoup
import requests
url = "https://www.kdnuggets.com/feed"
resp = requests.get(url)
soup = BeautifulSoup(resp.content)
items = soup.findAll('item')
item = items[1]
print(item.title.text)
print(item.link.text)
The title prints fine, but the link is nowhere to be found. For reference, below is a copy of the HTML that is returned for this item.
<item>
<title>An Overview of Logistic Regression</title>
<link/>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html
<comments>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread</comments>
<dc:creator><![CDATA[Matt Mayo Editor]]></dc:creator>
<pubdate>Fri, 04 Feb 2022 13:00:11 +0000</pubdate>
<category><![CDATA[2022 Feb Tutorials, Overviews]]></category>
<category><![CDATA[Machine Learning]]></category>
<guid ispermalink="false">https://www.kdnuggets.com/?p=137943</guid>
<description><![CDATA[Logistic regression is an extension of linear regression to solve classification problems. Read more on the specifics of this algorithm here.]]></description>
<wfw:commentrss>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html/feed</wfw:commentrss>
<slash:comments>0</slash:comments>
</item>
Thanks in advance.
>Solution :
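You aren’t getting item.link.text because it’s empty. An HTML parser treats <link> as a void (self-closing) element, so the URL ends up as loose text after the tag rather than inside it. The link element is parsed as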
<link/>
Try this method to get the text:
>>> item.link
<link/>
>>> item.link.findNext().text
'https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread'
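Note that findNext() returns the next tag, which here is <comments>, so the value above is the comments URL rather than the article URL. Since the article URL sits in the text node immediately after the empty <link/> tag, reading next_sibling may be a more direct option. Here is a minimal sketch; the SAMPLE_ITEM string is an assumed, trimmed copy of the raw feed item (with the URL inside <link>...</link>, as RSS normally sends it), and "html.parser" is chosen only so that the example needs nothing beyond bs4:

```python
from bs4 import BeautifulSoup

# Assumed raw form of the feed item, trimmed to the relevant tags.
SAMPLE_ITEM = """<item>
<title>An Overview of Logistic Regression</title>
<link>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html</link>
<comments>https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread</comments>
</item>"""

# An HTML parser closes <link> immediately, reproducing the problem.
soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("item")

# The URL is the text node that follows the empty <link/> tag.
link_url = str(item.link.next_sibling).strip()

print(item.title.text)
print(link_url)
```

Another option, if lxml is installed, is to pass "xml" as the second argument to BeautifulSoup so the feed is parsed as XML; in that case item.link.text should contain the URL directly.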
If you use the findNext() result, you’ll still need to strip off the trailing #disqus_thread fragment, but that’s straightforward to do.
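One way to drop the fragment is urldefrag from the standard library's urllib.parse, sketched below. The comments_url value is simply the string returned by findNext().text above; the variable names are illustrative.

```python
from urllib.parse import urldefrag

# The value returned by item.link.findNext().text in the answer above.
comments_url = "https://www.kdnuggets.com/2022/02/overview-logistic-regression.html#disqus_thread"

# urldefrag splits a URL into the part before '#' and the fragment after it.
article_url, fragment = urldefrag(comments_url)

print(article_url)
```

This relies on the comments URL being the article URL plus a fragment, which holds for the item shown here but may not hold for every feed.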