I have this XML page that I’m trying to scrape, but I’m not able to get the content of some of the tags. The ones that drop down are possible, but the others are not.
This is the page I’m trying to scrape: https://g1.globo.com/rss/g1/
I’m trying to get the ‘pubDate’ tag and when I try find_all it comes back empty, when I try with find, it comes back as None. This is my code. I’ve have tried many ways, but have failed.
rss_globo = requests.get("https://g1.globo.com/rss/g1/").content
bs_globo = BeautifulSoup(rss_globo, 'lxml')
date = bs_globo.find_all('item')
for i in date:
date = i.find('pubDate').getText()
print(date)
>Solution :
In order to work with xml you need a feature not a parser.
Here’s how:
import requests
from bs4 import BeautifulSoup
bs_globo = BeautifulSoup(
requests.get("https://g1.globo.com/rss/g1/").content,
features="xml",
)
for i in bs_globo.find_all('item'):
print(i.find('pubDate').getText())
Output:
Mon, 07 Mar 2022 15:05:42 -0000
Mon, 07 Mar 2022 15:04:53 -0000
Mon, 07 Mar 2022 15:04:41 -0000
Mon, 07 Mar 2022 15:03:38 -0000
Mon, 07 Mar 2022 15:03:15 -0000
Mon, 07 Mar 2022 15:01:14 -0000
Mon, 07 Mar 2022 15:00:37 -0000
Mon, 07 Mar 2022 15:00:26 -0000
Mon, 07 Mar 2022 15:00:09 -0000
Mon, 07 Mar 2022 15:00:04 -0000
Mon, 07 Mar 2022 14:59:32 -0000
Mon, 07 Mar 2022 14:58:46 -0000
Mon, 07 Mar 2022 14:58:04 -0000
Mon, 07 Mar 2022 14:58:02 -0000
Mon, 07 Mar 2022 14:55:24 -0000
Mon, 07 Mar 2022 14:51:20 -0000
Mon, 07 Mar 2022 14:50:45 -0000
Mon, 07 Mar 2022 14:50:22 -0000
Mon, 07 Mar 2022 14:50:07 -0000
Mon, 07 Mar 2022 14:49:01 -0000
Mon, 07 Mar 2022 14:47:23 -0000
Mon, 07 Mar 2022 14:47:21 -0000
Mon, 07 Mar 2022 14:46:34 -0000
Mon, 07 Mar 2022 14:46:31 -0000
Mon, 07 Mar 2022 14:45:45 -0000
Mon, 07 Mar 2022 14:45:02 -0000
Mon, 07 Mar 2022 14:44:37 -0000
Mon, 07 Mar 2022 14:44:16 -0000
Mon, 07 Mar 2022 14:43:37 -0000
Mon, 07 Mar 2022 14:42:56 -0000
Mon, 07 Mar 2022 14:42:39 -0000
Mon, 07 Mar 2022 14:42:16 -0000
Mon, 07 Mar 2022 14:41:51 -0000
Mon, 07 Mar 2022 14:41:41 -0000
Mon, 07 Mar 2022 14:41:35 -0000
Mon, 07 Mar 2022 14:41:09 -0000
Mon, 07 Mar 2022 14:40:38 -0000
Mon, 07 Mar 2022 14:39:27 -0000
Mon, 07 Mar 2022 14:39:15 -0000
Mon, 07 Mar 2022 14:39:13 -0000