Beautiful Soup data extract

I have a local .html file from which I am extracting data. I parse it with BeautifulSoup, but I don't know how to extract the date that is inside a div. The parsed markup is the following:

<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>

Any idea how to do it?

I already extracted the users and urls (href) with the following code:

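from bs4 import BeautifulSoup

# Read the saved followers page and parse it
with open('followers.html', 'r') as fl_html:
    soup = BeautifulSoup(fl_html.read(), 'lxml')

# Every <a> tag that has an href attribute
usernames = soup.find_all('a', href=True)

users = []
url_follower = []
for i in usernames:
    users.append(i.get_text(strip=True))
    url_follower.append(i['href'])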

Solution:

You can use the bs4 API or a CSS selector:

from bs4 import BeautifulSoup

html_doc = """<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>"""

soup = BeautifulSoup(html_doc, "html.parser")

Extracting the date using .get_text() with separator=

You can get all the text from the HTML snippet joined with a custom separator, then .split on that separator:

t = soup.get_text(strip=True, separator="|").split("|")
print(t[1])

Prints:

Jan 7, 2013, 5:41 AM
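
If the text itself might contain the separator character, a possible alternative is the .stripped_strings generator, which yields each text fragment separately so no separator is needed. This is a minimal sketch on a trimmed, well-formed copy of the snippet from the question; the variable name strings is only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the first entry from the question
html_doc = (
    '<div class="_a6-p"><div>'
    '<div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div>'
    '<div>Jan 7, 2013, 5:41 AM</div>'
    '</div></div>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# Each non-empty text fragment, already stripped of surrounding whitespace
strings = list(soup.stripped_strings)
print(strings[1])  # Jan 7, 2013, 5:41 AM
```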

CSS selector

Find the next sibling of the <div> that contains the <a>:

t = soup.select_one("div:has(a) + div")
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
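
For a file with many followers, the same idea can be applied to every entry. The sketch below assumes each entry has the same shape as the snippet from the question (an <a> in its own <div>, with the date in the next sibling <div>). The second entry, example_user, is made-up sample data, and the followers list is an illustrative name. The stricter selector div:has(> a[href]) is used so that only the <div> directly wrapping the link matches, not its ancestors:

```python
from bs4 import BeautifulSoup

# Two entries in the same shape as the question's snippet.
# The second entry is hypothetical sample data.
html_doc = """
<div class="_a6-p"><div>
  <div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div>
  <div>Jan 7, 2013, 5:41 AM</div>
</div></div>
<div class="_a6-p"><div>
  <div><a href="https://www.instagram.com/example_user" target="_blank">example_user</a></div>
  <div>Feb 3, 2014, 9:15 PM</div>
</div></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

followers = []
# Select only the <div> elements that directly wrap a link
for name_div in soup.select("div:has(> a[href])"):
    link = name_div.find("a", href=True)
    # The date is assumed to live in the next sibling <div>
    date_div = name_div.find_next_sibling("div")
    followers.append({
        "user": link.get_text(strip=True),
        "url": link["href"],
        "date": date_div.get_text(strip=True) if date_div else None,
    })

for f in followers:
    print(f["user"], f["url"], f["date"])
```

Pairing each date with its own link, instead of building separate lists and zipping them, keeps the rows aligned even if one entry has no date.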

Using bs4 API

The time must contain AM or PM, so select the <div> whose string contains one of them (string= is the current name of the argument; older bs4 versions call it text=):

t = soup.find("div", string=lambda t: t and (" AM" in t or " PM" in t))
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
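
If you need the value as a real date rather than a string, one option is datetime.strptime from the standard library. This is a sketch that assumes every date follows the format shown in the snippet and that the program runs under an English locale (the %b and %p directives are locale-dependent); the names raw and parsed are illustrative:

```python
from datetime import datetime

raw = "Jan 7, 2013, 5:41 AM"  # text taken from the date <div>

# %b = abbreviated month, %d = day, %Y = year, %I = 12-hour clock, %p = AM/PM
parsed = datetime.strptime(raw, "%b %d, %Y, %I:%M %p")
print(parsed.isoformat())  # 2013-01-07T05:41:00
```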