Beautiful Soup data extract

November 21, 2022

I have a local .html file from which I am extracting data points. I parse it with BeautifulSoup, but I don’t know how to extract the date that is inside a div. The relevant part of the parsed HTML is the following:

<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>

Any idea how to do it?

I already extracted the users and urls (href) with the following code:

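from bs4 import BeautifulSoup

# Read the local file and parse it; the with block closes the file afterwards
with open('followers.html', 'r') as fl_html:
    soup = BeautifulSoup(fl_html.read(), 'lxml')

# Every <a> tag that carries an href attribute
usernames = soup.find_all('a', href=True)

users = []
url_follower = []
for i in usernames:
    users.append(i.get_text(strip=True))
    url_follower.append(i['href'])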

Solution:

You can use the bs4 API or a CSS selector:

from bs4 import BeautifulSoup

html_doc = """<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>"""

soup = BeautifulSoup(html_doc, "html.parser")

Extracting the date using `.get_text()` with `separator=`

You can get all the text from the HTML snippet with a custom separator, then call .split() on that separator:

t = soup.get_text(strip=True, separator="|").split("|")
print(t[1])

Prints:

Jan 7, 2013, 5:41 AM
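
If the text itself could contain the separator character, one variation that avoids splitting altogether is to iterate over `.stripped_strings`. The following is a minimal sketch on a shortened copy of the same snippet; the names `strings`, `username` and `date_text` are only illustrative, and it assumes the username is always the first string and the date the second.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# .stripped_strings yields every text node with surrounding whitespace removed,
# so no separator character has to be chosen.
strings = list(soup.stripped_strings)

# Assumption: the username comes first and the date second, as in the snippet.
username, date_text = strings[0], strings[1]
print(date_text)
```

Like the `t[1]` version, this depends on the position of the date in the text, so it is best suited to a snippet that holds a single entry.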

CSS selector

Find the <div> that immediately follows the <div> containing the <a>:

t = soup.select_one("div:has(a) + div")
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
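
For a file with many followers, the same sibling relationship can be followed from each link, which keeps every username, URL and date together. This is only a sketch under assumptions: the second entry (`example_user` and its date) is made up for illustration, and it assumes every entry has the same nesting as the snippet in the question. It uses `select()` to find the links and `find_next_sibling()` to step from the link's <div> to the date <div>.

```python
from bs4 import BeautifulSoup

# Two entries in the shape shown in the question; the second one is hypothetical.
html_doc = """
<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div>
<div class="_a6-p"><div><div><a href="https://www.instagram.com/example_user" target="_blank">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for a in soup.select("div._a6-p a[href]"):
    # The link sits in its own <div>; the date is in the <div> right after it.
    date_div = a.parent.find_next_sibling("div")
    date_text = date_div.get_text(strip=True) if date_div else None
    rows.append((a.get_text(strip=True), a["href"], date_text))

for row in rows:
    print(row)
```

Looking the date up relative to each link means an entry without a date yields `None` instead of shifting the dates of later entries.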

Using `bs4` API

The time must contain AM or PM, so select the <div> whose string contains one of them (string= is the current name of the older text= argument):

t = soup.find("div", string=lambda s: s and (" AM" in s or " PM" in s))
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
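
Once the date text has been extracted, it can be converted into a `datetime` object if it needs to be sorted or compared. A minimal sketch, assuming the text always has the format shown above and that the program runs with an English locale (the `%b` and `%p` directives are locale dependent):

```python
from datetime import datetime

date_text = "Jan 7, 2013, 5:41 AM"

# %b = abbreviated month name, %d = day, %Y = four-digit year,
# %I = hour on a 12-hour clock, %M = minutes, %p = AM/PM
parsed = datetime.strptime(date_text, "%b %d, %Y, %I:%M %p")
print(parsed.isoformat())
```

`strptime` raises `ValueError` when the text does not match the format, so text taken from other pages may need a different format string.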

beautifulsoup

by MR

Published November 21, 2022
