How to get the 'name' between two tags using BeautifulSoup while crawling a website?

I’m a newbie in this area. Here is the website I need to crawl: "http://py4e-data.dr-chuck.net/comments_1430669.html" and here is its source code: "view-source:http://py4e-data.dr-chuck.net/comments_1430669.html". It’s a simple website for practice. The HTML code looks something like this:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>

I need to get the names between the two <td> tags (Melodie, Machaela, Rhoan). Below is my code:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()

soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('tr'):
    print(line)  # Result:

#===============================================================================
# <tr>
# <td>Name</td><td>Comments</td>
# </tr>
# <tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
# <tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
# <tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
# ..........
#===============================================================================

The tricky part is that there is also "<td><tr>" at the end of each line, so Python just takes it all. I’m considering a regex solution (finding the string between two substrings), but I would prefer to do it the BeautifulSoup way.

Solution:

Just select the first <td> in the <tr> to get its text:

for e in soup.find_all('tr'):
    print(e.td.text)

To skip the header row ("Name"), iterate over a sliced ResultSet instead:

for e in soup.find_all('tr')[1:]:
    print(e.td.text)
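
The same idea can be tried without network access. The sketch below is only an illustration: it assumes the table looks like the excerpt in the question, and `SAMPLE_HTML` and `extract_names` are hypothetical names introduced here, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Inline copy of the table excerpt from the question (assumed structure),
# so the example runs without fetching the page.
SAMPLE_HTML = """
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""


def extract_names(html):
    """Return the text of the first <td> of every row after the header."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tr')[1:]  # slice off the header row
    # row.td is the first <td> inside the row; rows without one are skipped
    return [row.td.get_text(strip=True) for row in rows if row.td is not None]


names = extract_names(SAMPLE_HTML)
print(names)  # ['Melodie', 'Machaela', 'Rhoan']
```

Accessing `row.td` returns only the first matching tag, which is why the second cell with the comment count is ignored.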

Example

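import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()

soup = BeautifulSoup(html, 'html.parser')

# [1:] skips the header row; e.td is the first <td> in each remaining row
for e in soup.find_all('tr')[1:]:
    print(e.td.text)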
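
If the comment count is needed along with each name, one possible variation is to read both cells of every row. This is a sketch under the assumption that every data row has two <td> cells and a <span class="comments">, as in the question's excerpt; `SAMPLE_HTML` and `extract_rows` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the table excerpt from the question (assumed structure).
SAMPLE_HTML = """
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""


def extract_rows(html):
    """Return (name, comment_count) tuples for every data row."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        span = row.find('span', class_='comments')
        # The header row has no <span class="comments">, so it is skipped
        # here without slicing the ResultSet.
        if len(cells) < 2 or span is None:
            continue
        name = cells[0].get_text(strip=True)
        count = int(span.get_text(strip=True))
        results.append((name, count))
    return results


pairs = extract_rows(SAMPLE_HTML)
print(pairs)  # [('Melodie', 100), ('Machaela', 100), ('Rhoan', 99)]
```

Filtering on the span rather than on position means the header is dropped even if it is not the first row.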