How to get the 'name' between two tags using BeautifulSoup while crawling a website?

March 8, 2022

I’m a newbie in this area. Here is the website I need to crawl: "http://py4e-data.dr-chuck.net/comments_1430669.html", and here is its source code: "view-source:http://py4e-data.dr-chuck.net/comments_1430669.html". It’s a simple website for practice. The HTML code looks something like this:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>

I need to get the names between the two <td> tags (Melodie, Machaela, Rhoan). Below is my code:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()

soup = BeautifulSoup(html, 'html.parser')

# Print every table row, tags included
for line in soup.find_all('tr'):
    print(line)

# Result:
#===============================================================================
# <tr>
# <td>Name</td><td>Comments</td>
# </tr>
# <tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
# <tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
# <tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
# ..........
#===============================================================================

The tricky part is that there is also "<td><tr>" at the end of each line, so Python just takes it all. I’m thinking about a regex solution (finding the string between two substrings), but I want to do it the BeautifulSoup way.

Solution:

Just select the first <td> in the <tr> to get its text:

for e in soup.find_all('tr'):
    print(e.td.text)
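
As a minimal offline sketch of why this works: e.td is BeautifulSoup's attribute shorthand for e.find('td'), which returns the first matching tag inside the row. The sample_html string below is a hypothetical, trimmed copy of the table from the question, used only so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed copy of the table from the question (no network needed).
sample_html = """
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# row.td is shorthand for row.find("td"): the first <td> inside each row.
first_cells = [row.td.text for row in soup.find_all("tr")]
print(first_cells)
```

Note that the header cell is still included in this list, because the header row is a <tr> like any other.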

To avoid getting the header "Name", operate on a sliced ResultSet:

for e in soup.find_all('tr')[1:]:
    print(e.td.text)
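
The slicing approach can be sketched as a small reusable function. This is only an illustration under stated assumptions: extract_names and sample_html are hypothetical names that do not appear in the original answer, and the check for a missing <td> is an added precaution for rows that contain no <td> at all.

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed copy of the table from the question (no network needed).
sample_html = """
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""


def extract_names(html):
    """Return the text of the first <td> of every row except the header row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")[1:]  # slice the ResultSet to skip the header row
    # Skip rows without any <td>, where row.td would be None.
    return [row.td.text for row in rows if row.td is not None]


names = extract_names(sample_html)
print(names)
```

The slice assumes the header is always the first <tr>. If the page layout differs, the index would need adjusting.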

Example

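import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()

soup = BeautifulSoup(html, 'html.parser')

# Prints the first cell of every row, including the header row;
# use soup.find_all('tr')[1:] to skip the header.
for e in soup.find_all('tr'):
    print(e.td.text)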