I'm new to web scraping. Here is the website I need to crawl: "http://py4e-data.dr-chuck.net/comments_1430669.html", and here is its source code: "view-source:http://py4e-data.dr-chuck.net/comments_1430669.html". It's a simple website for practice. The HTML looks something like this:
<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
I need to get the name in each row (Melodie, Machaela, Rhoan), without the surrounding tags. Below is my code:
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
for line in soup.find_all('tr'):
    print(line)  # Result:
#===============================================================================
# <tr>
# <td>Name</td><td>Comments</td>
# </tr>
# <tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
# <tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
# <tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
# ..........
#===============================================================================
The tricky part is that each line also ends with "</td></tr>", so Python just takes it all. I'm thinking about a regex solution (find the string between two substrings), but I want to do it the BeautifulSoup way.
>Solution :
Just select the first <td> in the <tr> to get its text:
for e in soup.find_all('tr'):
    print(e.td.text)
To avoid getting the header "Name", operate on a sliced ResultSet:
for e in soup.find_all('tr')[1:]:
    print(e.td.text)
Example
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('tr'):
    print(e.td.text)
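For testing without hitting the network, the same idea can be sketched against an inline copy of the rows shown in the question. This is only an illustration: `SAMPLE_HTML` and `extract_rows` are names made up here, the closing `</table>` tag is added so the snippet is complete, and the code assumes every data row has a name cell followed by a numeric comments cell.

```python
from bs4 import BeautifulSoup

# Inline sample based on the rows shown in the question (assumption: the
# real page has the same structure, just with more rows).
SAMPLE_HTML = """
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""


def extract_rows(html):
    """Return (name, comment_count) tuples, skipping the header row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # [1:] drops the first <tr>, which holds the "Name"/"Comments" header.
    for tr in soup.find_all("tr")[1:]:
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # ignore rows that do not have both cells
        name = cells[0].get_text(strip=True)
        count = int(cells[1].get_text(strip=True))
        rows.append((name, count))
    return rows


rows = extract_rows(SAMPLE_HTML)
names = [name for name, _ in rows]
print(names)  # ['Melodie', 'Machaela', 'Rhoan']
```

Using find_all('td') per row instead of e.td also gives access to the second cell, which may be handy if the comment counts are needed later.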