Parsing table data from BeautifulSoup HTML Comment

I am trying to scrape a table from https://www.baseball-reference.com/register/team.cgi?id=9995d2a1, specifically the one labeled "Team Pitching". It is hidden in an HTML comment, which prevents me from using pd.read_html() directly or another simpler method. I have gotten all of the data into a data frame, but players who have an asterisk in their name because they are left-handed disappear: their names turn into 'None'. What I really need is to remove the '*' so that the name reads correctly.

This is what I have so far, which gives 'None' as the name for lefties:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')

# collect every table that is wrapped in an HTML comment
tbls = []
for comment in page.find_all(string=lambda text: isinstance(text, Comment)):
    if "<table " in comment:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)

def parse_row(row):
    # x.string is None when a cell has more than one child node, so str() gives 'None'
    return [str(x.string) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]

# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features='lxml')

rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])

What I would like to do is loop through the text within pitching_tbl, apply .replace('*', '') wherever there is an asterisk, and have the actual HTML within pitching_tbl changed in place.

Any help is appreciated!

>Solution :

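The table you want sits inside an HTML comment. You can use BeautifulSoup's built-in Comment class with a lambda filter to find that comment, then pass its contents to pd.read_html(). Parsed this way, the left-handed players keep their names (asterisk included) instead of turning into 'None'.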

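from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

# pick the comment that holds the pitching table and let pandas parse it
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
pitching_html = [x for x in comments if 'id="div_team_pitching"' in x][0]
# StringIO avoids the deprecation warning recent pandas versions raise for literal HTML strings
df = pd.read_html(StringIO(pitching_html))[0]
print(df)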

Output:

      Rk                      Name   Age  W  L   W-L%  ...    H9   HR9   BB9   SO9  SO/W  Notes
0    1.0  Logan Bursick-Harrington  21.0  0  2  0.000  ...   4.5   0.0  15.8  15.8  1.00    NaN
1    2.0                Cylis Cox*  19.0  1  0  1.000  ...  23.1   0.0   7.7  11.6  1.50    NaN
2    3.0          Travis Densmore*  21.0  0  1  0.000  ...   7.2   0.0   1.8  14.4  8.00    NaN
3    4.0             Dylan Freeman  22.0  1  0  1.000  ...  13.5   1.1   3.4  14.6  4.33    NaN
4    5.0              Zach Hopman*  22.0  0  1  0.000  ...  12.8   0.0   9.9  11.4  1.14    NaN
5    6.0            Eamon Horwedel  22.0  1  0  1.000  ...   9.0   0.0   6.4   6.4  1.00    NaN
6    7.0             Tyler Johnson  19.0  0  0    NaN  ...   5.4   0.0   2.7  10.8  4.00    NaN
7    8.0               Trent Jones  20.0  0  0    NaN  ...  14.6   1.1   2.3  12.4  5.50    NaN
8    9.0              Tanner Knapp  21.0  1  1  0.500  ...  11.6   0.0   7.7   4.8  0.63    NaN
9   10.0              Mason Majors  22.0  1  0  1.000  ...   4.9   0.0   7.4  12.3  1.67    NaN
10  11.0               Mason Meeks  21.0  0  1  0.000  ...   6.3   0.9   3.6   5.4  1.50    NaN
11  12.0            Sam Nagelvoort  19.0  0  1  0.000  ...  18.0   2.3  22.5   9.0  0.40    NaN
12  13.0              Tyler Nichol  20.0  0  0    NaN  ...  27.0   0.0  27.0   0.0  0.00    NaN
13  14.0                Cole Russo  19.0  0  0    NaN  ...  27.0  13.5   0.0   0.0   NaN    NaN
14  15.0              Kyle Salley*  22.0  0  1  0.000  ...   9.0   2.3  22.5   9.0  0.40    NaN
15  16.0               Noah Stants  21.0  0  0    NaN  ...   4.3   1.4   7.1  11.4  1.60    NaN
16  17.0         Quinn Waterhouse*  21.0  0  0    NaN  ...   4.5   0.0   4.5  18.0  4.00    NaN
17  18.0              Nick Weyrich  19.0  0  0    NaN  ...   6.4   1.3   7.7  11.6  1.50    NaN
18  19.0              Adam Wheaton  23.0  0  1  0.000  ...  11.7   1.8   4.5  12.6  2.80    NaN
19   NaN                19 Players  20.9  5  9  0.357  ...   9.2   0.8   6.9  10.7  1.55    NaN

[20 rows x 32 columns]
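
The output above still carries the asterisk on left-handed pitchers, which is what the question wanted removed. Below is a minimal sketch of two ways to drop it. The `SAMPLE` markup is a hypothetical stand-in for the site's rows (it assumes the name is a link followed by a bare `*`, which is not confirmed here), and the small data frame only imitates the `Name` column shown above. The first part shows why `.string` came back as `None` in the question and how `get_text()` avoids that; the second strips the marker from a data frame column after `pd.read_html()`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup imitating one left-handed and one right-handed row.
SAMPLE = """
<table>
  <tr><td><a href="#">Cylis Cox</a>*</td><td>19</td></tr>
  <tr><td><a href="#">Dylan Freeman</a></td><td>22</td></tr>
</table>
"""

def parse_row(row):
    # get_text() joins all text inside a cell, so a link followed by "*"
    # still yields the name; the asterisk is then removed from every cell.
    return [td.get_text(strip=True).replace("*", "") for td in row.find_all("td")]

soup = BeautifulSoup(SAMPLE, "html.parser")

# The first cell has two children (the <a> tag and the "*" text node),
# which is the case where .string returns None.
first_cell = soup.find("td")
print(first_cell.string)

rows = [parse_row(tr) for tr in soup.find_all("tr")]
print(rows)

# Alternative: clean the column after pandas has parsed the table.
df = pd.DataFrame({"Name": ["Cylis Cox*", "Dylan Freeman"]})
# regex=False treats "*" as a literal character rather than a regex quantifier.
df["Name"] = df["Name"].str.replace("*", "", regex=False).str.strip()
print(df)
```

With the `df` produced by the answer's code, the same `str.replace` line should apply directly to its `Name` column. Neither approach edits the HTML in place; both clean the text after extraction, which is usually enough when the goal is a tidy data frame.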