Parsing table data from BeautifulSoup HTML Comment

June 16, 2022

So I am trying to get a table off of https://www.baseball-reference.com/register/team.cgi?id=9995d2a1, specifically the one labeled "Team Pitching", which is hidden in an html comment, preventing me from using pd.read_html() or another simpler method. I have gotten to the point where I have all of the data in a data frame, but my issue is that players with an asterisk in their name because they are left handed dissapear. Meaning their names turn to ‘None’, but I really need to remove the ‘*’ so that the name reads.

This is what I did to get what I have so far with the ‘None’ as a name for lefties:

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features = 'lxml')

tbls = []
for comment in page.find_all(text=lambda text: isinstance(text, Comment)):
        if comment.find("<table ") > 0:
            comment_soup = BeautifulSoup(comment, 'lxml')
            table = comment_soup.find("table")
            tbls.append(table)

def parse_row(row):
  return [str(x.string) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]

# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features = 'lxml')

rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])

What I would like to be able to do is loop through the text within the pitching_tbl text, and change it in place if there is an asterisk and use .replace(‘*’, ”), and have the actual html within pitching_tbl be changed.

any help is appriciated!

>Solution :

The desired table data is in html comment.So You can invoke beautifulsoup built-in package which is Comment with lambda function to grab data.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url='https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req=requests.get(url)
soup=BeautifulSoup(req.text,'lxml')
df = pd.read_html([x for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_team_pitching"' in x][0])[0]
print(df)

Output:

 Rk                      Name   Age  W  L   W-L%  ...    H9   HR9   BB9   SO9  SO/W  Notes
0    1.0  Logan Bursick-Harrington  21.0  0  2  0.000  ...   4.5   0.0  15.8  15.8  1.00    NaN
1    2.0                Cylis Cox*  19.0  1  0  1.000  ...  23.1   0.0   7.7  11.6  1.50    NaN
2    3.0          Travis Densmore*  21.0  0  1  0.000  ...   7.2   0.0   1.8  14.4  8.00    NaN
3    4.0             Dylan Freeman  22.0  1  0  1.000  ...  13.5   1.1   3.4  14.6  4.33    NaN
4    5.0              Zach Hopman*  22.0  0  1  0.000  ...  12.8   0.0   9.9  11.4  1.14    NaN
5    6.0            Eamon Horwedel  22.0  1  0  1.000  ...   9.0   0.0   6.4   6.4  1.00    NaN
6    7.0             Tyler Johnson  19.0  0  0    NaN  ...   5.4   0.0   2.7  10.8  4.00    NaN
7    8.0               Trent Jones  20.0  0  0    NaN  ...  14.6   1.1   2.3  12.4  5.50    NaN
8    9.0              Tanner Knapp  21.0  1  1  0.500  ...  11.6   0.0   7.7   4.8  0.63    NaN
9   10.0              Mason Majors  22.0  1  0  1.000  ...   4.9   0.0   7.4  12.3  1.67    NaN
10  11.0               Mason Meeks  21.0  0  1  0.000  ...   6.3   0.9   3.6   5.4  1.50    NaN
11  12.0            Sam Nagelvoort  19.0  0  1  0.000  ...  18.0   2.3  22.5   9.0  0.40    NaN
12  13.0              Tyler Nichol  20.0  0  0    NaN  ...  27.0   0.0  27.0   0.0  0.00    NaN
13  14.0                Cole Russo  19.0  0  0    NaN  ...  27.0  13.5   0.0   0.0   NaN    NaN
14  15.0              Kyle Salley*  22.0  0  1  0.000  ...   9.0   2.3  22.5   9.0  0.40    NaN
15  16.0               Noah Stants  21.0  0  0    NaN  ...   4.3   1.4   7.1  11.4  1.60    NaN
16  17.0         Quinn Waterhouse*  21.0  0  0    NaN  ...   4.5   0.0   4.5  18.0  4.00    NaN
17  18.0              Nick Weyrich  19.0  0  0    NaN  ...   6.4   1.3   7.7  11.6  1.50    NaN
18  19.0              Adam Wheaton  23.0  0  1  0.000  ...  11.7   1.8   4.5  12.6  2.80    NaN
19   NaN                19 Players  20.9  5  9  0.357  ...   9.2   0.8   6.9  10.7  1.55    NaN

[20 rows x 32 columns]