BeautifulSoup and Pandas read_html are not pulling all of the rows in a table

When I scrape a table from a website, the bottom 5 rows of data are missing and I do not know how to pull them. I am using a combination of BeautifulSoup and Selenium. I thought the rows were not loading, so I tried scrolling to the bottom with Selenium, but that did not work either.

Code trials:

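import bs4 as bs
import pandas as pd
from selenium import webdriver

site = 'https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One'
PATH = my_path  # path to the chromedriver executable
driver = webdriver.Chrome(PATH)
driver.get(site)
# Parse the rendered page source with BeautifulSoup
webpage = bs.BeautifulSoup(driver.page_source, features='html.parser')

# Find the schedule table by its exact class string
table = webpage.find('table', {'class': 'stats_table sortable min_width now_sortable'})
print(table.prettify())
df = pd.read_html(str(table))[0]

print(df.tail())  # the bottom 5 rows of the table are missing here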

Please could you help with scraping the full table?

Solution:

To pull all the rows from the table using only Selenium, induce WebDriverWait for visibility_of_element_located(), read the table's outerHTML, and pass it to read_html() from Pandas. You can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.stats_table.sortable.min_width.now_sortable"))).get_attribute("outerHTML")
    tabledf = pd.read_html(tabledata)
    print(tabledf)
    
  • Using XPATH:

    driver.get('https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One')
    data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='stats_table sortable min_width now_sortable']"))).get_attribute("outerHTML")
    df = pd.read_html(data)
    print(df)
    
  • Note: You have to add the following imports:

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    [              Round   Wk  Day  ...             Referee  Match Report                         Notes
    0    Regular Season    1  Sat  ...  Charles Breakspear  Match Report                           NaN
    1    Regular Season    1  Sat  ...       Andrew Davies  Match Report                           NaN
    2    Regular Season    1  Sat  ...       Kevin Johnson  Match Report                           NaN
    3    Regular Season    1  Sat  ...   Anthony Backhouse  Match Report                           NaN
    4    Regular Season    1  Sat  ...        Marc Edwards  Match Report                           NaN
    ..              ...  ...  ...  ...                 ...           ...                           ...
    685     Semi-finals  NaN  Tue  ...       Robert Madley  Match Report                    Leg 1 of 2
    686     Semi-finals  NaN  Wed  ...         Craig Hicks  Match Report                    Leg 1 of 2
    687     Semi-finals  NaN  Fri  ...        Keith Stroud  Match Report     Leg 2 of 2; Blackpool won
    688     Semi-finals  NaN  Sat  ...   Michael Salisbury  Match Report  Leg 2 of 2; Lincoln City won
    689           Final  NaN  Sun  ...     Tony Harrington  Match Report                           NaN
    
        [690 rows x 13 columns]]
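
Putting the pieces above into one script may make the flow easier to follow. This is a minimal sketch, not the answerer's exact code. The helper names `fetch_table_html` and `table_html_to_df` and the tiny sample table are hypothetical. `webdriver.Chrome()` is called without a driver path on the assumption that Selenium can locate a Chrome driver by itself. The HTML is wrapped in `StringIO` because recent pandas releases warn when `read_html` is given a literal HTML string.

```python
from io import StringIO

import pandas as pd


def table_html_to_df(table_html: str) -> pd.DataFrame:
    """Parse the outerHTML of a single <table> element into a DataFrame."""
    # read_html returns a list with one DataFrame per <table> it finds,
    # so take the first (and here only) entry.
    return pd.read_html(StringIO(table_html))[0]


def fetch_table_html(url: str, css: str, timeout: int = 20) -> str:
    """Open url in Chrome, wait until the table is visible, return its outerHTML."""
    # Selenium is imported here so the parsing helper above works without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: a Chrome driver can be found automatically
    try:
        driver.get(url)
        table = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, css))
        )
        return table.get_attribute("outerHTML")
    finally:
        driver.quit()


# Small hand-written sample standing in for the real page (hypothetical data).
SAMPLE = """
<table class="stats_table">
  <thead><tr><th>Round</th><th>Day</th></tr></thead>
  <tbody>
    <tr><td>Semi-finals</td><td>Sat</td></tr>
    <tr><td>Final</td><td>Sun</td></tr>
  </tbody>
</table>
"""

sample_df = table_html_to_df(SAMPLE)
print(sample_df.tail())
```

Against the real page, a call such as `table_html_to_df(fetch_table_html(site, "table.stats_table.sortable.min_width.now_sortable"))` should give the DataFrame directly instead of a one-element list. Note that `read_html` needs an HTML parser such as lxml to be installed.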
    
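The two ways of matching on the class attribute used above are not quite equivalent: a multi-word class string (as in the XPath `@class='...'` test or BeautifulSoup's `find` with a `class` value containing spaces) has to match the attribute text exactly, while a CSS selector such as `table.stats_table.sortable` matches the classes in any order. The sketch below illustrates this with BeautifulSoup on two hand-written, hypothetical snippets. It only illustrates selector behavior and is not claimed to explain the missing rows in the question.

```python
from bs4 import BeautifulSoup

CLASS_STRING = "stats_table sortable min_width now_sortable"
CSS_SELECTOR = "table.stats_table.sortable.min_width.now_sortable"

# Hypothetical markup: same classes, once in the original order and once shuffled.
ORDERED = f'<table class="{CLASS_STRING}"><tr><td>Final</td></tr></table>'
SHUFFLED = (
    '<table class="sortable stats_table now_sortable min_width">'
    "<tr><td>Final</td></tr></table>"
)


def find_by_class_string(html: str):
    """Match on the exact, space-separated class string."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"class": CLASS_STRING})


def find_by_css(html: str):
    """Match with a CSS selector, which ignores class order."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(CSS_SELECTOR)


for name, html in (("ordered", ORDERED), ("shuffled", SHUFFLED)):
    exact_hit = find_by_class_string(html) is not None
    css_hit = find_by_css(html) is not None
    print(f"{name}: exact class string={exact_hit}, CSS selector={css_hit}")
```

If the site ever changes the order of the classes or adds one, the CSS selector form is the more forgiving of the two.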