How do you drop a header from a pandas DataFrame built by scraping a table with BeautifulSoup? (Python)

I scraped a table from pro-football-reference and created a DataFrame, but I seem to be running into an issue because the HTML has to be converted to a string first.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table))[0]
print(rb_df.head().to_string())

Output:

  Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games     Rushing                                Unnamed: 14_level_0
                  Rk             Player                 Tm                Age                Pos     G  GS     Att   Yds  TD   1D Lng  Y/A    Y/G                 Fmb
0                  1  Jonathan Taylor*+                IND                 22                 RB    17  17     332  1811  18  107  83  5.5  106.5                   4
1                  2      Najee Harris*                PIT                 23                 RB    17  17     307  1200   7   62  37  3.9   70.6                   0
2                  3         Joe Mixon*                CIN                 25                 RB    16  16     292  1205  13   60  32  4.1   75.3                   2
3                  4     Antonio Gibson                WAS                 23                 RB    16  14     258  1037   7   65  27  4.0   64.8                   6
4                  5       Dalvin Cook*                MIN                 26                 RB    13  13     249  1159   6   57  66  4.7   89.2  

I’m trying to remove the "Unnamed: 0_level_0…" header row, but nothing I have tried has worked. Thanks in advance!

Solution:

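You’re close. The table has two header rows, so pandas builds a two-level column index whose top level is mostly "Unnamed: N_level_0" placeholders. Pass the header parameter to pandas.read_html() to pick the row that holds the real column names; header=1 selects the second row: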

pd.read_html(str(rb_table), header=1)[0]
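
To see the effect of header=1 without requesting the live page, here is a minimal offline sketch. The HTML snippet is a made-up stand-in that only mimics the two-row header layout from the question (it is not the real page), and the names SAMPLE_HTML, default_df and flat_df are illustrative. The string is wrapped in io.StringIO because recent pandas versions (2.1 and later) warn when read_html() is given literal HTML as a plain string.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the scraped table: two header rows, as in the question.
SAMPLE_HTML = """
<table id="rushing">
  <thead>
    <tr><th></th><th></th><th colspan="2">Rushing</th></tr>
    <tr><th>Rk</th><th>Player</th><th>Att</th><th>Yds</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Jonathan Taylor</td><td>332</td><td>1811</td></tr>
    <tr><td>2</td><td>Najee Harris</td><td>307</td><td>1200</td></tr>
  </tbody>
</table>
"""

# Default parsing keeps both header rows, giving a two-level column index.
default_df = pd.read_html(StringIO(SAMPLE_HTML))[0]

# header=1 uses only the second header row as the column names.
flat_df = pd.read_html(StringIO(SAMPLE_HTML), header=1)[0]

print(default_df.columns.nlevels)  # number of header levels with default parsing
print(list(flat_df.columns))       # single-level column names
```

With the default call the columns should come back with two levels; with header=1 only the inner names (Rk, Player, Att, Yds) remain.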

Example

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table), header=1)[0]
print(rb_df.head().to_string())

Output

   Rk             Player   Tm  Age  Pos   G  GS  Att   Yds  TD   1D  Lng  Y/A    Y/G  Fmb
0   1  Jonathan Taylor*+  IND   22   RB  17  17  332  1811  18  107   83  5.5  106.5    4
1   2      Najee Harris*  PIT   23   RB  17  17  307  1200   7   62   37  3.9   70.6    0
2   3         Joe Mixon*  CIN   25   RB  16  16  292  1205  13   60   32  4.1   75.3    2
3   4     Antonio Gibson  WAS   23   RB  16  14  258  1037   7   65   27  4.0   64.8    6
4   5       Dalvin Cook*  MIN   26   RB  13  13  249  1159   6   57   66  4.7   89.2    3
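
If the DataFrame has already been built with the two-level header, another option is to drop the outer level afterwards instead of re-parsing the HTML. This is an alternative to the header=1 approach, not part of the original answer. A minimal sketch, assuming a small hand-built frame as a stand-in for the scraped one (the data and the name two_level_df are illustrative):

```python
import pandas as pd

# Hand-built stand-in for the frame in the question: a two-level column header.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "Rk"),
        ("Unnamed: 1_level_0", "Player"),
        ("Rushing", "Att"),
        ("Rushing", "Yds"),
    ]
)
two_level_df = pd.DataFrame(
    [[1, "Jonathan Taylor*+", 332, 1811], [2, "Najee Harris*", 307, 1200]],
    columns=columns,
)

# Drop the outer level (level 0) of the columns, keeping only the inner names.
flat_df = two_level_df.droplevel(0, axis=1)
print(flat_df.to_string())
```

One caveat with this route: if the same inner name appears under two different groups, dropping the outer level leaves duplicate column names, so it is worth checking flat_df.columns afterwards.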