How do you drop a header from a Pandas Dataframe formed by Scraping a Table using Beautifulsoup? (Python)

March 8, 2022

I scraped a table from pro-football-reference and created a DataFrame, but I'm running into an issue that seems to stem from having to convert the HTML to a string.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table))[0]
print(rb_df.head().to_string())

Output:

  Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games     Rushing                                Unnamed: 14_level_0
                  Rk             Player                 Tm                Age                Pos     G  GS     Att   Yds  TD   1D Lng  Y/A    Y/G                 Fmb
0                  1  Jonathan Taylor*+                IND                 22                 RB    17  17     332  1811  18  107  83  5.5  106.5                   4
1                  2      Najee Harris*                PIT                 23                 RB    17  17     307  1200   7   62  37  3.9   70.6                   0
2                  3         Joe Mixon*                CIN                 25                 RB    16  16     292  1205  13   60  32  4.1   75.3                   2
3                  4     Antonio Gibson                WAS                 23                 RB    16  14     258  1037   7   65  27  4.0   64.8                   6
4                  5       Dalvin Cook*                MIN                 26                 RB    13  13     249  1159   6   57  66  4.7   89.2                   3

I’m trying to remove the "Unnamed: 0_level_0…" header row, but nothing I’ve tried has worked. Thanks in advance!

Solution:

You’re almost there. The table has two header rows, so pandas.read_html() builds a two-level column index from them. Pass the header parameter to use only the second row (index 1) as the column names:

pd.read_html(str(rb_table), header=1)[0]
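
To see the effect without hitting the network, here is a minimal sketch on a made-up miniature of the table. The HTML snippet and the names two_level and flat are assumptions for illustration, not part of the original page. It also wraps the markup in StringIO, since pandas 2.1 and later emit a deprecation warning when a literal HTML string is passed to read_html(). Like the code in the question, it assumes lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the rushing table: two header rows in <thead>.
html = """
<table id="rushing">
  <thead>
    <tr><th></th><th colspan="2">Rushing</th></tr>
    <tr><th>Player</th><th>Att</th><th>Yds</th></tr>
  </thead>
  <tbody>
    <tr><td>Jonathan Taylor</td><td>332</td><td>1811</td></tr>
    <tr><td>Najee Harris</td><td>307</td><td>1200</td></tr>
  </tbody>
</table>
"""

# Default: both header rows are used, so the columns form a MultiIndex.
two_level = pd.read_html(StringIO(html))[0]
print(two_level.columns.nlevels)

# header=1: only the second header row (0-based) becomes the column names.
flat = pd.read_html(StringIO(html), header=1)[0]
print(list(flat.columns))
```

In the real script the same idea would be pd.read_html(StringIO(str(rb_table)), header=1)[0].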

Example

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table), header=1)[0]
print(rb_df.head().to_string())

Output

	Rk	Player	Tm	Age	Pos	G	GS	Att	Yds	TD	1D	Lng	Y/A	Y/G	Fmb
0	1	Jonathan Taylor*+	IND	22	RB	17	17	332	1811	18	107	83	5.5	106.5	4
1	2	Najee Harris*	PIT	23	RB	17	17	307	1200	7	62	37	3.9	70.6	0
2	3	Joe Mixon*	CIN	25	RB	16	16	292	1205	13	60	32	4.1	75.3	2
3	4	Antonio Gibson	WAS	23	RB	16	14	258	1037	7	65	27	4	64.8	6
4	5	Dalvin Cook*	MIN	26	RB	13	13	249	1159	6	57	66	4.7	89.2	3
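
If the DataFrame has already been parsed with the two-level header, another option is to drop the top level afterwards instead of re-reading the HTML. The sketch below builds a small stand-in frame by hand (the MultiIndex and the two data rows are assumptions that mimic the structure shown in the question) and removes level 0 with droplevel().

```python
import pandas as pd

# Hypothetical stand-in for the two-level header that read_html() produces.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Games", "G"),
    ("Rushing", "Att"),
    ("Rushing", "Yds"),
])
rb_df = pd.DataFrame(
    [[1, "Jonathan Taylor*+", 17, 332, 1811],
     [2, "Najee Harris*", 17, 307, 1200]],
    columns=columns,
)

# Drop the top level (level 0), keeping only the second header row.
rb_df.columns = rb_df.columns.droplevel(0)
print(rb_df.to_string())
```

This leaves the same flat column names as header=1, which can be handy when the top-level labels are needed for some intermediate step before being discarded.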