Python: a better iteration workflow using BeautifulSoup and pandas?

April 3, 2022

Hi folks,

I’m extracting some sentences from a document and trying to build a DataFrame with BeautifulSoup and pandas, as shown below. The code repeats the same calls many times, so I suspect there is a cleaner, more idiomatic way to write it. Could you help me improve these lines? Thank you!

import pandas as pd
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, 'html.parser')

t1 = bs.find_all('h1')[1].text.replace('_room1',"")
t2 = bs.find_all('h1')[2].text.replace('_room1',"") 
t3 = bs.find_all('h1')[3].text.replace('_room1',"")
t4 = bs.find_all('h1')[4].text.replace('_room1',"")

p1 = bs.find_all('p')[3].text
p2 = bs.find_all('p')[4].text + bs.find_all('p')[5].text + bs.find_all('p')[6].text + bs.find_all('p')[7].text
p3 = bs.find_all('p')[8].text
p4 = bs.find_all('p')[9].text


data = {t1: p1,
      t2: p2,
      t3: p3,
      t4: p4}

df = pd.DataFrame(data, index=[0])

df

>Solution :

How about getting the text from all of your h1 and p elements in one go, then picking out the pieces you need:

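# Headings 1 to 4 (the question skips the first h1), with the '_room1' suffix removed
h1s = [h1.text.replace('_room1', '') for h1 in bs.select('h1')[1:5]]
ps = [p.text for p in bs.select('p')]

# ps[4:8] joins paragraphs 4 through 7, matching p2 in the question
df = pd.DataFrame(
    dict(zip(h1s, [ps[3], ''.join(ps[4:8]), ps[8], ps[9]])),
    index=[0],  # required because every value is a scalar
).T  # titles become row labels; drop .T to keep them as columns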
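
If you want to try the idea end to end, here is a minimal sketch on made-up HTML. The sample markup (five h1 and ten p elements) and the `layout` mapping are assumptions for illustration only; the positions simply mirror the indices used in the question and would need adjusting for a real document.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real document: 5 <h1> and 10 <p> elements.
html = (
    "".join(f"<h1>Title{i}_room1</h1>" for i in range(5))
    + "".join(f"<p>p{i}</p>" for i in range(10))
)
bs = BeautifulSoup(html, "html.parser")

# Parse each tag type once instead of calling find_all repeatedly.
h1s = [h1.text.replace("_room1", "") for h1 in bs.select("h1")]
ps = [p.text for p in bs.select("p")]

# Assumed layout: heading index -> slice of paragraphs that belong to it.
layout = {1: slice(3, 4), 2: slice(4, 8), 3: slice(8, 9), 4: slice(9, 10)}

# Join the paragraphs of each slice into one string per heading.
data = {h1s[i]: "".join(ps[s]) for i, s in layout.items()}

# index=[0] is needed because every value in the dict is a scalar.
df = pd.DataFrame(data, index=[0])
print(df)
```

Keeping the positions in one mapping means that a change in the document structure only requires editing `layout`, not the extraction code.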