I am trying to scrape some information from a website.
The information I want is contained within a ul>li>em structure. I have scraped tables before, but never lists.
How should I scrape it?
In addition, is there a way to collect the inner text of every <em> and put it in a DataFrame?
The <ul> basically looks like this:
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
......
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
>Solution :
Select your <ul> and, in this case, use stripped_strings to get all of its text (it returns a generator, so wrap it in list() if you need an actual list):
data = soup.select_one('ul.reportData').stripped_strings
or, to be more specific, use a list comprehension over all <em> elements:
data = [e.text for e in soup.select('ul.reportData em')]
Example
import pandas as pd
from bs4 import BeautifulSoup
html='''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
'''
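# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# stripped_strings yields each non-empty text fragment with whitespace trimmed
data = soup.select_one('ul.reportData').stripped_strings
pd.DataFrame(data, columns=['date'])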
Output
| date |
|---|
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
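The more specific variant, which selects the <em> elements directly, can be sketched the same way. This is a minimal sketch that assumes markup shaped like the sample above; the helper name em_texts_to_frame and its parameters are made up for illustration and are not part of BeautifulSoup or pandas.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample markup, assumed to match the structure shown in the question
html = '''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
'''


def em_texts_to_frame(markup, selector='ul.reportData em', column='date'):
    """Hypothetical helper: collect the text of each matched element into a DataFrame."""
    soup = BeautifulSoup(markup, 'html.parser')
    # get_text(strip=True) trims surrounding whitespace from each element's text
    texts = [e.get_text(strip=True) for e in soup.select(selector)]
    return pd.DataFrame(texts, columns=[column])


df = em_texts_to_frame(html)
print(df)
```

Selecting the <em> elements explicitly may be the safer choice if the <li> items also contain other text you do not want, since stripped_strings on the <ul> picks up every text fragment inside it.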