I am writing code to get all the professors' emails from my university's website as web scraping practice. Once what I currently have works, I will use the names to fetch each professor's individual page and then their email (I'm not worried about that right now). My question is how I can stop the list of retrieved names from including the surrounding HTML, such as:
<h4 class="profile-card__name">Nivea Canalli Bona</h4>, when all I want is "Nivea Canalli Bona".
Is there any way to do this that also makes my life easier when I run a for loop later on to get their individual pages?
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)
my_data = []
html = BeautifulSoup(data.text, 'html.parser')
for professor in html:
    name = html.select('h4.profile-card__name')
    my_data.append({"name": name})
pprint(my_data)
>Solution :
To get the text from within a tag, use its .text attribute (or call .get_text()).
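As a quick illustration of the difference, here is a minimal sketch that parses a single inline tag instead of fetching the live page. The snippet string is just the example element from the question; get_text(strip=True) is an optional variant that also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# One element copied from the question, parsed locally (no network request).
snippet = '<h4 class="profile-card__name">Nivea Canalli Bona</h4>'
tag = BeautifulSoup(snippet, 'html.parser').select_one('h4.profile-card__name')

print(tag)       # the whole element, markup included
print(tag.text)  # only the text inside the tag

# Same text, with any leading/trailing whitespace trimmed.
name = tag.get_text(strip=True)
print(name)
```

Either form gives a plain string, which is what you want to store in your list.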
Also, you shouldn’t do this:
for professor in html:
    name = html.select('h4.profile-card__name')
You're just iterating over the parsed html object and selecting all of the same data again on every pass; the loop variable professor is never used. If you print the data inside the loop, you'll see this in action:
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
Instead, your code should look like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)
my_data = []
html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')
for professor in professors:
    my_data.append(professor.text)
print(my_data)
Prints:
['David Abel', 'Maria Afzal', 'Shweta Agarwal', 'Michelle Amazeen', 'Christopher Anderson', 'Judith Austin', 'John Baynard', 'Larry Bean', 'Christopher Beaudoin', 'Lisa Liberty Becker', 'Brooks Beisch', 'Jerry Berger', 'Tobe Berkovitz', 'A. Sherrod Blakely', 'Carter Blanchard', 'Lisa Borden', 'Adam Boyajy', 'Bill Braudis', 'Barry Brodsky', 'Tatyana Bronstein', 'Kathryn Burak', 'Asad Butt', 'Nivea Canalli Bona', 'Susan Carlton']
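For the later loop over individual pages, it may help to store each name together with its profile link rather than the name alone. The following is only a sketch under assumptions: it supposes each h4 sits inside an <a> element pointing at the profile page, and the SAMPLE_HTML markup, the example paths, and the extract_professors helper are all hypothetical. Check the real page's structure in your browser's inspector and adjust the lookup accordingly.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.bu.edu/com/profiles/faculty/page/1/'

# Hypothetical markup for illustration; the real page may nest these differently.
SAMPLE_HTML = '''
<a href="/com/profiles/example-one/"><h4 class="profile-card__name">Example One</h4></a>
<a href="/com/profiles/example-two/"><h4 class="profile-card__name"> Example Two </h4></a>
<h4 class="profile-card__name">No Link</h4>
'''

def extract_professors(html_text, base_url):
    """Return a list of {'name': ..., 'url': ...} dicts, one per name heading."""
    soup = BeautifulSoup(html_text, 'html.parser')
    professors = []
    for heading in soup.select('h4.profile-card__name'):
        # Assumption: the enclosing <a>, if any, links to the profile page.
        link = heading.find_parent('a')
        href = link.get('href') if link else None
        professors.append({
            'name': heading.get_text(strip=True),
            # Resolve relative links against the listing page; None if no link found.
            'url': urljoin(base_url, href) if href else None,
        })
    return professors

professors = extract_professors(SAMPLE_HTML, BASE_URL)
for professor in professors:
    print(professor['name'], professor['url'])
```

With the real page you would pass data.text in place of SAMPLE_HTML, then loop over the resulting list and request each non-empty url.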