How to get rid of html tag line in my data list

I am writing code to get all the professors' emails from my university as web scraping practice. Once what I currently have works, I will pass the names through to get their individual pages and then their emails (not worried about that right now). My question is how I can stop the list of retrieved names from including their HTML markup, such as:
<h4 class="profile-card__name">Nivea Canalli Bona</h4>, when all I want is "Nivea Canalli Bona".

Is there any way to do this that also makes my life easier when I run a for loop later on to get their individual pages?

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')

for professor in html:

    name = html.select('h4.profile-card__name')

    my_data.append({"name": name})

pprint(my_data)

Solution:

To get only the text inside a tag, read its .text attribute (or call .get_text()).
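As a minimal sketch of the difference, the snippet below parses an inline HTML string instead of the live page so that it runs offline. The variable names `sample`, `soup`, and `tag` are only illustrative.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the real page (same tag as in the question).
sample = '<h4 class="profile-card__name">Nivea Canalli Bona</h4>'
soup = BeautifulSoup(sample, 'html.parser')

# select_one() returns the first matching Tag (or None if nothing matches).
tag = soup.select_one('h4.profile-card__name')

print(tag)                       # the whole Tag, markup included
print(tag.text)                  # only the text inside the tag
print(tag.get_text(strip=True))  # same text, with surrounding whitespace trimmed
```

Appending the Tag itself to a list is what keeps the markup in the output; appending `tag.text` stores a plain string.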

Also, you shouldn’t do this:

for professor in html:
    name = html.select('h4.profile-card__name')

You’re just iterating over the top-level nodes of the parsed HTML and, on every pass, selecting all the names again. If you print the result of each select() call, you’ll see this in action:

[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
[<h4 class="profile-card__name">David Abel</h4>, <h4 class="profile-card__name">Maria Afzal</h4>, <h4 class="profile-card__name">Shweta Agarwal</h4>, <h4 class="profile-card__name">Michelle Amazeen</h4>, <h4 class="profile-card__name">Christopher Anderson</h4>, <h4 class="profile-card__name">Judith Austin</h4>, <h4 class="profile-card__name">John Baynard</h4>, <h4 class="profile-card__name">Larry Bean</h4>, <h4 class="profile-card__name">Christopher Beaudoin</h4>, <h4 class="profile-card__name">Lisa Liberty Becker</h4>, <h4 class="profile-card__name">Brooks Beisch</h4>, <h4 class="profile-card__name">Jerry Berger</h4>, <h4 class="profile-card__name">Tobe Berkovitz</h4>, <h4 class="profile-card__name">A. Sherrod Blakely</h4>, <h4 class="profile-card__name">Carter Blanchard</h4>, <h4 class="profile-card__name">Lisa Borden</h4>, <h4 class="profile-card__name">Adam Boyajy</h4>, <h4 class="profile-card__name">Bill Braudis</h4>, <h4 class="profile-card__name">Barry Brodsky</h4>, <h4 class="profile-card__name">Tatyana Bronstein</h4>, <h4 class="profile-card__name">Kathryn Burak</h4>, <h4 class="profile-card__name">Asad Butt</h4>, <h4 class="profile-card__name">Nivea Canalli Bona</h4>, <h4 class="profile-card__name">Susan Carlton</h4>]
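The repetition can be reproduced on a small scale. This is only a sketch with a made-up two-name document (the `sample` string is an assumption, not the real page): iterating over the parsed object walks its direct children, and calling `select()` on the whole document inside that loop returns the same full list every time.

```python
from bs4 import BeautifulSoup

# Made-up miniature document; the real page is much larger.
sample = (
    '<!DOCTYPE html>'
    '<html><body>'
    '<h4 class="profile-card__name">David Abel</h4>'
    '<h4 class="profile-card__name">Maria Afzal</h4>'
    '</body></html>'
)
html = BeautifulSoup(sample, 'html.parser')

# Iterating over the soup yields its direct children, not the professors.
top_level = list(html)

# Mirrors the loop from the question: the loop variable is never used,
# and select() searches the whole document on every pass.
results = []
for node in top_level:
    names = [tag.text for tag in html.select('h4.profile-card__name')]
    results.append(names)

print(len(top_level))
for names in results:
    print(names)
```

Every entry in `results` holds the same names, one copy per top-level node, which is why the full list shows up several times.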

Instead, your code should look like this:

import requests
from bs4 import BeautifulSoup


url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')

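# select() runs once and returns every matching <h4> tag
professors = html.select('h4.profile-card__name')

for professor in professors:
    # .text drops the markup and keeps only the name
    my_data.append(professor.text)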

print(my_data)

Prints:

['David Abel', 'Maria Afzal', 'Shweta Agarwal', 'Michelle Amazeen', 'Christopher Anderson', 'Judith Austin', 'John Baynard', 'Larry Bean', 'Christopher Beaudoin', 'Lisa Liberty Becker', 'Brooks Beisch', 'Jerry Berger', 'Tobe Berkovitz', 'A. Sherrod Blakely', 'Carter Blanchard', 'Lisa Borden', 'Adam Boyajy', 'Bill Braudis', 'Barry Brodsky', 'Tatyana Bronstein', 'Kathryn Burak', 'Asad Butt', 'Nivea Canalli Bona', 'Susan Carlton']
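For the later loop over individual pages, it may help to store each name together with its profile link. The sketch below is hedged: it assumes each name sits inside an `<a>` element that links to the profile, and both that markup and the `href` values in `sample` are hypothetical, so check the real page's HTML before relying on it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'https://www.bu.edu/com/profiles/faculty/page/1/'

# ASSUMPTION: hypothetical markup and hrefs; the real page may differ.
sample = '''
<a class="profile-card" href="/com/profile/david-abel/">
  <h4 class="profile-card__name">David Abel</h4>
</a>
<a class="profile-card" href="/com/profile/maria-afzal/">
  <h4 class="profile-card__name">Maria Afzal</h4>
</a>
'''
html = BeautifulSoup(sample, 'html.parser')

my_data = []
for heading in html.select('h4.profile-card__name'):
    link = heading.find_parent('a')  # nearest enclosing <a>, or None
    my_data.append({
        'name': heading.get_text(strip=True),
        # urljoin() turns a relative href into an absolute URL
        'url': urljoin(base_url, link['href']) if link else None,
    })

for professor in my_data:
    print(professor['name'], professor['url'])
```

Keeping a list of dictionaries means the follow-up loop can request `professor['url']` directly instead of rebuilding each address from the name.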