I am trying to paginate a scraper on my university’s website.
Here is the url for one of the pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is the first name followed by the last name. (It would be first-middle-last if a middle name were given, which is a problem because my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I add the names from my firstnames and lastnames lists to my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup
url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)
my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []
html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')
for professor in professors:
    my_data.append(professor.text)
for name in my_data:
    x = name.split()
    split_names.append(x)
for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)
#\/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl +
print(firstnames)
print(lastnames)
>Solution :
Your approach of splitting the names and then joining the first, middle, and last names with "-" can work, but only if you’re sure that every profile link follows that pattern.
A more reliable option is to extract the URL (the href attribute) directly from the a tag:
import requests
from bs4 import BeautifulSoup
# define an empty list to save data on
df_list = []
# go through pages from 1 to 7
for page in range(1, 8):
    # define the current page url
    url = 'https://www.bu.edu/com/profiles/faculty/page/' + str(page)
    # make request to the current page
    response = requests.get(url)
    # parse the html into a soup object
    soup = BeautifulSoup(response.text, 'html.parser')
    # select all the professors
    professors = soup.select('a.profile-card__link')
    # go through every professor
    for professor in professors:
        # get the name by the class "profile-card__name"
        name = professor.select_one('.profile-card__name').text
        # get the title by the class "profile-card__title"
        title = professor.select_one('.profile-card__title').text
        # get the "href" of the "a" tag (the profile link)
        link = professor.get('href')
        # add the data into the list
        df_list.append((name, title, link,))
print(df_list)
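If you still want to build the URLs from the names, as asked in the question, here is a minimal sketch. It assumes the profile slug is simply the lowercased name parts joined with "-", which is only a guess based on the david-abel example and may not hold for every profile. The helpers name_to_slug and build_profile_url are hypothetical names, and "Mary Ann Smith" is a made-up name used to show a middle name.

```python
import re
import unicodedata

BASE_URL = "https://www.bu.edu/com/profile/"


def name_to_slug(name: str) -> str:
    """Hypothetical helper: turn 'David Abel' into 'david-abel'."""
    # Strip accents so that accented letters fall back to plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop punctuation such as the periods in initials (assumed behaviour).
    cleaned = re.sub(r"[^A-Za-z0-9\s-]", "", ascii_name)
    # Lowercase, split on whitespace, and join every part with a hyphen.
    # This handles first-last and first-middle-last names the same way.
    return "-".join(cleaned.lower().split())


def build_profile_url(name: str) -> str:
    """Hypothetical helper: append the slug to the base profile URL."""
    return BASE_URL + name_to_slug(name) + "/"


names = ["David Abel", "Mary Ann Smith"]
urls = [build_profile_url(n) for n in names]
print(urls)
```

Because the whole name is split and rejoined, there is no need for separate firstnames, middlenames, and lastnames lists. Any URL built this way is a guess until you request it, so the href approach above remains the safer choice.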