I am trying to paginate a scraper on my university’s website.
Here is the url for one of the pages:
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name were given, which is a problem because my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I add the names from my first-name and last-name lists to my base URL to get a corresponding URL in the layout above?
import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []

html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)

#\/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl +

print(firstnames)
print(lastnames)
Your approach of splitting each name and then joining the first, middle, and last parts with "-" can work, but only if you are sure that every profile link follows that pattern.
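As a rough illustration of that name-joining approach, here is a minimal sketch. The helper name profile_url is hypothetical, and the lowercase, hyphen-separated slug with a trailing slash is an assumption based on the david-abel example in the question; names containing punctuation or suffixes may not map this cleanly on the real site.

```python
def profile_url(name, base_url="https://www.bu.edu/com/profile/"):
    """Build a guessed profile URL from a display name (hypothetical helper)."""
    # Split on any whitespace, so first-last and first-middle-last
    # names are handled the same way.
    parts = name.lower().split()
    # Assumption: the site joins the lowercased name parts with hyphens
    # and ends the URL with a slash.
    return base_url + "-".join(parts) + "/"


print(profile_url("David Abel"))
```

Because this only guesses the URL, it is worth checking a few of the generated links by hand before relying on them.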
Instead, you should extract the URL (the href attribute) directly from the a tag:
import requests
from bs4 import BeautifulSoup

# define an empty list to save the data in
df_list = []

# go through pages 1 to 7
for page in range(1, 8):
    # define the current page url
    url = 'https://www.bu.edu/com/profiles/faculty/page/' + str(page)
    # make a request to the current page
    response = requests.get(url)
    # parse the html into a soup object
    soup = BeautifulSoup(response.text, 'html.parser')
    # select all the professors
    professors = soup.select('a.profile-card__link')
    # go through every professor
    for professor in professors:
        # get the name by the class "profile-card__name"
        name = professor.select_one('.profile-card__name').text
        # get the title by the class "profile-card__title"
        title = professor.select_one('.profile-card__title').text
        # get the "href" of the "a" tag (the profile link)
        link = professor.get('href')
        # add the data to the list
        df_list.append((name, title, link))

print(df_list)
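If any of the extracted href values turn out to be relative paths rather than full URLs, they can be resolved against the page URL. The sketch below uses only the standard library; the helper name absolute_links and the sample row are made up for illustration and are not data from the site.

```python
from urllib.parse import urljoin


def absolute_links(rows, page_url):
    """Return (name, title, link) tuples with each link made absolute."""
    # urljoin leaves links that are already absolute unchanged and
    # resolves relative ones against page_url.
    return [(name, title, urljoin(page_url, link)) for name, title, link in rows]


# Hypothetical sample row, only to show the shape of the data.
sample = [("David Abel", "Professor", "/com/profile/david-abel/")]
print(absolute_links(sample, "https://www.bu.edu/com/profiles/faculty/page/1/"))
```

The same function could be applied to df_list once the loop has finished.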