Paginating pages using things other than numbers in Python

December 30, 2022

I am trying to paginate a scraper on my university’s website.
Here is the URL for one of the pages:

https://www.bu.edu/com/profile/david-abel/

where david-abel is the first name followed by the last name. (It would be first-middle-last if a middle name were given, which is a problem because my code currently finds only first and last names.) I have a plan to deal with middle names, but my question is:

How do I add the names from my firstnames and lastnames lists to my base URL to get a URL in the layout above?

import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []


html = BeautifulSoup(data.text, 'html.parser')

professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

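for parts in split_names:
    # skip empty entries so the indexing below cannot fail
    if not parts:
        continue
    # first token is the first name, last token is the last name
    firstnames.append(parts[0])
    lastnames.append(parts[-1])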

#\/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl  # TODO: append the name here (this is the part I am asking about)


print(firstnames)
print(lastnames)

>Solution :

Your approach of splitting the name and joining the first, middle, and last names with "-" can work, but only if you are sure that every profile link follows that pattern.
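If you do want to build the URL from the name, a minimal sketch could look like the following. The helper name name_to_profile_url is made up for illustration, and the cleanup rules (lowercasing, stripping accents and punctuation) are assumptions about how the site forms its links, not something confirmed by the site.

```python
import re
import unicodedata

BASE_URL = "https://www.bu.edu/com/profile/"  # base URL from the question


def name_to_profile_url(full_name, base_url=BASE_URL):
    """Guess a profile URL by joining the lowercased name parts with '-'.

    Hypothetical helper: the real site may build its links differently.
    """
    # Assumption: accents are dropped in the link (e.g. "José" -> "jose")
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Assumption: only letters, digits, spaces, and hyphens survive
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    # Works for first-last and first-middle-last names alike
    return base_url + "-".join(cleaned.split()) + "/"


print(name_to_profile_url("David Abel"))
# https://www.bu.edu/com/profile/david-abel/
```

Because this only guesses the link, any name that the site abbreviates or spells differently will produce a URL that does not exist.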

It is more reliable to extract the URL (the href attribute) directly from the a tag instead:

import requests
from bs4 import BeautifulSoup

# define an empty list to save data on
df_list = []

# go through pages from 1 to 7
for page in range(1, 8):

    # define the current page url
    url = 'https://www.bu.edu/com/profiles/faculty/page/' + str(page)

    # make request to the current page
    response = requests.get(url)

    # parse the html into a soup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # select all the professors 
    professors = soup.select('a.profile-card__link')

    # go through every professor 
    for professor in professors:

        # get the name by the class "profile-card__name"
        name = professor.select_one('.profile-card__name').text
        
        # get the title by the class "profile-card__title"
        title = professor.select_one('.profile-card__title').text
        
        # get the "href" of the "a" tag (the profile link)
        link = professor.get('href')

        # add the data into the list
        df_list.append((name, title, link,))

print(df_list)
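
The href values collected this way may be relative or absolute depending on how the page is written. As a hedged sketch, the standard library's urljoin can normalize them before you request each profile page. The helper name absolute_links and the sample rows are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

LISTING_URL = "https://www.bu.edu/com/profiles/faculty/page/1/"


def absolute_links(rows, page_url=LISTING_URL):
    """Return (name, title, link) rows with each link resolved against page_url."""
    return [
        (name, title, urljoin(page_url, link))
        for name, title, link in rows
        if link  # skip cards that had no href
    ]


# Hypothetical rows in the same shape as df_list
sample_rows = [
    ("David Abel", "title placeholder", "/com/profile/david-abel/"),
    ("David Abel", "title placeholder", "https://www.bu.edu/com/profile/david-abel/"),
    ("No Link", "title placeholder", None),
]

for row in absolute_links(sample_rows):
    print(row[2])
```

Both remaining rows resolve to the same absolute URL, and the row without an href is dropped.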