Paginating pages using things other than numbers in Python

I am trying to paginate a scraper on my university’s website.
Here is the URL for one of the pages:

https://www.bu.edu/com/profile/david-abel/

where david-abel is the first name followed by the last name. (It would be first-middle-last if a middle name were given, which is a problem because my code currently only finds first and last names.) I have a plan for dealing with middle names, but my question is:

How do I append the names from my first- and last-name lists to my base URL to get a corresponding URL in the format above?

import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []


html = BeautifulSoup(data.text, 'html.parser')

professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

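for parts in split_names:
    # first token is the first name, last token is the last name,
    # anything in between counts as middle names
    firstnames.append(parts[0])
    lastnames.append(parts[-1])
    middlenames.append(" ".join(parts[1:-1]))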

# append each name to the base url to build a searchable profile url
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl  # unfinished: the hyphenated name should be appended here


print(firstnames)
print(lastnames)

>Solution:

Your approach of splitting the names and joining the first, middle, and last names with "-" can work, but only if you are sure that every profile link follows that pattern.
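
As a minimal sketch of that name-based approach, assuming the profile slug is simply the lowercased name parts joined by hyphens (an assumption about the site, not something it documents), a hypothetical helper `name_to_profile_url` could look like this. The accent and punctuation handling are guesses as well.

```python
import re
import unicodedata

BASE_URL = "https://www.bu.edu/com/profile/"  # base url taken from the question


def name_to_profile_url(name, base_url=BASE_URL):
    """Hypothetical helper: guess a profile url from a displayed name."""
    # strip accents and other non-ASCII characters (assumed slug rule)
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # drop periods and apostrophes outright, so "O'Neil" becomes "oneil"
    cleaned = re.sub(r"[.']", "", ascii_name.lower())
    # any remaining run of non-alphanumeric characters becomes one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")
    return base_url + slug + "/"


print(name_to_profile_url("David Abel"))
# https://www.bu.edu/com/profile/david-abel/
```

Any name the site slugs differently (nicknames, suffixes, duplicate names) would produce a wrong URL, so each guessed URL would still need to be checked.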

A more reliable option is to extract the URL (the href) directly from each profile card's a tag:

import requests
from bs4 import BeautifulSoup

# define an empty list to save data on
df_list = []

# go through pages from 1 to 7
for page in range(1, 8):

    # define the current page url
    url = 'https://www.bu.edu/com/profiles/faculty/page/' + str(page)

    # make request to the current page
    response = requests.get(url)

    # parse the html into a soup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # select all the professors 
    professors = soup.select('a.profile-card__link')

    # go through every professor 
    for professor in professors:

        # get the name by the class "profile-card__name"
        name = professor.select_one('.profile-card__name').text
        
        # get the title by the class "profile-card__title"
        title = professor.select_one('.profile-card__title').text
        
        # get the "href" of the "a" tag (the profile link)
        link = professor.get('href')

        # add the data into the list
        df_list.append((name, title, link))

print(df_list)
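
One caveat with the extracted links: an `href` can be relative (for example `/com/profile/david-abel/`). Whether this site emits relative or absolute links is not confirmed here, so the following is only a sketch under that assumption. It uses `urllib.parse.urljoin` from the standard library; the helper `absolutize_links` and the sample row are hypothetical.

```python
from urllib.parse import urljoin


def absolutize_links(rows, page_url):
    """Hypothetical helper: make the link in each (name, title, link) row absolute."""
    return [(name, title, urljoin(page_url, link)) for name, title, link in rows]


# made-up sample row, in the same shape as the entries of df_list
sample_rows = [("David Abel", "Example Title", "/com/profile/david-abel/")]
page_url = "https://www.bu.edu/com/profiles/faculty/page/1/"

absolute_rows = absolutize_links(sample_rows, page_url)
print(absolute_rows[0][2])
# https://www.bu.edu/com/profile/david-abel/
```

Links that are already absolute pass through `urljoin` unchanged, so the helper is safe to apply either way.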