How to change the code to asynchronously iterate links and IDs for scraping a web page?

January 17, 2022

I have a list of links, and each link has a corresponding ID stored in the Id list.

How can I change the code so that, when iterating over the links, the corresponding ID is substituted into the id='' string?

The full code is below:

import pandas as pd
from bs4 import BeautifulSoup
import requests

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125', 'accept': '*/*'}
links = ['https://www..ie', 'https://www..ch', 'https://www..com']
Id = ['164240372761e5178f0488d', '164240372661e5178e1b377', '164240365661e517481a1e6']

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r  # return the response so the caller can read html.text

def get_data_no_products(html):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id= '') # How to iteration paste id???????

    for item in items:
        data.append({'pn': item.find('a').get('href')})

    return print(data)

def parse():
    for i in links:
        html = get_html(i)
        get_data_no_products(html.text)
parse()

Solution:

Parametrise your code:

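def get_data_no_products(html, id_):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id=id_)  # id_ is supplied by the caller
    for item in items:
        data.append({'pn': item.find('a').get('href')})
    return data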

And then use zip():

for link, id_ in zip(links, Id):
    html = get_html(link)  # fetch the page first; the parser expects HTML, not a URL
    get_data_no_products(html.text, id_)
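
To see what zip() does on its own, here is a minimal sketch that only pairs the two lists from the question and makes no network requests. The name pairs is introduced here purely for illustration.

```python
links = ['https://www..ie', 'https://www..ch', 'https://www..com']
Id = ['164240372761e5178f0488d', '164240372661e5178e1b377', '164240365661e517481a1e6']

# zip() yields one (link, id) tuple per position and stops at the shorter list.
pairs = list(zip(links, Id))

for link, id_ in pairs:
    print(link, '->', id_)

# On Python 3.10+, zip(links, Id, strict=True) raises ValueError
# if the two lists have different lengths.
```

Since zip() silently stops at the shorter input, it is worth checking that both lists have the same length before relying on the pairing.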

Note that there’s a likely bug in your code: you return print(data), which is always None because print() itself returns None. You likely just want to return data.
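
A small sketch of the difference, using two hypothetical helper functions (with_print and with_return are not from the question) and a made-up sample value:

```python
def with_print(data):
    # print() writes to stdout and returns None, so this function returns None.
    return print(data)

def with_return(data):
    # Hands the list back to the caller so it can be used further.
    return data

sample = [{'pn': '/example-path'}]  # made-up value for illustration

result_print = with_print(sample)
result_return = with_return(sample)

print(result_print is None)
print(result_return == sample)
```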

PS

There is another solution to this, which you will frequently encounter from people who are new to Python:

for i in range(len(links)):
    link = links[i]
    id_ = Id[i]
    ...

This… works. It might even feel easier or more natural if you are coming from, e.g., C. (Then again, I’d likely use pointers…) Style is very much personal, but if you’re going to write in a high-level language like Python, you might as well avoid thinking about things like ‘the index of the current item’ as much as possible. Just my £0.02.
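
If you do need a counter (say, for progress output), one option is to combine enumerate() with zip() rather than indexing by hand. This is only a sketch; the names position and numbered are made up for the example.

```python
links = ['https://www..ie', 'https://www..ch', 'https://www..com']
Id = ['164240372761e5178f0488d', '164240372661e5178e1b377', '164240365661e517481a1e6']

numbered = []
# enumerate() supplies the counter; zip() still does the pairing.
for position, (link, id_) in enumerate(zip(links, Id), start=1):
    numbered.append((position, link, id_))
    print(f'{position}/{len(links)}: {link} -> {id_}')
```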