Any way I can get around this issue using BeautifulSoup4?


I have created a Python script using BeautifulSoup4 to look up companies' websites (main website, LinkedIn, and Facebook).

import requests
import bs4
import csv

# Print starting message
print("Starting script...")
print("------")

# Open input and output CSV files
with open('companies.csv', 'r') as f, open('output.csv', 'w', newline='', encoding="utf-8") as f_out:

    # Create CSV reader and writer objects
    reader = csv.reader(f)
    writer = csv.writer(f_out)
    # Skip the header row of the input CSV file
    next(reader)

    # Write the header row of the output CSV file
    writer.writerow(['Company', 'URL', 'LinkedIn', 'Facebook'])

    # Loop through each row in the input CSV file
    for row in reader:

        # Get the company name from the row and build the Google search URL
        company = row[0]
        url = 'https://google.com/search?q=' + company
        request_result = requests.get(url)

        # Parse the HTML response using BeautifulSoup
        soup = bs4.BeautifulSoup(request_result.text, "html.parser")
        search_results = soup.find_all("div", class_="BNeawe UPmit AP7Wnd lRVwie")

        print("------")
        try:
            # Extract the first link from the search results
            first_link = search_results[0].getText()
        
            print(f"Links for: {company}")
            print(f"{first_link}")
            
        except IndexError:
            print(f"No search results were found for {company}")
            writer.writerow([company, "", "", ""])
            continue

        # Initialize variables for LinkedIn and Facebook URLs
        linkedin_url = ''
        facebook_url = ''

        # Loop through each link in the search results
        for link in search_results:
            text = link.getText()
            if "linkedin.com" in text or "facebook.com" in text:
                    print(text)

            # If the link is a LinkedIn/Facebook URL, save it to the matching variable
            if "linkedin.com" in text:
                linkedin_url = text
            elif "facebook.com" in text:
                facebook_url = text

        # Write the result row to the output CSV file
        writer.writerow([company, first_link, linkedin_url, facebook_url])

print("Done!")

Unfortunately, when I run this code, every company falls through to the except block. When I read the returned soup, it says:

Our systems have detected unusual traffic from your computer network. This page checks to see if it’s really you sending the requests, and not a robot. This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.

I am wondering if there is anything I can do to keep it running without violating the Terms of Service. My list of companies includes around 2,000-3,000 companies.

>Solution:

You should add a delay to your main loop so that it pauses between requests.

import time

# try a 5-second pause between requests
time.sleep(5)

Depending on how many requests you are sending, you can adjust the delay.
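As a rough sketch, here is where the pause would go inside your existing loop. The random jitter is my own assumption, not anything Google documents; it just keeps the requests from arriving at perfectly regular intervals:

import random
import time

# Inside the existing "for row in reader:" loop of your script:
for row in reader:
    company = row[0]
    url = 'https://google.com/search?q=' + company
    request_result = requests.get(url)

    # ... parse the soup and write the output row as before ...

    # Pause for 5-8 seconds before the next request
    time.sleep(5 + random.uniform(0, 3))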

You could also split your list across multiple machines that don't share the same IP address.
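A minimal sketch of the splitting step, assuming each machine then runs your script on its own chunk (the companies_part*.csv file names are just placeholders):

import csv

NUM_CHUNKS = 3  # one chunk per machine; adjust to your setup

with open('companies.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

# Deal the rows round-robin into NUM_CHUNKS output files
for i in range(NUM_CHUNKS):
    with open(f'companies_part{i + 1}.csv', 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        writer.writerows(rows[i::NUM_CHUNKS])

Each machine then only sends a fraction of the total traffic from its own IP address.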
