Why does my Python for loop only keep the last table when I scrape tables from a list of URLs and convert them to a DataFrame?

I'm having trouble converting tables from a list of URLs into one large DataFrame containing the rows from every URL. The code seems to run fine, but when I export to CSV it only contains the last 10 rows from the last URL instead of the rows from each URL. Does anyone know why?

PS: I searched Stack Overflow but couldn't find an answer.

import pandas as pd
from bs4 import BeautifulSoup
import requests
#Pandas/numpy for data manipulation
import numpy as np

# URL 0 - 10 SCRAPE


BASE_URL = [
    'https://datan.fr/groupes/legislature-16/re',
    'https://datan.fr/groupes/legislature-16/rn',
    'https://datan.fr/groupes/legislature-16/lfi-nupes',
    'https://datan.fr/groupes/legislature-16/lr',
    'https://datan.fr/groupes/legislature-16/dem',
    'https://datan.fr/groupes/legislature-16/soc',
    'https://datan.fr/groupes/legislature-16/hor',
    'https://datan.fr/groupes/legislature-16/ecolo',
    'https://datan.fr/groupes/legislature-16/gdr-nupes',
    'https://datan.fr/groupes/legislature-16/liot',
]

Tous_les_groupes = []
b=0
for b in BASE_URL:

    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    #identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class" : "table"})
    print(Tableau_groupe)


try:

    for row in Tableau_groupe.find_all('tr'):
        cols = row.find_all('td')
        print(cols)

        if len(cols) == 4:
            Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
            #print(Tous_les_groupes)
except:
    pass
Groupes_DF = np.asarray(Tous_les_groupes)
#print(Groupes_DF)
#print(len(Groupes_DF))

df = pd.DataFrame(Groupes_DF)
df.columns = ['url','G', 'Tx', 'note ','Number']
#print(df.head(10))

df.to_csv('output.csv')

Thanks for your help, and have a great day everyone.

Solution:

In the first loop you assign the result of soup.find to Tableau_groupe, but each iteration overwrites the previous value, so by the time the second loop runs, only the table from the last URL is left.

Try moving the second for loop inside the first one, so each page's table is parsed before the next page overwrites it:

for b in BASE_URL:

    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    #identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class" : "table"})
    print(Tableau_groupe)


    try:

        for row in Tableau_groupe.find_all('tr'):
            cols = row.find_all('td')
            print(cols)

            if len(cols) == 4:
                Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))

    except AttributeError:
        # soup.find() returned None -- no matching table on this page
        pass
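The fix can be verified without hitting the network. Here is a minimal, self-contained sketch of the same pattern, using inline HTML strings in place of the pages requests.get() would fetch (the page keys and cell values below are made up for illustration): because parsing happens inside the URL loop, rows from every table accumulate instead of only the last one surviving.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-ins for the pages requests.get() would return (hypothetical data)
pages = {
    'url-a': '<table class="table"><tr><td>A</td><td>1</td><td>2</td><td>3</td></tr></table>',
    'url-b': '<table class="table"><tr><td>B</td><td>4</td><td>5</td><td>6</td></tr></table>',
}

rows = []
for url, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {"class": "table"})
    # Parsing happens INSIDE the loop, so no table is overwritten before use
    for tr in table.find_all('tr'):
        cols = tr.find_all('td')
        if len(cols) == 4:
            rows.append((url, *(c.text.strip() for c in cols)))

# Build the DataFrame directly from the list of tuples with named columns,
# skipping the intermediate numpy array
df = pd.DataFrame(rows, columns=['url', 'G', 'Tx', 'note', 'Number'])
print(len(df))  # 2 -- one row per page, not just the last
```

As a side note, passing rows straight to pd.DataFrame with a columns= list (as above) avoids the np.asarray step and the separate df.columns assignment, and it is safer to catch a specific exception such as AttributeError than to use a bare except, which can silently swallow the very bug being debugged.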
