How to accumulate parsed data in a DataFrame through a loop with pandas when web scraping?

I want to build a DataFrame with a historical dataset by scraping a website, but I am struggling to accumulate the full period inside the loop. I can download a single day, but when I loop over a range of dates I cannot accumulate the results in the DataFrame.

The DataFrame I want to build, covering start_date to end_date, looks like this:

    Fecha  PeríodeTU  TM°C  HRM%
    (indexed by single_date)

Here Fecha is a column added from the single_date variable in the code below, and the remaining columns are the actual data scraped from the website.

I have tried this:

from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2020, 6, 1)
end_date = date(2021, 3, 3)


for single_date in daterange(start_date, end_date):
    # Meteo.cat API URL including the date
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia=" + str(single_date) + "T00:00Z"

    # GET request to the API
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find_all('table')[2]
    df_table = pd.read_html(str(table))[0]
    df_table['Fecha'] = single_date


data['Fecha'] = df['Fecha']
data['Hora'] = df['PeríodeTU']
data['Temperatura_Media'] = df['TM°C']
data['Humedad_Relativa'] = df['HRM%']
data.to_csv('Data/tempset.csv', header=True, index=False)

df_table only keeps the last date, and I want to save the full iterated period.

Does anyone know how to deal with this situation?

>Solution :

You can append each daily DataFrame to a list and then concatenate them:

dfs = []
for single_date in daterange(start_date, end_date):
    #URL API Meteo.cat con la fecha
    url = "https://www.meteo.cat/observacions/xema/dades?codi=V3&dia="+str(single_date)+"T00:00Z"    

    # GET a la API
    res = requests.get(url)
    soup = BeautifulSoup(res.content,'lxml')
    table = soup.find_all('table')[2]
    dfs.append(pd.read_html(str(table))[0].assign(Fecha = single_date))

And finally after running the loop:

df_table = pd.concat(dfs)

This will create df_table with all the individual observations from the dataframes based on your loop.
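The same list-then-concat pattern can be checked offline with synthetic daily frames. This is a minimal sketch: the dummy values and the three-day range are illustrative, not real Meteo.cat data, and it also shows the column renaming from the question done on the combined frame.

```python
from datetime import date, timedelta

import pandas as pd

# Stand-in for the scraped daily table: one dummy row per day (no network call)
dfs = []
for n in range(3):
    single_date = date(2020, 6, 1) + timedelta(n)
    daily = pd.DataFrame({'PeríodeTU': ['00:00'],
                          'TM°C': [20.0 + n],
                          'HRM%': [60 + n]})
    # Tag every row of the daily frame with its date, then accumulate
    dfs.append(daily.assign(Fecha=single_date))

# One concat after the loop; ignore_index renumbers the rows 0..N-1
df_table = pd.concat(dfs, ignore_index=True)

# Rename to the export column names used in the question
data = df_table.rename(columns={'PeríodeTU': 'Hora',
                                'TM°C': 'Temperatura_Media',
                                'HRM%': 'Humedad_Relativa'})
print(data.shape)  # (3, 4)
```

Building the list inside the loop and calling pd.concat once at the end is also much faster than concatenating inside the loop, since each pd.concat copies all the data it receives.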
