Get links from summary section of wikipedia page: extract all the links from this wiki-page with Python

May 15, 2024

Howdy, I am trying to scrape all the links from a large Wikipedia page, the "List of Towns and Municipalities in Bavaria" ("Liste der Städte und Gemeinden in Bayern"), using Python. The trouble is that I cannot figure out how to export all of the links containing "/wiki/" to a CSV file. I am somewhat used to Python, but some things are still kind of foreign to me. Any ideas? Here is what I have so far:

the page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://en.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A")
soup = bs(res.text, "html.parser")
gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url

print(gemeinden_in_bayern)

The results are not specific enough. The output ends with unrelated links like this one:

  nt': 'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement'}

What I am really aiming for is a list like this:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind

By the way, on a side note: each of the subpages above has an infobox with information that I am already able to gather. Here is an example:

import pandas
urlpage =  'https://de.wikipedia.org/wiki/Abenberg'
data = pandas.read_html(urlpage)[0]
null = data.isnull()

for x in range(len(data)):
    first = data.iloc[x][0]
    second = data.iloc[x][1] if not null.iloc[x][1] else ""
    print(first,second,"\n")

This runs perfectly. See the output:

Basisdaten Basisdaten 
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O 
Bundesland: Bayern 
Regierungsbezirk: Mittelfranken 
Landkreis: Roth 
Höhe: 414 m ü. NHN 
Fläche: 48,41 km2 
Einwohner: 5607 (31. Dez. 2022)[1] 
Bevölkerungsdichte: 116 Einwohner je km2 
Postleitzahl: 91183 
Vorwahl: 09178 
Kfz-Kennzeichen: RH, HIP 
Gemeindeschlüssel: 09 5 76 111 
LOCODE: ABR 
Stadtgliederung: 14 Gemeindeteile 
Adresse der  Stadtverwaltung: Stillaplatz 1  91183 Abenberg 
Website: www.abenberg.de 
Erste Bürgermeisterin: Susanne König (parteilos) 
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth

That said, I found out that the infobox is a typical part of Wikipedia pages. So if I get familiar with it, I will have learned a lot for future tasks, not only for me but for many others who are diving into the topic of scraping wiki pages. So this might be a general task that is helpful and informative for many others too.

So far so good: I have a list page that leads to quite a lot of infoboxes:
https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

I think it is worth traversing them and fetching each infobox. The information could be collected with Python code that iterates over all the findings:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind

...and so on and so forth. Note: with that list I would be able to run my scraper above, which fetches the data of one infobox, over every page.
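
The traversal described here could look roughly like the sketch below. It is only an illustration under assumptions: the helper names read_infobox and collect_infoboxes and the one-second delay are made up for this example, and it assumes, as in the single-page example above, that the infobox is the first table pandas.read_html finds on each page.

```python
import time


def read_infobox(url):
    """Return the first table on the page as a {label: value} dict.

    Assumption: the infobox is the first table on the page, as in the
    Abenberg example. Repeated labels overwrite earlier ones.
    """
    import pandas  # imported here so the loop below also works with another fetcher

    table = pandas.read_html(url)[0].fillna("")
    return {str(row.iloc[0]): str(row.iloc[1]) for _, row in table.iterrows()}


def collect_infoboxes(urls, fetch=read_infobox, delay=1.0):
    """Call fetch(url) for every URL and map each URL to its result.

    Pages that raise an error are recorded as None instead of stopping
    the whole run. The delay (hypothetical default) keeps the request
    rate low.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            print(f"skipped {url}: {exc}")
            results[url] = None
        time.sleep(delay)
    return results
```

Calling collect_infoboxes with the list of town URLs would return one dictionary per town. Passing a different fetch function makes it possible to try the loop without any network access.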

>Solution :

Your selector is too broad: soup.find_all("a") matches every link on the page.

The names of the towns are in an <a> tag inside an <li> tag, which in turn is under a div with class column-multiple.

First, get all divs with class column-multiple, then get all the li items from those divs, and then get the href attribute of every <a> tag inside them.

import requests
from bs4 import BeautifulSoup

url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all the div elements with class column-multiple
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []
    for div in divs:
        # Find all li elements within the div.column-multiple
        li_items = div.find_all('li')
        for li in li_items:
            # Collect the href of every <a> tag inside the li item
            a_tags = li.find_all('a', href=True)
            href_list.extend([a['href'] for a in a_tags])
    # The hrefs are relative, so prepend the site's base URL
    for href in href_list:
        print(f"https://de.wikipedia.org{href}")

It will print what you want:

https://de.wikipedia.org/wiki/Amberg
https://de.wikipedia.org/wiki/Ansbach
https://de.wikipedia.org/wiki/Aschaffenburg
https://de.wikipedia.org/wiki/Augsburg
https://de.wikipedia.org/wiki/Bamberg
.
.
.
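
Since the question asks for a CSV export, here is a minimal sketch of that last step using only the standard library. The function names article_urls and write_urls_csv, the "url" column header, and the file name in the usage comment are assumptions made for this example. The colon check is a rough heuristic for skipping namespace pages (such as /wiki/Spezial:...), and it would also drop any article whose title contains a colon.

```python
import csv
from urllib.parse import urljoin

BASE = "https://de.wikipedia.org"


def article_urls(hrefs, base=BASE):
    """Turn relative /wiki/ hrefs into absolute URLs, in order, without duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # skip anchors and external links
        if ":" in href[len("/wiki/"):]:
            continue  # heuristic: skip namespace pages like Spezial: or Datei:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_urls_csv(urls, path):
    """Write one URL per row under a single 'url' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in urls)


# Usage with the href_list built above (file name is hypothetical):
# write_urls_csv(article_urls(href_list), "gemeinden.csv")
```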