Get links from summary section of Wikipedia page: extract all the links from this wiki page with Python

Howdy, I am trying to scrape all the links from a large Wikipedia page, the "Liste der Städte und Gemeinden in Bayern" (list of towns and municipalities in Bavaria), using Python. The trouble is that I cannot figure out how to export all of the links containing "/wiki/" to my CSV file. I know Python a bit, but some things are still kind of foreign to me. Any ideas? Here is what I have so far:

the page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

from bs4 import BeautifulSoup as bs
import requests

# The list lives on the German Wikipedia (de.), not the English one (en.)
res = requests.get("https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern")
soup = bs(res.text, "html.parser")
gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url

print(gemeinden_in_bayern)

The results do not look very specific:

  nt': 'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement'}

What I am really aiming for is a list like this:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind

By the way, a side note: each of the subpages mentioned above has an infobox with information that I am already able to gather. See this example:

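import pandas

urlpage = 'https://de.wikipedia.org/wiki/Abenberg'
# read_html returns every table on the page; here the infobox is the first one
data = pandas.read_html(urlpage)[0]
null = data.isnull()

# Print each infobox row as "label value", leaving missing values blank
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")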

This runs perfectly; see the output:

Basisdaten Basisdaten 
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O 
Bundesland: Bayern 
Regierungsbezirk: Mittelfranken 
Landkreis: Roth 
Höhe: 414 m ü. NHN 
Fläche: 48,41 km2 
Einwohner: 5607 (31. Dez. 2022)[1] 
Bevölkerungsdichte: 116 Einwohner je km2 
Postleitzahl: 91183 
Vorwahl: 09178 
Kfz-Kennzeichen: RH, HIP 
Gemeindeschlüssel: 09 5 76 111 
LOCODE: ABR 
Stadtgliederung: 14 Gemeindeteile 
Adresse der  Stadtverwaltung: Stillaplatz 1  91183 Abenberg 
Website: www.abenberg.de 
Erste Bürgermeisterin: Susanne König (parteilos) 
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth 

That said, I found out that the infobox is a typical part of a Wikipedia page. So if I get familiar with it, I will have learned a lot for future tasks, and not only for myself but for many others who are diving into the topic of scraping wiki pages. So this might be a general task, helpful and packed with information for many others too.

So far so good: I have a list page that leads to quite a lot of infoboxes:
https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

I think it is worth traversing them and fetching each infobox. The information could be gathered with Python code that iterates over all the findings:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind

…and so on and so forth. Note: with that list I would be able to run my scraper from above, which fetches the data of one infobox, over every page.
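
A minimal sketch of that traversal, assuming the town URLs are already collected in a list. The names `read_infobox` and `collect_infoboxes` are hypothetical; `read_infobox` only wraps the `pandas.read_html(url)[0]` call shown earlier, and the one-second pause between requests is an assumed politeness delay, not a documented requirement.

```python
import time


def read_infobox(url):
    """Return the first table on the page as a list of (label, value) pairs.

    Assumption: the infobox is the first table, as on the Abenberg page.
    """
    import pandas  # imported lazily so the loop below can be tried offline

    data = pandas.read_html(url)[0]
    rows = []
    for x in range(len(data)):
        first = data.iloc[x, 0]
        second = data.iloc[x, 1]
        rows.append((first, "" if pandas.isna(second) else second))
    return rows


def collect_infoboxes(urls, fetch=read_infobox, delay=1.0):
    """Fetch one infobox per URL; pages that fail are recorded as None."""
    results = {}
    for i, url in enumerate(urls):
        if i and delay:
            time.sleep(delay)  # pause between consecutive requests
        try:
            results[url] = fetch(url)
        except Exception as exc:  # e.g. network error or page without tables
            print(f"skipping {url}: {exc}")
            results[url] = None
    return results
```

Calling `collect_infoboxes(urls)` with the town URLs would return a dict keyed by URL. Passing a different `fetch` callable makes it possible to swap in another parser, or a stub for testing, without touching the loop.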

>Solution :

Your selector is wrong.

The names of the towns are in an a tag, which is inside an li tag, which in turn sits under a div with class column-multiple.

First, get all divs with class column-multiple, then get all the li items from those divs, and finally get the href attribute of every a tag inside.

import requests
from bs4 import BeautifulSoup

url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all the div elements with class column-multiple
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []
    for div in divs:
        # Find all li elements within the div.column-multiple
        li_items = div.find_all('li')
        for li in li_items:
            # Now get the href of all <a> tags in the li items
            a_tags = li.find_all('a', href=True)
            href_list.extend([a['href'] for a in a_tags])
    # The hrefs are relative, so prepend the site root when printing
    for href in href_list:
        print(f"https://de.wikipedia.org{href}")

It will print what you want:

https://de.wikipedia.org/wiki/Amberg
https://de.wikipedia.org/wiki/Ansbach
https://de.wikipedia.org/wiki/Aschaffenburg
https://de.wikipedia.org/wiki/Augsburg
https://de.wikipedia.org/wiki/Bamberg
.
.
.
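
Since the question asks for a CSV file, here is a minimal sketch of that last step. It assumes `href_list` holds relative hrefs such as `/wiki/Amberg`, as collected above. The helper names `town_rows` and `write_csv`, the column names, and the file name `gemeinden.csv` are made up for illustration. Links with a namespace prefix (a colon in the title) are skipped on the assumption that town articles have no colon in their title.

```python
import csv
from urllib.parse import unquote, urljoin

BASE = "https://de.wikipedia.org"


def town_rows(hrefs, base=BASE):
    """Turn relative /wiki/ hrefs into (name, absolute_url) rows, dropping duplicates."""
    seen = set()
    rows = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # external or non-article link
        title = href[len("/wiki/"):]
        if ":" in title or href in seen:
            continue  # namespace page (assumed not a town) or duplicate
        seen.add(href)
        # Decode percent-escapes and underscores for a readable name column
        rows.append((unquote(title).replace("_", " "), urljoin(base, href)))
    return rows


def write_csv(rows, fileobj):
    """Write a header line plus one line per town to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "url"])
    writer.writerows(rows)


# Possible usage (hypothetical file name):
# with open("gemeinden.csv", "w", newline="", encoding="utf-8") as f:
#     write_csv(town_rows(href_list), f)
```

Opening the file with `newline=""` is what the `csv` module documentation recommends, so that the writer controls the line endings itself.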