Howdy, I am trying to scrape all the links from a large Wikipedia page, the "List of Towns and Municipalities in Bavaria" (Liste der Städte und Gemeinden in Bayern), using Python. The trouble is that I cannot figure out how to export only the links containing "/wiki/" to my CSV file. I have some experience with Python, but some things are still foreign to me. Any ideas? Here is what I have so far.
The page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A
from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A")
soup = bs(res.text, "html.parser")
gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url
print(gemeinden_in_bayern)
The results are not specific enough. The dictionary contains every link with "/wiki/" in it, including footer links such as this one:
nt': 'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement'}
What I am really aiming for is a list like this:
https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind
By the way, as a side note: the subpages mentioned above carry information in an infobox, which I am already able to gather. Here is an example:
import pandas

urlpage = 'https://de.wikipedia.org/wiki/Abenberg'
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x][0]
    second = data.iloc[x][1] if not null.iloc[x][1] else ""
    print(first, second, "\n")
This runs perfectly; see the output:
Basisdaten Basisdaten
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O
Bundesland: Bayern
Regierungsbezirk: Mittelfranken
Landkreis: Roth
Höhe: 414 m ü. NHN
Fläche: 48,41 km2
Einwohner: 5607 (31. Dez. 2022)[1]
Bevölkerungsdichte: 116 Einwohner je km2
Postleitzahl: 91183
Vorwahl: 09178
Kfz-Kennzeichen: RH, HIP
Gemeindeschlüssel: 09 5 76 111
LOCODE: ABR
Stadtgliederung: 14 Gemeindeteile
Adresse der Stadtverwaltung: Stillaplatz 1 91183 Abenberg
Website: www.abenberg.de
Erste Bürgermeisterin: Susanne König (parteilos)
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth
That said, I found out that the infobox is a typical part of Wikipedia pages. So if I get familiar with it, I will have learned a lot for future tasks, and so will many others who are diving into the topic of scraping wiki pages. This might therefore be a general task that is helpful and informative for many others too.
So far so good: I have a list page that links to quite a lot of pages with infoboxes:
https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A
I think it is worth traversing them and fetching the infobox from each. The information I am looking for could be gathered with Python code that iterates over all the links found:
https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind
...and so on and so forth. Note: with that list I would be able to run my above-mentioned scraper, which fetches the data of one infobox, over every page.
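A rough sketch of such a traversal, under some assumptions: the infobox is the first table on every town page (as in the single-page snippet above), and a one-second pause between requests is acceptable. The names fetch_infobox, collect_infoboxes and delay_seconds are made up for illustration and are not part of any library.

```python
import time


def fetch_infobox(url):
    """Return the first table on the page as a {label: value} dict.

    Assumption: the infobox is table 0, as in the single-page snippet.
    """
    import pandas  # imported here so the loop below has no hard dependency on it

    table = pandas.read_html(url)[0].fillna("")
    # Repeated labels overwrite each other; only the last value is kept.
    return {str(row.iloc[0]): str(row.iloc[1]) for _, row in table.iterrows()}


def collect_infoboxes(urls, fetch_one=fetch_infobox, delay_seconds=1.0):
    """Fetch each URL in turn and record failures instead of stopping the run."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch_one(url)
        except Exception as error:  # e.g. no table found, or a network error
            failures[url] = repr(error)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return results, failures
```

Calling collect_infoboxes with the list of town URLs would return one dictionary per page plus a record of the pages that failed. Pages that have another table before the infobox would need a more specific lookup than table 0.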
>Solution :
Your selector is too broad: soup.find_all("a") matches every link on the page.
The town names are in <a> tags inside <li> tags, which in turn sit under a div with class column-multiple.
First, get all divs with class column-multiple, then get all the <li> items from those divs, and finally read the href attribute of every <a> tag inside them.
import requests
from bs4 import BeautifulSoup

url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern"
response = requests.get(url)
href_list = []  # stays empty if the request fails
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all the div elements with class column-multiple
    divs = soup.find_all('div', class_='column-multiple')
    for div in divs:
        # Find all li elements within the div.column-multiple
        li_items = div.find_all('li')
        for li in li_items:
            # Now get the href of all <a> tags in the li items
            a_tags = li.find_all('a', href=True)
            href_list.extend([a['href'] for a in a_tags])
for href in href_list:
    print(f"https://de.wikipedia.org{href}")
It will print what you want:
https://de.wikipedia.org/wiki/Amberg
https://de.wikipedia.org/wiki/Ansbach
https://de.wikipedia.org/wiki/Aschaffenburg
https://de.wikipedia.org/wiki/Augsburg
https://de.wikipedia.org/wiki/Bamberg
.
.
.
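Since the question asks for a CSV file rather than printed output, here is a minimal sketch of that last step. It assumes the hrefs are site-relative paths like the ones collected above. The helper names to_article_urls and write_urls_csv are invented for this example, and the sample list is hand-written rather than scraped.

```python
import csv
from urllib.parse import urljoin

BASE_URL = "https://de.wikipedia.org"  # assumed base, same prefix as in the print loop


def to_article_urls(hrefs, base_url=BASE_URL):
    """Keep only /wiki/ links, make them absolute, and drop duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # skip anchors, external links, and other non-article hrefs
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_urls_csv(urls, path):
    """Write one URL per row under a single 'url' header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])
        writer.writerows([url] for url in urls)


# Hand-written sample; in practice pass the scraped href list instead.
sample_hrefs = ["/wiki/Amberg", "/wiki/Ansbach", "#A", "/wiki/Amberg"]
article_urls = to_article_urls(sample_hrefs)
print(article_urls)
```

Calling write_urls_csv(article_urls, "gemeinden_in_bayern.csv") would then produce a one-column CSV file; the file name is only a placeholder.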