I am trying to find all emails on a website with the following code:
import requests
from bs4 import BeautifulSoup
website = 'http://abborup.dk/sidsteny/lejligheder-til-salg/'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
Email_list = []
for email in soup.select('a[href^=mailto]'):
    data = email['href']
    data = data.split('?')[0]
    data = data.replace('mailto:', '')
    Email_list.append(data)
The problem is that I do not get all of the mailto emails from the site. Any ideas what I’m doing wrong?
>Solution :
It looks like not all of the addresses appear as mailto links in the raw page source; some are generated by JavaScript, which requests does not execute.
You might be better off just regex’ing, something like:
import requests
import re
r = requests.get('http://abborup.dk/sidsteny/lejligheder-til-salg/')
emails = [
    '{}@{}'.format(*el)
    for el in re.findall(r'var username = "(.*?)"; var hostname = "(.*?)"', r.text)
]
I wouldn’t hold out much hope for the robustness or elegance of this approach, but it seems to work for your example.
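If you want both kinds of address in one pass, a rough sketch along these lines might help. It assumes the page builds its hidden addresses with exactly the `var username = "..."; var hostname = "..."` pattern used above. The `extract_emails` helper, the `MAILTO_PATTERN` regex, and the sample HTML are made up for illustration and are not taken from the site.

```python
import re

# Pattern from the answer above: the address is assembled from two JS variables.
JS_PATTERN = re.compile(r'var username = "(.*?)"; var hostname = "(.*?)"')
# Hypothetical pattern for plain mailto links; stops at the quote or at '?'.
MAILTO_PATTERN = re.compile(r"""href=["']mailto:([^"'?]+)""", re.IGNORECASE)


def extract_emails(html):
    """Return unique addresses: plain mailto links first, then JS-built ones."""
    found = MAILTO_PATTERN.findall(html)
    found += ['{}@{}'.format(user, host) for user, host in JS_PATTERN.findall(html)]
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(addr.strip() for addr in found))


# Made-up sample standing in for the real page source
sample = '''
<a href="mailto:info@example.com?subject=Hi">Mail</a>
<script>var username = "sales"; var hostname = "example.com";</script>
<a href="mailto:info@example.com">Mail again</a>
'''
print(extract_emails(sample))  # ['info@example.com', 'sales@example.com']
```

On the real site you would pass `r.text` from the request above instead of `sample`. It is still just regex matching, so it will break if the page changes how it writes those variables.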