BeautifulSoup doesn't get all mailto hrefs

February 12, 2022

I am trying to find all emails on a website with the following code:

import requests
from bs4 import BeautifulSoup

website = 'http://abborup.dk/sidsteny/lejligheder-til-salg/'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
Email_list = []
for email in soup.select('a[href^=mailto]'):
    data = email['href']
    data = data.split('?')[0]
    data = data.replace('mailto:', '')
    Email_list.append(data)

The problem is that I do not get all of the mailto emails from the site. Any ideas what I’m doing wrong?

>Solution :

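It looks like not all of the addresses are actual mailto links in the raw page source; some are generated by JavaScript. requests does not run scripts, so those links never exist in the HTML that BeautifulSoup parses.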
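To see the effect in isolation, here is a minimal sketch. It uses the standard library's HTMLParser as a stand-in for BeautifulSoup, and the sample markup, the addresses and the MailtoCollector class are made up for illustration, not taken from the real page. Only the anchor that exists in the static markup is found; the one written by the script is just text inside a script tag as far as the parser is concerned.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                self.hrefs.append(value)


# Hypothetical page: one static mailto link, one built by JavaScript.
sample_html = """
<a href="mailto:info@example.com">Mail us</a>
<script>
var username = "sales"; var hostname = "example.com";
document.write('<a href="mailto:' + username + '@' + hostname + '">Mail sales</a>');
</script>
"""

collector = MailtoCollector()
collector.feed(sample_html)
static_hrefs = collector.hrefs
print(static_hrefs)  # ['mailto:info@example.com']
```

The script body is never turned into tags, so the second address stays invisible to any selector that only looks at parsed anchors.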

You might be better off just using a regex on the raw text, something like:

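import re

import requests

r = requests.get('http://abborup.dk/sidsteny/lejligheder-til-salg/')
# The page appears to build each address in inline JavaScript from two
# string variables, so capture both parts and join them with '@'.
pattern = r'var username = "(.*?)"; var hostname = "(.*?)"'
emails = ['{}@{}'.format(*el) for el in re.findall(pattern, r.text)]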

I wouldn’t hold out much hope for the robustness or elegance of this approach, but it seems to work for your example.
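If you want both kinds of address in one list, a rough sketch along these lines might do. The extract_emails helper and the sample markup are hypothetical, and the script pattern assumes the username/hostname layout matched above, so treat it as a starting point rather than a general solution.

```python
import re

# Plain mailto links: capture the address, stopping before any '?subject=...' part.
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)
# Addresses assembled in JavaScript (assumed layout, same as the regex above).
SCRIPT_RE = re.compile(r'var username = "(.*?)"; var hostname = "(.*?)"')


def extract_emails(html):
    """Return addresses from static mailto links and script variables, without duplicates."""
    found = MAILTO_RE.findall(html)
    found += ['{}@{}'.format(user, host) for user, host in SCRIPT_RE.findall(html)]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


# Hypothetical markup for illustration only.
sample_html = """
<a href="mailto:info@example.com?subject=Hi">Info</a>
<script>var username = "sales"; var hostname = "example.com";
document.write('<a href="' + 'mail' + 'to:' + username + '@' + hostname + '">x</a>');</script>
<a href="mailto:info@example.com">Info again</a>
"""

print(extract_emails(sample_html))  # ['info@example.com', 'sales@example.com']
```

With the real site you would pass r.text from the request above instead of sample_html.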