I’m writing a script that goes through a list of root URLs and finds email addresses. Some sites return no results. I’ve tried to account for this in the code, following the answers to this question on SO, but I cannot figure it out.
First I’m pulling in a list of URLs:
url_list_updated = [
    'http://www.gfcadvice.com/',
    'https://trillionfinancial.com.sg/about-us/',
    'https://www.gen.com.sg/',
    'https://www.aam-advisory.com/',
    'https://www.proinvest.com.sg/',
    'http://www.gilbertkoh.com/',
    'https://dollarbureau.com/',
    'http://www.greenfieldadvisory.com/',
    'https://enpointefinancial.com/',
    'https://www.ippfa.com/',
]
Then, I’m using BeautifulSoup to find 'mailto:' and returning lists of those results:
for url in url_list_updated:
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    email_addresses = []
    for link in soup.find_all('a'):
        # if 'mailto:' != None and 'mailto:' in link.get('href'):
        # if 'mailto:' != '' and 'mailto:' in link.get('href'):
        # if 'mailto:' in link.get('href') != None:
        if 'mailto:' in link.get('href') != '':
            email_addresses.append(link.get('href').replace('mailto:', ''))
            print(email_addresses)
        else:
            pass
I know that some of the results will be empty because not every website has 'mailto:' info visible, so I’ve tried several solutions from SO for handling NoneType (left commented out for reference).
The traceback always gives me this same result, even when I’m accounting for the missing results.
7 email_addresses = []
8 for link in soup.find_all('a'):
9 # if 'mailto:' != None and 'mailto:' in link.get('href'):
10 # if 'mailto:' != '' and 'mailto:' in link.get('href'):
11 # if 'mailto:' in link.get('href') != None:
---> 12 if 'mailto:' in link.get('href') != '':
13 email_addresses.append(link.get('href').replace('mailto:', ''))
14 print(email_addresses)
TypeError: argument of type 'NoneType' is not iterable
What should I do differently?
>Solution :
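The issue is the way the condition is written.
Python chains in with other comparison operators, so 'mailto:' in link.get('href') != '' is evaluated as ('mailto:' in href) and (href != ''), where href is the value returned by link.get('href'). For an <a> tag that has no href attribute, link.get('href') returns None, and 'mailto:' in None raises the TypeError before the != '' part is ever reached. The first two commented-out attempts do not help either: 'mailto:' != None and 'mailto:' != '' compare the literal string 'mailto:' with None or '', which is always True, so they never look at the href value at all.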
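The chaining behaviour can be checked in isolation with plain strings, without any scraping. This is only a small sketch; the helper names chained and spelled_out and the sample values are made up for illustration.

```python
# Python chains `in` with other comparison operators, so
#   'mailto:' in href != ''
# is evaluated as
#   ('mailto:' in href) and (href != '')

def chained(href):
    # Same shape as the condition in the question.
    return 'mailto:' in href != ''

def spelled_out(href):
    # What the chained form expands to.
    return ('mailto:' in href) and (href != '')

# Both forms agree for ordinary string values.
for sample in ['mailto:info@example.com', 'https://example.com/', '']:
    assert chained(sample) == spelled_out(sample)

# A missing href shows up as None, and the `in` test fails
# before the `!= ''` comparison is evaluated.
try:
    chained(None)
    raised = False
except TypeError:
    raised = True

print(raised)
```

So the != '' part never protects against None; the value has to be checked before in is applied to it.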
Fetch the href once and make sure it is not None before testing it, inside the for link loop:
href = link.get('href')
if href is not None and 'mailto:' in href:
    email_addresses.append(href.replace('mailto:', ''))
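If it helps, the same check can be pulled into a small helper that works on plain href values, which makes it easy to test without hitting any website. This is a sketch under a few assumptions: the name emails_from_hrefs and the sample data are hypothetical, it uses startswith rather than the in test above (so it only accepts links that begin with mailto:), and it also drops a trailing ?subject=... part, which the original code does not do.

```python
from typing import Iterable, List, Optional

def emails_from_hrefs(hrefs: Iterable[Optional[str]]) -> List[str]:
    """Collect addresses from href values, skipping None and non-mailto links."""
    emails = []
    for href in hrefs:
        # Tags without an href attribute come through as None.
        if href is None or not href.startswith('mailto:'):
            continue
        # Drop the scheme and any '?subject=...' style query part.
        address = href[len('mailto:'):].split('?', 1)[0].strip()
        if address:
            emails.append(address)
    return emails

# Made-up href values standing in for what a page might contain.
sample_hrefs = [
    None,
    '/about',
    'mailto:info@example.com',
    'mailto:sales@example.com?subject=Hi',
    'mailto:',
]
found = emails_from_hrefs(sample_hrefs)
print(found)
```

In the loop from the question, it could be fed with (link.get('href') for link in soup.find_all('a')), since link.get('href') returns None for tags that lack the attribute.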