I'm trying to extract emails, and I'm getting a TypeError

I’m trying to pull email addresses out of 500 Word documents with re.findall and write them to an Excel file. This is the code I have so far:

import pandas as pd
from docx.api import Document
import os
import re

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path),engine='xlsxwriter')


worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []    
    
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
    data.append(match)
   

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()

print(df)

and I’m getting an error showing:

TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 19>()
     17 data = []    
     19 for wordDoc in worddocs_list:
---> 20     match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
     21     data.append(match)
     24 df = pd.DataFrame(data)

File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
    233 def findall(pattern, string, flags=0):
    234     """Return a list of all non-overlapping matches in the string.
    235 
    236     If one or more capturing groups are present in the pattern, return
   (...)
    239 
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object

What am I doing wrong here?

Many thanks.

Solution:

Your wordDoc variable doesn’t contain a string; it contains a Document object, and re.findall only accepts a string or bytes-like object. You need to look at the python-docx documentation (the package behind docx.api) to see how to get the body of the Word document out of the object as a string.
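
The failure can be reproduced without any Word files. The sketch below uses a hypothetical stand-in class, FakeDocument (it is not part of python-docx), to show that re.findall rejects an arbitrary object and accepts the same content once it is passed as a str.

```python
import re

EMAIL_PATTERN = r'[\w.+-]+@[\w-]+\.[\w.-]+'


class FakeDocument:
    """Hypothetical stand-in for a Document object; not part of python-docx."""

    def __init__(self, text):
        self.text = text


doc = FakeDocument('Contact: alice@example.com')

try:
    # Passing the object itself, as the question's code does
    re.findall(EMAIL_PATTERN, doc)
    raised = False
except TypeError:
    raised = True

# Passing a plain string works
matches = re.findall(EMAIL_PATTERN, doc.text)

print(raised)   # True
print(matches)  # ['alice@example.com']
```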

It looks like you first have to get the paragraphs with wordDoc.paragraphs and then ask each one for its text, so something like this should work:

documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
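
One caveat: wordDoc.paragraphs only covers paragraphs in the document body, so addresses that sit inside tables would be missed. Below is a minimal sketch of a helper that also walks the tables. The name document_text is hypothetical, and the SimpleNamespace objects are stand-ins that mimic the paragraphs, tables, rows, cells, and text attributes so the example runs without a Word file.

```python
import re
from types import SimpleNamespace

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def document_text(doc):
    """Hypothetical helper: join text from body paragraphs and table cells."""
    parts = [p.text for p in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            parts.extend(cell.text for cell in row.cells)
    return '\n'.join(parts)


# Stand-in objects shaped like a document with one paragraph and one table cell
cell = SimpleNamespace(text='bob@example.org')
row = SimpleNamespace(cells=[cell])
table = SimpleNamespace(rows=[row])
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text='Write to alice@example.com today')],
    tables=[table],
)

emails = EMAIL_RE.findall(document_text(fake_doc))
print(emails)  # ['alice@example.com', 'bob@example.org']
```

With a real file, the same helper would be called as document_text(Document(some_path)).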

If you’re going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time.

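regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
data = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    # Join the text of every paragraph into one string the regex can search
    documentText = '\n'.join(p.text for p in wordDoc.paragraphs)
    # Keep each document's matches instead of overwriting them on every pass
    data.append(regex.findall(documentText))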
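
Putting the pieces together, here is one possible sketch of the whole job, not the only way to structure it. The helper names (docx_files, collect_emails, write_excel) and the file/email column layout are assumptions made for illustration. It filters for .docx files, because os.listdir returns every entry in the folder and Document will typically fail on anything that is not a Word file, and it skips the ~$ lock files Word creates while a document is open. It writes the workbook with DataFrame.to_excel directly, since ExcelWriter.save() was removed in pandas 2.0.

```python
import re
from pathlib import Path

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def docx_files(folder):
    """Return the .docx paths in folder, skipping Word's '~$' lock files."""
    return sorted(
        p for p in Path(folder).glob('*.docx')
        if not p.name.startswith('~$')
    )


def collect_emails(folder):
    """Build one {'file', 'email'} row per address found in each document."""
    # python-docx is imported here so docx_files can be used without it
    from docx import Document

    rows = []
    for path in docx_files(folder):
        doc = Document(str(path))
        text = '\n'.join(p.text for p in doc.paragraphs)
        for email in EMAIL_RE.findall(text):
            rows.append({'file': path.name, 'email': email})
    return rows


def write_excel(rows, out_file):
    """Write the rows to an .xlsx file (pandas needs an engine such as openpyxl)."""
    import pandas as pd

    pd.DataFrame(rows, columns=['file', 'email']).to_excel(out_file, index=False)


# Example usage with the paths from the question:
# rows = collect_emails(r'C:\Users\user1\test')
# write_excel(rows, r'C:\Users\user1\test2\docx_emails.xlsx')
```

Storing one row per address keeps the sheet rectangular even when documents contain different numbers of emails, and the file column records where each address came from.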
    