I'm trying to extract emails, and I'm getting a TypeError

I’m attempting to pull email addresses out of 500 Word documents with re.findall and write them to Excel. This is the code I have so far:

import pandas as pd
from docx.api import Document
import os
import re

path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path),engine='xlsxwriter')

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
    data.append(match)

df = pd.DataFrame(data)


I’m getting this error:

TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 19>()
     17 data = []    
     19 for wordDoc in worddocs_list:
---> 20     match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
     21     data.append(match)
     24 df = pd.DataFrame(data)

File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
    233 def findall(pattern, string, flags=0):
    234     """Return a list of all non-overlapping matches in the string.
    236     If one or more capturing groups are present in the pattern, return
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object

What am I doing wrong here?

Many thanks.

Solution:

Your wordDoc variable doesn’t contain a string; it contains a Document object, and re.findall only accepts a string or bytes-like object. You need to look at the python-docx documentation to see how to get the body of the Word document out of the object as a string.

It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?

documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
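To see the difference without opening a real file, here is a minimal sketch that uses a stand-in object in place of a python-docx Document. The names fake_doc, error_message, document_text and emails are hypothetical, and the stand-in only mimics the .paragraphs / .text attributes described above.

```python
import re
from types import SimpleNamespace

pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'

# Hypothetical stand-in for a Document: an object with a .paragraphs list
# whose items each have a .text string.
fake_doc = SimpleNamespace(paragraphs=[
    SimpleNamespace(text='Contact alice@example.com'),
    SimpleNamespace(text='or bob@example.org today'),
])

# Passing the object itself fails, as in the question's traceback.
try:
    re.findall(pattern, fake_doc)
except TypeError as exc:
    error_message = str(exc)

# Joining the paragraph texts gives findall the string it expects.
document_text = '\n'.join(p.text for p in fake_doc.paragraphs)
emails = re.findall(pattern, document_text)
print(emails)
```

With a real document, fake_doc would simply be the object returned by Document(...).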

If you’re going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time.

regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
data = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    # Collect the matches from every file instead of overwriting them each time
    data.extend(regex.findall(documentText))
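Putting the pieces together, the loop might be wrapped up as in the sketch below. This is one possible arrangement, not the only one: collect_email_rows, load_document, EMAIL_RE and the 'file' / 'email' keys are hypothetical names, and skipping non-.docx files and Word’s "~$" lock files is an assumption about what the folder contains.

```python
import os
import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def collect_email_rows(folder, load_document):
    """Return one {'file': ..., 'email': ...} dict per address found in folder.

    load_document is whatever opens a path and returns an object with
    .paragraphs; with python-docx that would be Document.
    """
    rows = []
    for filename in sorted(os.listdir(folder)):
        # Assumption: only .docx files matter, and "~$" lock files are skipped.
        if not filename.lower().endswith('.docx') or filename.startswith('~$'):
            continue
        doc = load_document(os.path.join(folder, filename))
        text = '\n'.join(p.text for p in doc.paragraphs)
        for email in EMAIL_RE.findall(text):
            rows.append({'file': filename, 'email': email})
    return rows
```

Assuming python-docx, pandas and an Excel engine are installed, rows = collect_email_rows(path, Document) followed by pd.DataFrame(rows).to_excel(os.path.join(output_path, 'docx_emails.xlsx'), index=False) should produce the spreadsheet. Note that paragraphs only covers body paragraphs, so addresses inside tables would need a separate pass over wordDoc.tables.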
