Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

TypeError: XXXXX got an unexpected keyword argument 'XXXXXX'

I’m getting an unexpected keyword argument from running a code. Source : https://sempioneer.com/python-for-seo/how-to-extract-text-from-multiple-webpages-in-python/
Anybody can help ? thanks

running below code :

single_url = 'https://understandingdata.com/'
text = extract_text_from_single_web_page(url=single_url)
print(text)

gives below error :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10260/3606377172.py in <module>
      1 single_url = 'https://understandingdata.com/'
----> 2 text = extract_text_from_single_web_page(url=single_url)
      3 print(text)

~\AppData\Local\Temp/ipykernel_10260/850098094.py in extract_text_from_single_web_page(url)
     42     try:
     43         a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
---> 44                             date_extraction_params={'extensive_search': True, 'original_date': True})
     45     except AttributeError:
     46         a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,

TypeError: extract() got an unexpected keyword argument 'json_output'

the code for "extract_text_from_single_web_page(url=single_url)

def extract_text_from_single_web_page(url):
    
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # This line will handle for any failures in both the Trafilature and BeautifulSoup4 functions:
                return np.nan
        # Handling for any URLs that don't have the correct protocol
        except MissingSchema:
            return np.nan

>Solution :

As suggested in my comment, the best option is to find a tutorial that doesn’t use trafilatura, since that seems to be the thing that’s broken. However, it’s pretty simple to modify this particular function to avoid it and just use the fallback:

def extract_text_from_single_web_page(url):
    try:
        resp = requests.get(url)
        # We will only extract the text from successful requests:
        if resp.status_code == 200:
            return beautifulsoup_extract_text_fallback(resp.content)
        else:
            # This line will handle for any failures in the BeautifulSoup4 function:
            return np.nan
    # Handling for any URLs that don't have the correct protocol
    except MissingSchema:
        return np.nan
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading