How to decode the string encoded by document.write in BeautifulSoup, Python?

October 2, 2022

As the title says, I've been stuck on this for hours and can't find any documentation or solution.
This is the website where I started: https://idhsaa.org/directory. I can't get the email addresses there, nor on the individual school pages that open when you click a school name.

The format that I found is something like this:

<p>
    <script>
        document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOnNwZWNrZXI3M0B5YWhvby5jb20nPkVtYWlsPC9hPg=='));
    </script>
    <a href="mailto:pincockt@aberdeen58.org">Email</a>
    </br>
</p>

I managed to extract the encoded string, which looks something like this:

mailto:<script>document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));</script>

The question is: how do I decode this to get the email addresses?
Judging by the output above, I assume I need to decode that string to get the actual email.

Here's the code I've been working on:

import requests
from bs4 import BeautifulSoup


def url_parser(url):
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    }
    html_doc = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    table = soup.find('table').find('tbody')
    rows = table.find_all('tr')

    data = []
    for row in rows:
        school_name = row.find_all('a')

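        for school in school_name:
            # href may be missing on some anchors, so fall back to ''
            school_web_id = (school.get('href') or '').lstrip('/')
            if 'school?' in school_web_id:
                # Assumption: the school page URL is the directory URL with
                # '/directory' swapped for the link's href
                school_website = url.replace('/directory', f'/{school_web_id}')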

                school_site = url_parser(school_website)
                principal_email_encoded = school_site.find_all('a')
                for principal_email in principal_email_encoded:
                    email = principal_email.get('href') or ''
                    if 'mailto:<script>' in email:
                        print(email.replace('mailto:<script>', '').replace(';</script>', ''))



def main():
    url = "https://idhsaa.org/directory"
    data_fetch(url)


if __name__ == "__main__":
    main()

>Solution :

These are Base64-encoded strings (window.atob is the browser's Base64 decoder). You can decode the value with the base64 module from the Python standard library.

For example, after extracting the encoded string you can do the following:

import base64
encoded_str = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded_html = base64.b64decode(encoded_str).decode("utf-8")
print(decoded_html)

Output:

<a href='mailto:admin@idhsaa.org'>admin@idhsaa.org</a>
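
To go from the raw page text to the address itself, one option is to pull every atob(...) payload out with a regular expression, decode it, and then pick the mailto: address out of the decoded HTML. The sketch below uses only the standard library. The function name extract_emails and both regular expressions are illustrative assumptions, not part of the original code, and they assume the payload is a quoted Base64 literal as in the question's sample.

```python
import base64
import re

# Assumed pattern: atob('...') or atob("...") with a literal Base64 payload.
ATOB_RE = re.compile(r"atob\(\s*['\"]([A-Za-z0-9+/=]+)['\"]\s*\)")
# Captures the address after "mailto:", stopping at a quote, '?', '<', '>' or whitespace.
MAILTO_RE = re.compile(r"mailto:([^'\"?<>\s]+)")


def extract_emails(html):
    """Return addresses found inside decoded atob(...) payloads, in page order."""
    emails = []
    for payload in ATOB_RE.findall(html):
        decoded = base64.b64decode(payload).decode("utf-8")
        for address in MAILTO_RE.findall(decoded):
            if address not in emails:  # skip duplicates, keep first occurrence
                emails.append(address)
    return emails


# The href value from the question, used here as sample input.
sample = (
    "mailto:<script>document.write(window.atob("
    "'PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'"
    "));</script>"
)
print(extract_emails(sample))  # ['admin@idhsaa.org']
```

In the scraper above, the same function could be called on requests.get(url, headers=headers).text before parsing, or on each href string that contains the script tag. The decoded HTML could also be handed to BeautifulSoup instead of the second regular expression.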

python-requests

by MR

Published October 02, 2022
