How to decode the string encoded by document.write in BeautifulSoup, Python?

As the title says, I've been stuck on this for hours and can't find any documentation or solution.
I started from this website: https://idhsaa.org/directory. I cannot extract the email addresses, either on this page or on the individual school pages that open when you click a school name.

The format that I found is something like this:

<p>
    <script>
        document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOnNwZWNrZXI3M0B5YWhvby5jb20nPkVtYWlsPC9hPg=='));
    </script>
    <a href="mailto:pincockt@aberdeen58.org">Email</a>
    </br>
</p>

I managed to extract the encoded value, which looks something like this:

mailto:<script>document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));</script>

The question is: how do I decode this to get the email addresses?
Judging from the output above, I assume I need to decode that string to get the actual email address.

Here’s the code that I’d been working on:

import requests
from bs4 import BeautifulSoup


def url_parser(url):
    # Fetch the page with a browser-like User-Agent and parse the HTML.
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    }
    html_doc = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    table = soup.find('table').find('tbody')
    rows = table.find_all('tr')

    data = []
    for row in rows:
        school_name = row.find_all('a')

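        for school in school_name:
            href = school.get('href') or ''
            if 'school?' in href:
                # Assumption: school_web_id was undefined in the original; the link's href is used here.
                school_web_id = href.lstrip('/')
                school_website = url.replace('/directory', f'/{school_web_id}')

                school_site = url_parser(school_website)
                principal_email_encoded = school_site.find_all('a')
                for principal_email in principal_email_encoded:
                    email = principal_email.get('href') or ''
                    # The original checked for 'maito:', which never matches.
                    if 'mailto:<script>' in email:
                        print(email.replace('mailto:<script>', '').replace(';</script>', ''))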



def main():
    url = "https://idhsaa.org/directory"
    data_fetch(url)


if __name__ == "__main__":
    main()

>Solution :

These are Base64-encoded strings. You can decode them with the base64 module from the Python standard library.

For example, after extracting the encoded string you can do the following:

import base64
encoded_str = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded_html = base64.b64decode(encoded_str).decode("utf-8")
print(decoded_html)

Output:

<a href='mailto:admin@idhsaa.org'>admin@idhsaa.org</a>
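
The answer decodes a string that has already been extracted. A minimal sketch of the extraction step as well, assuming the encoded emails always appear inside an atob('...') call as in the snippets from the question. The names extract_emails, ATOB_PATTERN, and MAILTO_PATTERN are hypothetical, and the sketch uses only the standard library (re and base64) instead of BeautifulSoup:

```python
import base64
import re

# Matches the Base64 payload in calls such as window.atob('...') or atob("...").
ATOB_PATTERN = re.compile(r"""atob\(\s*['"]([A-Za-z0-9+/=]+)['"]\s*\)""")
# Matches the address that follows "mailto:" in the decoded HTML.
MAILTO_PATTERN = re.compile(r"mailto:([^'\"?>\s]+)")


def extract_emails(html):
    """Return the email addresses hidden in atob(...) calls, in page order."""
    emails = []
    for payload in ATOB_PATTERN.findall(html):
        # Each payload decodes to a small HTML fragment containing a mailto link.
        decoded = base64.b64decode(payload).decode("utf-8")
        emails.extend(MAILTO_PATTERN.findall(decoded))
    return emails


# Sample markup modeled on the snippet in the question.
sample = """
<p>
    <script>
        document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));
    </script>
</p>
"""

print(extract_emails(sample))  # prints ['admin@idhsaa.org']
```

Because the pattern runs on raw text, the unparsed response body (html_doc in the question's url_parser) could be passed to extract_emails directly. It would also match the mailto:<script>...</script> form that shows up inside href attributes.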