How can I get the JSON out of a webpage?

I’m trying to parse data out of this webpage, but I don’t need the whole dataset. I just need:

  • The operator name (Google, CloudFlare, etc.)
  • The description (Google ‘Argon2022’ log, Google ‘Argon2023’ log, etc.)
  • The logIDs (KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=)

I tried to write some code, but I’m a beginner at web scraping, so I was wondering if anyone could help. Here is my attempt, using the lxml and requests libraries:

import requests
from lxml import html

page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)

#This will create a list of operators:
operators = tree.xpath('//span[@class="operators"]/text()')

print('Operators: ',operators)

My hope is to end up with something that looks like the JSON on the website, minus all the unneeded info, for example:

[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]

Solution:

First, you want to access the raw file rather than the web UI. As Kache mentioned, appending ?format=TEXT to the URL returns the file contents base64-encoded, so you can get the JSON using:

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))
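
To see the decoding step in isolation, here is a minimal offline sketch. It assumes the response body is base64-encoded JSON text, as the snippet above implies. The `sample` data and the `decode_payload` helper are made up for illustration and stand in for the real response:

```python
import base64
import json

# Hypothetical stand-in for resp.text: a small JSON document, base64-encoded
# the way the ?format=TEXT response is assumed to be encoded.
sample = {"operators": [{"name": "Google", "logs": []}]}
encoded_text = base64.b64encode(json.dumps(sample).encode()).decode()

def decode_payload(text):
    """Decode a base64 string and parse the resulting bytes as JSON."""
    raw = base64.b64decode(text)
    return json.loads(raw)

obj = decode_payload(encoded_text)
print(obj["operators"][0]["name"])
```

Once the payload is decoded this way, `obj` is an ordinary Python dict, so no HTML parsing or XPath is needed.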

Then, you can use the following script to extract only the data you want:

import requests
import json
import base64

def extract_log(log):
    # Keep only the fields we care about from a single log entry.
    keys = ['description', 'log_id']
    return {key: log[key] for key in keys}

def extract_logs(logs):
    return [extract_log(log) for log in logs]

def extract_operator(operator):
    # Keep the operator name and the trimmed-down list of its logs.
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [extract_operator(operator) for operator in obj['operators']]

def scrape_certificates(url):
    resp = requests.get(url)
    # Fail early on an HTTP error instead of trying to decode an error page.
    resp.raise_for_status()
    # The ?format=TEXT response is base64-encoded, so decode before parsing.
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
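
If you want to check the filtering logic without hitting the network, a sketch like the following may help. The `sample` object is invented for illustration and only mirrors the layout the script above expects. The extra fields (`email`, `url`, `mmd`) are placeholders for the data you want to drop, and `slim_operators` is a hypothetical name. Unlike the script above, this variant skips missing keys instead of raising `KeyError`:

```python
import json

# Made-up sample mirroring the layout the script expects; the extra fields
# stand in for the unneeded data that should be dropped.
sample = {
    "operators": [
        {
            "name": "Google",
            "email": ["example@example.com"],
            "logs": [
                {
                    "description": "Google 'Argon2022' log",
                    "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=",
                    "url": "https://example.com/argon2022/",
                    "mmd": 86400,
                }
            ],
        }
    ]
}

WANTED_LOG_KEYS = ("description", "log_id")

def slim_operators(obj):
    """Keep each operator's name plus only the wanted keys of each log."""
    return [
        {
            "name": op["name"],
            "logs": [
                {key: log[key] for key in WANTED_LOG_KEYS if key in log}
                for log in op.get("logs", [])
            ],
        }
        for op in obj.get("operators", [])
    ]

result = slim_operators(sample)
print(json.dumps(result, indent=4))
```

Whether to tolerate missing keys or fail loudly is a design choice: the tolerant version keeps working if a log entry lacks a field, while the strict version in the script above surfaces unexpected changes in the file format right away.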