
Understanding the syntax of my json.loads/re.search call

I recently finished a web scraping/automation Zillow program for my boot camp. My instructor encouraged googling, as I was having issues with only being able to get the first couple of listings.

I stumbled upon this answer: Zillow web scraping using Selenium & BeautifulSoup

This worked well: instead of using bs4's find_all method, I was able to get all of my listings neatly laid out as JSON, which was much easier to go through to complete the project. I only recently learned about regex and the re module in Python, and I was wondering if someone could explain how this code retrieved the nicely structured JSON from the GET response, and whether this would work for other websites?

Code was:

self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))
  1. What arguments were passed to json.loads?
  2. How did the oddly written !--(\{"queryState".*?)--> work?
  3. What is the purpose of the .group(1)?

I hate just copying and pasting, but somehow this worked like magic and I'd like to know how to replicate it in future projects. Sorry if this is a loaded question, but the re.search documentation wasn't as helpful as I had hoped.

Solution:

  1. json.loads() needs only a single argument: a string that will be parsed as JSON. The return value is typically a dictionary or a list (depending on the JSON). Here, that single string is the return value of the call to .group(1).
  2. How is r'!--(\{"queryState".*?)-->' oddly written? It is a regular expression that is applied to self.response.text using re.search(). It looks for the literal !--, followed by something starting with {"queryState", up to the first --> that comes after it. The \ is there to indicate that the { is to be matched literally as well. The .*? means "any character, zero or more times, not greedily" (so the match stops at the first --> instead of running on to a later one). Note that . does not match a newline by default, so the JSON has to sit on a single line.
  3. .group(1) returns the first capture group of the match, which is whatever the first parenthesized part of the pattern matched. In this case, that is everything between !-- and -->, provided it starts with {"queryState".
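
The three points above can be seen by splitting the one-liner into separate steps. This is only a sketch: the html string is a made-up stand-in for self.response.text, not real Zillow output, and the greedy pattern at the end is added purely for comparison.

```python
import json
import re

# Hypothetical stand-in for self.response.text; the payload is made up.
html = ('<script><!--{"queryState": {"page": 1}, "results": [1, 2]}-->'
        '</script><!-- other -->')

pattern = r'!--(\{"queryState".*?)-->'

match = re.search(pattern, html)  # a Match object, or None if nothing matched
whole = match.group(0)            # the entire match, including !-- and -->
captured = match.group(1)         # only what the parentheses matched
data = json.loads(captured)       # JSON text (str) -> Python dict

print(whole)
print(captured)
print(data["results"])

# For comparison: a greedy .* runs on to the LAST --> in the text,
# so the captured string is no longer valid JSON.
greedy = re.search(r'!--(\{"queryState".*)-->', html).group(1)
print(greedy)
```

With the lazy .*? the captured text ends right after the closing brace, so json.loads() can parse it. The greedy variant also swallows the </script> tag and the start of the next comment, and json.loads() would raise a JSONDecodeError on that string.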

So, if self.response.text were this:

something
!--{"not queryState": 123}-->
something else
!--{"queryState": 123}-->
something else

Then running this:

self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))

Would set self.data to the dictionary {'queryState': 123} (a dict, not a string).
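
As for reusing this on other websites: the pattern only helps when a page happens to embed its JSON in a comment that starts with {"queryState", which is an assumption about the page and not something re or json guarantees. When nothing matches, re.search() returns None and the chained .group(1) fails with an AttributeError. A minimal sketch of a more defensive version, run on the sample text above (extract_query_state is a hypothetical helper name, not part of the original code):

```python
import json
import re

# Same pattern as in the question, compiled once for reuse.
PATTERN = re.compile(r'!--(\{"queryState".*?)-->')


def extract_query_state(text):
    """Return the parsed JSON, or None when the pattern is absent.

    Hypothetical helper, not part of the original scraper.
    """
    match = PATTERN.search(text)
    if match is None:  # search() returns None when nothing matches
        return None
    return json.loads(match.group(1))


sample = '''something
!--{"not queryState": 123}-->
something else
!--{"queryState": 123}-->
something else'''

data = extract_query_state(sample)
print(type(data).__name__, data)  # dict {'queryState': 123}

print(extract_query_state("no embedded JSON here"))  # None
```

The first comment in the sample is skipped because its text does not start with {"queryState", so the search moves on to the second one. For a different site, the literal prefix inside the parentheses would have to be adapted to whatever that page actually embeds.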
