Regex escape multiple quotation

I have a bunch of crawled endpoint results from a domain, parsed into JSON format. I've successfully processed the JSON to extract its endpoints using sed, and I want to merge all the sed commands into my Python crawl script. Here's the crawl output:

{'Results': [{'Result': {'IsDB': 'True', 'Spend': 367, 'Paths': [{'Technologies': [{'Categories': ['Japan hosting'], 'Name': 'Internet Initiative Japan','Link': 'https://www.iij.ad.jp'},{'Name': 'GlobalSign Domain Verification', 'Link': 'https://support.globalsign.com/customer/portal/articles/2167245-performing-domain-verification---dns-txt-record'}]}]}]}

The reason I use regex instead of jq to parse the JSON is that the JSON is sometimes invalid, and the single quotes ' ' raise other exceptions.

The problem is that I have to use the os module to execute the bash commands, and the sed commands themselves contain inner quotes. Here's the implementation:

temp = "testi.txt"
os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}")
os.system(f"sed -i 's/Tag/\\nTag/g' {temp}")
os.system(f"sed -i '/Tag/d' {temp}")
os.system(f"sed -i '/Result/d' {temp}")
os.system(f"sed -i 's/\', \'*$//' {temp}")
os.system(f"sed -i 's/^http:\/\///' {temp}")
os.system(f"sed -i 's/^https:\/\///' {temp}")
os.system(f"sed -i 's/\/.*//' {temp}")

As expected, these 3 sed commands break, resulting in yet another mess:

os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}") 
os.system(f"sed -i 's/\', \'*$//' {temp}")

I have tried escaping the newline \n and the inner quotes ', ' with a double backslash, but it didn't work.

's/http:\/\//\\nhttp:\/\//g'
's/https:\/\//\\nhttps:\/\//g' 
's/\\', \\'*$//'

Another reason to use sed is that the target file contains multiple lines, so I thought it would be easier to use sed than to read each line in Python. Any help would be appreciated…

>Solution :

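You seem to be looking simply for

import re
# text holds the crawl output; the sample quotes its values with ', so accept either quote
urls = re.findall(r"""(?<=['"])https?://[^'"]+(?=['"])""", text)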
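The sed pipeline in the question ultimately strips the scheme and the path, leaving one hostname per URL. A minimal sketch of that step in pure Python follows, assuming the crawl output has already been read into a string. The names `URL_RE`, `extract_hosts`, and `sample` are made up for illustration, and the pattern assumes URLs end at a quote, comma, or whitespace.

```python
import re
from urllib.parse import urlsplit

# Assumption: a URL runs until the next quote, comma, or whitespace character.
URL_RE = re.compile(r"https?://[^'\"\s,]+")


def extract_hosts(text):
    """Return the hostname of every http(s) URL found in text, in order."""
    hosts = []
    for url in URL_RE.findall(text):
        host = urlsplit(url).netloc  # drops the scheme and everything after the host
        if host:
            hosts.append(host)
    return hosts


# Shortened copy of the crawl output shown in the question.
sample = (
    "{'Name': 'Internet Initiative Japan','Link': 'https://www.iij.ad.jp'},"
    "{'Name': 'GlobalSign Domain Verification', 'Link': "
    "'https://support.globalsign.com/customer/portal/articles/"
    "2167245-performing-domain-verification---dns-txt-record'}"
)

print(extract_hosts(sample))
# ['www.iij.ad.jp', 'support.globalsign.com']
```

Because `re.findall` scans the whole string, a multi-line file needs no per-line loop: reading it once with `open(temp).read()` and passing the result to `extract_hosts` should be enough, and no shell quoting is involved.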