Escaping multiple quotation marks in a regex

I have a bunch of crawled endpoint results from a domain, parsed in JSON format. I've successfully processed the JSON to extract its endpoints using sed, and I want to merge all the sed commands into my Python crawl script. Here's the crawl output:

{'Results': [{'Result': {'IsDB': 'True', 'Spend': 367, 'Paths': [{'Technologies': [{'Categories': ['Japan hosting'], 'Name': 'Internet Initiative Japan','Link': 'https://www.iij.ad.jp'},{'Name': 'GlobalSign Domain Verification', 'Link': 'https://support.globalsign.com/customer/portal/articles/2167245-performing-domain-verification---dns-txt-record'}]}]}]}

The reason I use regex instead of jq to parse the JSON is that the JSON is sometimes invalid, and the single quotes ' ' raise further exceptions.

The problem is that I have to use the os module to execute shell commands, and the sed commands themselves contain inner quotes. Here's the implementation:

import os

temp = "testi.txt"
os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}")
os.system(f"sed -i 's/Tag/\\nTag/g' {temp}")
os.system(f"sed -i '/Tag/d' {temp}")
os.system(f"sed -i '/Result/d' {temp}")
os.system(f"sed -i 's/\', \'*$//' {temp}")
os.system(f"sed -i 's/^http:\/\///' {temp}")
os.system(f"sed -i 's/^https:\/\///' {temp}")
os.system(f"sed -i 's/\/.*//' {temp}")

As expected, these three sed commands are the ones that break, resulting in yet another mess:

os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}") 
os.system(f"sed -i 's/\', \'*$//' {temp}")

I have tried to escape the newline \n and the inner quotes ', ' with a double backslash, but it didn't work:

's/http:\/\//\\nhttp:\/\//g'
's/https:\/\//\\nhttps:\/\//g' 
's/\\', \\'*$//'

Another reason to use sed is that the target file contains multiple lines, so I thought it would be easier to use sed than to read each line in Python. Any help would be appreciated.
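
A likely reason the escaping attempts fail: Python processes backslash escapes in a normal string before the shell ever sees it, so \n becomes a real newline and \' becomes a plain ', which ends the shell's single-quoted string early. Inside shell single quotes there is no way to escape another single quote at all. The sketch below shows what the shell would receive, and one possible way around it: raw strings plus an argument list passed to subprocess.run, so no shell quoting is involved. The helper name sed_in_place is hypothetical, and GNU sed is assumed (for -i without a suffix and for \n in the replacement).

```python
import subprocess

temp = "testi.txt"

# What os.system would hand to the shell for the third failing command.
# Python turns \' into ', so the shell's single-quoted string ends early.
broken = f"sed -i 's/\', \'*$//' {temp}"
print(broken)  # sed -i 's/', '*$//' testi.txt

# A raw string keeps every backslash, so sed (not Python) interprets \n.
newline_script = r"s/http:\/\//\nhttp:\/\//g"

# Without a shell in between, the inner quotes need no escaping at all.
quote_script = "s/', '*$//"


def sed_in_place(script, path):
    """Hypothetical helper: run GNU sed on `path` without going through a shell."""
    # Each list item reaches sed as one argument, exactly as written.
    return subprocess.run(["sed", "-i", script, path], check=True)
```

With this, a call such as sed_in_place(quote_script, temp) would replace the corresponding os.system line.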

>Solution :

You seem to be looking simply for

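import re
# text is assumed to hold the crawl output as one string; the sample quotes
# values with ', so accept either quote character around the URL
urls = re.findall(r"""(?<=['"])https?://[^'"]+(?=['"])""", text)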
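
The sed chain in the question appears to end with bare hostnames (scheme stripped, everything after the first / removed). If that is the goal, a minimal sketch in pure Python could look like the following. The function name extract_hosts and the variable sample are made up for illustration, and the sample is a trimmed excerpt of the crawl output shown in the question. Because the regex scans the whole string, multi-line input needs no per-line loop.

```python
import re
from urllib.parse import urlsplit

# Trimmed excerpt of the crawl output, split over two lines on purpose.
sample = """{'Name': 'Internet Initiative Japan','Link': 'https://www.iij.ad.jp'},
{'Name': 'GlobalSign Domain Verification', 'Link': 'https://support.globalsign.com/customer/portal/articles/2167245-performing-domain-verification---dns-txt-record'}"""

# A URL runs until the next quote character or whitespace.
URL_RE = re.compile(r"""https?://[^'"\s]+""")


def extract_hosts(raw):
    """Return the hostnames of all http(s) URLs in `raw`, in order, without duplicates."""
    hosts = []
    for url in URL_RE.findall(raw):
        host = urlsplit(url).netloc  # drops the scheme and the path
        if host and host not in hosts:
            hosts.append(host)
    return hosts


print(extract_hosts(sample))  # ['www.iij.ad.jp', 'support.globalsign.com']
```

To run it on the file from the question, read the file once with open(temp) and pass the result of .read() to extract_hosts.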
