Extract keywords from links

I’m trying to extract the first 2 numbers in links like these:

https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ 
https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/
https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/

The output should be like this:

id1 = ['8406758680', '8945879094', '8493093053']
id2 = ['345386743', '849328844', '292494834']

I’m trying to do this using the re module.

Please, tell me how to do it.

This is the code snippet I have so far:

def GetUrlClassId(UrlInPut):
    ClassID = ''
    for i in UrlInPut:
        if i.isdigit():
            ClassID+=i
        elif ClassID !='':
            return int(ClassID)
    return ""

def GetUrlInstanceID(UrlInPut):
    InstanceId = ''
    ClassID = 0
    for i in UrlInPut:
        if i.isdigit() and ClassID==1:
            InstanceId+=i
        elif InstanceId !='':
            return int(InstanceId)
        if i == '-':
            ClassID+=1
    return ""

I don’t want to use something like this. I would like to use regular expressions.

Solution:

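The regex pattern is /(\d{10})-(\d{9}). The parentheses define capture groups for the two runs of digits, and {n} matches exactly n repetitions of the preceding token, so this pattern expects a slash, exactly 10 digits, a hyphen, and exactly 9 digits (see the re module documentation).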
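As a rough illustration of how the two capture groups behave on a single URL from the question (the variable names are made up for this example):

```python
import re

url = 'https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/'

# search() scans the string for the first place the pattern matches
match = re.search(r'/(\d{10})-(\d{9})', url)

first_id = match.group(1)   # text captured by (\d{10})
second_id = match.group(2)  # text captured by (\d{9})
print(first_id, second_id)
```

Both values come back as strings, which is what the desired output in the question uses.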

import re

# URLs separated by whitespace
urls = 'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/ https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/'

urls = urls.split()  # as a list

# Each match yields a (first_id, second_id) tuple of strings
ids = [re.search(r'/(\d{10})-(\d{9})', url).groups() for url in urls]

# zip(*ids) regroups the pairs into one tuple per ID position
print(list(zip(*ids)))

Output

[('8406758680', '8945879094', '8493093053'), ('345386743', '849328844', '292494834')]
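If you want the two separate lists shown in the question, and the digit counts are not always 10 and 9, a variation along these lines might work. This is only a sketch: the helper name extract_ids, the looser \d+ pattern, and the extra non-matching URL are assumptions added for illustration, not part of the answer above.

```python
import re

# Assumption: the IDs are the first two hyphen-separated digit runs after a slash.
# \d+ accepts any number of digits instead of exactly 10 and 9.
ID_PATTERN = re.compile(r'/(\d+)-(\d+)-')


def extract_ids(urls):
    """Return two lists: the first and second numeric IDs of each matching URL."""
    id1, id2 = [], []
    for url in urls:
        match = ID_PATTERN.search(url)
        if match is None:
            continue  # skip URLs that do not follow the expected layout
        id1.append(match.group(1))
        id2.append(match.group(2))
    return id1, id2


urls = [
    'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/',
    'https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/',
    'https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/',
    'https://primer.text.com/about/',  # hypothetical URL without IDs, silently skipped
]

id1, id2 = extract_ids(urls)
print(id1)
print(id2)
```

Checking for None before calling group() avoids the AttributeError that re.search(...).groups() raises when a URL does not match.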