Match pattern and put following lines into data structure

I have a data feed that I download on a regular basis into a CSV file. It looks like this:

TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005

I would like to parse the file and group the data into a dictionary, where the number after the hash is a unique ID (the dict key) and the rows that follow it (starting with >=) become a list of (volume, penalty) pairs.

Something like this would work:

{196712: [(10,0.002),(5,0.001),(2,0.0005),(1,0.0002),(0.5,0.0001),(0.2,0.0001),(0.1,0.0001),(0.0001, 0.0001)], 
 196714: [(0.0001,5e-05)], 
 196715: [(25,0.01),(10,0.005),(5,0.0025),(0.1,0.001),(0.0005,0.005)]}

Outside Python I would filter this with grep and take the lines that follow each match, but the number of lines between IDs varies, which makes that approach awkward. Any other, more convenient data structure would be fine as well.

Solution:

Try:

s = """\
TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005"""

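import re

out = {}
# Split the text into blocks: group 1 is the table id, group 2 is everything
# up to the next "TABLE" line or the end of the string.
for table, data in re.findall(
    r"^TABLE # (\d+).*?\n(.*?)(?=^TABLE|\Z)", s, flags=re.M | re.S
):
    table = int(table)
    # Within a block, capture each "volume : penalty" pair (handles 5e-05 too).
    for a, b in re.findall(r"([\de.+-]+)\s*:\s*([\de.+-]+)", data):
        out.setdefault(table, []).append((float(a), float(b)))

print(out)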

Prints (formatted here for readability):

{
    196712: [
        (10.0, 0.002),
        (5.0, 0.001),
        (2.0, 0.0005),
        (1.0, 0.0002),
        (0.5, 0.0001),
        (0.2, 0.0001),
        (0.1, 0.0001),
        (0.0001, 0.0001),
    ],
    196714: [(0.0001, 5e-05)],
    196715: [
        (25.0, 0.01),
        (10.0, 0.005),
        (5.0, 0.0025),
        (0.1, 0.001),
        (0.0005, 0.005),
    ],
}
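
Since the question mentions that the varying number of rows per ID is what makes grep awkward, a line-by-line pass that remembers the current table may be easier to follow than the regex above. The following is only a sketch under a few assumptions: the function name parse_tables is made up for illustration, every header line starts with "TABLE #", and every data row starts with ">=". Unlike the regex version, a table with no rows ends up with an empty list instead of being left out.

```python
def parse_tables(lines):
    """Group '>= volume : penalty' rows under the preceding TABLE id.

    parse_tables is a hypothetical helper name, not part of any library.
    """
    tables = {}
    current = None  # list of rows for the table currently being read
    for raw in lines:
        line = raw.strip()
        if line.startswith("TABLE #"):
            # "TABLE # 196712 / 9000_" -> 196712
            table_id = int(line.split("#", 1)[1].split("/", 1)[0])
            current = tables.setdefault(table_id, [])
        elif line.startswith(">=") and current is not None:
            # ">= 0.0001   : 5e-05" -> (0.0001, 5e-05)
            volume, penalty = line[2:].split(":", 1)
            current.append((float(volume), float(penalty)))
    return tables


sample = """\
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005"""

print(parse_tables(sample.splitlines()))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of sample.splitlines(), so the feed does not have to be read into one string first.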