Match pattern and put following lines into data structure

October 24, 2022

I have a data feed that I download on a regular bases into a csv. It looks like this

TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005

I would like to parse the file and categorize the data into a dictionary, where the number after the hash is a unique id (the new dict key) and the following rows (starting with >=) are volumes plus associated penalty values.

s.th like this would work:

{196712: [(10,0.002),(5,0.001),(2,0.0005),(1,0.0002),(0.5,0.0001),(0.2,0.0001),(0.1,0.0001),(0.0001, 0.0001)], 
 196714: [(0.0001,5e-05)], 
 196715: [(25,0.01),(10,0.005),(5,0.0025),(0.1,0.001),(0.0005,0.005)]}

What I would do to filter it outside python would be a grep and get the following lines, however the varying number of lines between IDs makes it more complex. Any other suggested more convenient data structure could be used as well.

>Solution :

Try:

s = """\
TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005"""

import re

out = {}
for table, data in re.findall(
    r"^TABLE # (\d+).*?\n(.*?)(?=^TABLE|\Z)", s, flags=re.M | re.S
):
    table = int(table)
    for a, b in re.findall(r"([\de.+-]+)\s*:\s*([\de.+-]+)", data):
        out.setdefault(table, []).append((float(a), float(b)))

print(out)

Prints:

{
    196712: [
        (10.0, 0.002),
        (5.0, 0.001),
        (2.0, 0.0005),
        (1.0, 0.0002),
        (0.5, 0.0001),
        (0.2, 0.0001),
        (0.1, 0.0001),
        (0.0001, 0.0001),
    ],
    196714: [(0.0001, 5e-05)],
    196715: [
        (25.0, 0.01),
        (10.0, 0.005),
        (5.0, 0.0025),
        (0.1, 0.001),
        (0.0005, 0.005),
    ],
}