I have a data feed that I download on a regular bases into a csv. It looks like this
TABLE # 196712 / 9000_
>= 10 : 0.002
>= 5 : 0.001
>= 2 : 0.0005
>= 1 : 0.0002
>= 0.5 : 0.0001
>= 0.2 : 0.0001
>= 0.1 : 0.0001
>= 0.0001 : 0.0001
TABLE # 196714 / Dark
>= 0.0001 : 5e-05
TABLE # 196715 / GBD
>= 25 : 0.01
>= 10 : 0.005
>= 5 : 0.0025
>= 0.1 : 0.001
>= 0.0005 : 0.005
I would like to parse the file and categorize the data into a dictionary, where the number after the hash is a unique id (the new dict key) and the following rows (starting with >=) are volumes plus associated penalty values.
s.th like this would work:
{196712: [(10,0.002),(5,0.001),(2,0.0005),(1,0.0002),(0.5,0.0001),(0.2,0.0001),(0.1,0.0001),(0.0001, 0.0001)],
196714: [(0.0001,5e-05)],
196715: [(25,0.01),(10,0.005),(5,0.0025),(0.1,0.001),(0.0005,0.005)]}
What I would do to filter it outside python would be a grep and get the following lines, however the varying number of lines between IDs makes it more complex. Any other suggested more convenient data structure could be used as well.
>Solution :
Try:
s = """\
TABLE # 196712 / 9000_
>= 10 : 0.002
>= 5 : 0.001
>= 2 : 0.0005
>= 1 : 0.0002
>= 0.5 : 0.0001
>= 0.2 : 0.0001
>= 0.1 : 0.0001
>= 0.0001 : 0.0001
TABLE # 196714 / Dark
>= 0.0001 : 5e-05
TABLE # 196715 / GBD
>= 25 : 0.01
>= 10 : 0.005
>= 5 : 0.0025
>= 0.1 : 0.001
>= 0.0005 : 0.005"""
import re
out = {}
for table, data in re.findall(
r"^TABLE # (\d+).*?\n(.*?)(?=^TABLE|\Z)", s, flags=re.M | re.S
):
table = int(table)
for a, b in re.findall(r"([\de.+-]+)\s*:\s*([\de.+-]+)", data):
out.setdefault(table, []).append((float(a), float(b)))
print(out)
Prints:
{
196712: [
(10.0, 0.002),
(5.0, 0.001),
(2.0, 0.0005),
(1.0, 0.0002),
(0.5, 0.0001),
(0.2, 0.0001),
(0.1, 0.0001),
(0.0001, 0.0001),
],
196714: [(0.0001, 5e-05)],
196715: [
(25.0, 0.01),
(10.0, 0.005),
(5.0, 0.0025),
(0.1, 0.001),
(0.0005, 0.005),
],
}