Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to convert text into structured data, taking into account missing fields, in Python?

First, apologies if this sounds too basic. I have the following semi-structured data in text format, I need to parse these into a structured format:
example:

Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black

**Note that Alex has two colors, while Bob does not have an address. **

I want something that looks like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

example output

I think the right way is using regular expressions, but I’m struggling to split the text properly since some fields may be missing. What’s a proper clean way to do this?

text='Name\nAlex\n\nAddress\n14 high street\nLondon\n\nColor\nblue\nred\n\nName\nBob\nColor\nblack'

profiles=re.split('(Name\n)', text, flags=re.IGNORECASE)
for profile in profiles:
    #get name
    name=re.split('(Name\n)|(Address\n)|(Color\n)', profile.strip(), flags=re.IGNORECASE)[0]
    print(name)
    
    #get address

    #get color

>Solution :

Try:

s = """\
Name
Alex

Address
14 high street
London

Color
blue
red

Name

Bob
Color
black"""


import pandas as pd
from itertools import groupby

colnames = ["Name", "Address", "Color"]


col1, col2 = [], []
for k, g in groupby(
    (l for l in s.splitlines() if l.strip()), lambda l: l in colnames
):
    (col2, col1)[k].append(" ".join(g))

df = pd.DataFrame({"col1": col1, "col2": col2})
df = df.assign(col3=df.col1.eq("Name").cumsum()).pivot(
    index="col3", columns="col1", values="col2"
)
df.index.name, df.columns.name = None, None


df["Color"] = df["Color"].str.split()
df = df.explode("Color").fillna("")

print(df[colnames])

Prints:

   Name                Address  Color
1  Alex  14 high street London   blue
1  Alex  14 high street London    red
2   Bob                         black
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading