Python: filtering and extracting text

I am fairly new to coding and recently came across a problem I wanted to try solving with Python. Below is the text content I want to query, extracting certain fields into a new file. The content is repetitive and can run to several thousand lines. Currently I can only parse and output the first two columns, and even those look wrong. I would appreciate some guidance. Cheers!

Original TXT File:

Classroom arrangement : 1A-1

(Student Name: Jess, Subject: EC001, Time: 9am – 10am)

(Student Name: Whit, Subject: EC001, Time: 9am – 10am)

(Student Name: Jon, Subject: EC0011, Time: 11am – 12pm)

(Student Name: Kevin, Subject: EC011, Time: 11am – 12pm)

(Student Name: Jess, Subject: EC011, Time: 11am – 12pm)

Classroom arrangement : 1A-2

(Student Name: Jess, Subject: EC002, Time: 11am – 12pm)

(Student Name: Whit, Subject: EC002, Time: 11am – 12pm)

(Student Name: Jon, Subject: EC002, Time: 11am – 12pm)

(Student Name: Kevin, Subject: EC002, Time: 11am – 12pm)

(Student Name: Claire, Subject: EC011, Time: 2pm – 3pm)

(Student Name: Joshua, Subject: EC0011, Time: 2pm – 3pm)

(Student Name: Florence, Subject: EC011, Time: 2pm – 3pm)

(Student Name: Neil, Subject: EC011, Time: 2am – 3pm)

Intended Output:

Classroom: 1A-1, Jess, Subject: EC001 Time: 9am – 10am, Subject: EC011, Time: 11am – 12pm

Classroom: 1A-1, Whit, Subject: EC001 Time: 9am – 10am

Classroom: 1A-1, Jon, Subject: EC0011 Time: 11am – 12pm

Classroom: 1A-1, Kevin, Subject: EC011 Time: 11am – 12pm

Classroom: 1A-2, Jess, Subject: EC002 Time: 11am – 12pm

Classroom: 1A-2, Jon, Subject: EC002, Time: 11am – 12pm

Classroom: 1A-2, Whit, Subject: EC002 Time: 11am – 12pm

Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am – 12pm

Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm – 3pm

Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm – 3pm

Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm – 3pm

Classroom: 1A-2, Neil, Subject: EC011, Time: 2am – 3pm

I tried processing the readlines output before printing it to the console, but the result is wrong because I need the classroom (e.g. 1A-1) at the start of each line.

Current Output:

Class 1A-1

Jess, Subject: EC001 Time: 9am – 10am

Jess, Subject: EC011, Time: 11am – 12pm

Whit, Subject: EC001 Time: 9am – 10am

Jon, Subject: EC0011 Time: 11am – 12pm

Kevin, Subject: EC011 Time: 11am – 12pm

Solution:

Here’s your solution. You’ll need to tweak the values for the input/output filepaths:

classroom.py

import collections


def ingest(infilepath):
    """
    Read all the input from the input file.
    Store it in a dictionary so that we can parse it out later.
    We'll use a collections.defaultdict to make life easier
        {classroom name: {student name: [classes...]} }
            key'd by student name since a student can have multiple courses in a classroom
    """
    answer = collections.defaultdict(lambda: collections.defaultdict(list))
    with open(infilepath) as infile:
        classes = infile.read().split('\n\n')  # divide the input into blocks of classrooms
        classes = [c.strip() for c in classes if c.strip()]  # strip extra whitespace and drop empty blocks (e.g. from leading blank lines)

    for classblock in classes:
        name, *records = classblock.splitlines()  # student records per classroom
        name = name.split(':',1)[-1].strip()
        for record in records:
            record = record.replace("(", "").replace(")", '')  # strip out the "()". We don't need that
            kvs = record.split(',')

            d = dict(kv.split(":", 1) for kv in kvs)  # split on the first ":" only, so a value may itself contain a colon
            d = {k.strip():v.strip() for k,v in d.items()}

            answer[name][d['Student Name']].append(d)

    return answer


def output(outfilepath, data):
    order = ("Subject", "Time")  # the order in which we want to write the output
    with open(outfilepath, 'w') as outfile:
        for classname, students in data.items():
            for studentname, records in students.items():
                outfile.write(f"Classroom: {classname}, {studentname}, ")
                out = []  # maintain the line output in a list. We'll join everything up later
                for record in records:
                    for k in order:
                        out.append(f"{k}: {record[k]}, ")

                out = ''.join(out)  # this is the file output
                out = out.strip().rstrip(',')  # strip out the trailing ','
                outfile.write(f'{out}\n')


if __name__ == "__main__":
    print('starting')

    data = ingest('path/to/input/file')
    output('path/to/output/file', data)

    print('done')
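
The nested collections.defaultdict that ingest builds may be easier to follow in isolation. Below is a minimal sketch of that structure; the variable names (schedule, jess_subjects, student_order) are made up for illustration and the records are copied from the sample data:

```python
import collections

# Two-level mapping: classroom -> student -> list of course records.
# Missing keys are created on first access, so no "if key not in dict" checks are needed.
schedule = collections.defaultdict(lambda: collections.defaultdict(list))

schedule["1A-1"]["Jess"].append({"Subject": "EC001", "Time": "9am - 10am"})
schedule["1A-1"]["Jess"].append({"Subject": "EC011", "Time": "11am - 12pm"})
schedule["1A-1"]["Whit"].append({"Subject": "EC001", "Time": "9am - 10am"})

# Dictionaries keep insertion order (Python 3.7+), so students come back in first-seen order.
jess_subjects = [rec["Subject"] for rec in schedule["1A-1"]["Jess"]]
student_order = list(schedule["1A-1"])
```

Because both of Jess's records land in the same list, writing one line per (classroom, student) pair naturally puts both subjects on a single line.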

I used this input (notice the blank lines at the start of the file):



Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jon, Subject: EC0011, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC011, Time: 11am - 12pm)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)


Classroom arrangement : 1A-2
(Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
(Student Name: Whit, Subject: EC002, Time: 11am - 12pm)
(Student Name: Jon, Subject: EC002, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC002, Time: 11am - 12pm)
(Student Name: Claire, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Joshua, Subject: EC0011, Time: 2pm - 3pm)
(Student Name: Florence, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Neil, Subject: EC011, Time: 2am - 3pm)

I got this output:

Classroom: 1A-1, Jess, Subject: EC001, Time: 9am - 10am, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-1, Whit, Subject: EC001, Time: 9am - 10am
Classroom: 1A-1, Jon, Subject: EC0011, Time: 11am - 12pm
Classroom: 1A-1, Kevin, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-2, Jess, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Whit, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Jon, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm - 3pm
Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Neil, Subject: EC011, Time: 2am - 3pm
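
If the real file looks like the text in the question, with a blank line between every record, splitting on blank lines would put each record in its own block. As an alternative to the block-splitting approach above, here is a rough line-by-line sketch that remembers the most recent classroom header instead. The regular expressions and the function names group_records and format_rows are assumptions written against the sample text, not a tested parser for every variation of the file:

```python
import re
from collections import defaultdict

# Hypothetical patterns, written against the sample text in the question.
CLASSROOM_RE = re.compile(r"^Classroom arrangement\s*:\s*(.+)$")
RECORD_RE = re.compile(
    r"^\(Student Name:\s*(?P<name>[^,]+),"
    r"\s*Subject:\s*(?P<subject>[^,]+),"
    r"\s*Time:\s*(?P<time>[^)]+)\)$"
)


def group_records(lines):
    """Return {classroom: {student: ["Subject: ..., Time: ...", ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    current = None  # most recent classroom header seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        header = CLASSROOM_RE.match(line)
        if header:
            current = header.group(1).strip()
            continue
        record = RECORD_RE.match(line)
        if record and current is not None:
            entry = f"Subject: {record['subject'].strip()}, Time: {record['time'].strip()}"
            grouped[current][record["name"].strip()].append(entry)
    return grouped


def format_rows(grouped):
    """Build one output row per (classroom, student) pair."""
    return [
        f"Classroom: {classroom}, {student}, {', '.join(entries)}"
        for classroom, students in grouped.items()
        for student, entries in students.items()
    ]


sample = """Classroom arrangement : 1A-1

(Student Name: Jess, Subject: EC001, Time: 9am – 10am)

(Student Name: Whit, Subject: EC001, Time: 9am – 10am)

(Student Name: Jess, Subject: EC011, Time: 11am – 12pm)
"""
rows = format_rows(group_records(sample.splitlines()))
```

Lines that match neither pattern are skipped silently in this sketch; for a real file it may be worth logging them so malformed records are not lost unnoticed.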

Hope this helps
