Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Convert a text file with different key value pairs to a csv file

I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don’t know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.

Sample file:

name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213

Desired output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Name | dob        | address                  | mobile       | telephone number | state      |
-----+------------+--------------------------+--------------+------------------+------------+
john | 10-06-1960 | 4853 Radio Park Drive    |              |                  |            |
jane | 07-10-1973 | 1537 Timbercrest Road    | 706-289-6746 |                  |            |
liam | 12-08-1986 | 4853 498 Fairmont Avenue |              | 706-687-5021     |            |
chris| 09-12-1965 | 485 Green Avenue         |              | 510-855-5213     | California |

My Code:

import pandas as pd

df = pd.DataFrame()

file = open('D:\sample.log', 'r')
        
lines = file.readlines()

for line in lines:

    pairs = line.split('|')

    my_dict = {}

    for pair in pairs:
       key = pair.split(': ')[0].strip()
       value = pair.split(': ')[1].strip()
            
       my_dict[key] = value

       df.append(my_dict, ignore_index=True)

This way of appending is very slow. How can I make this faster.

Or is there a much better solution (for example via a json string)?

>Solution :

Assuming you already split your data into lines you then need to process them into records such as:

{' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}

Each line needs to be split into fields:

>>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
>>> line.split('|')
['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']

Then each field needs to be split into the name of the column and the value itself:

>>> field = 'name: John'
>>> field.split(': ')
['name', 'john']

Produce these using a generator so they are passed to the DataFrame constructor as soon as they are processed:

pd.DataFrame.from_records(
    dict(field.split(': ') for field in line.split('|'))
    for line in lines
)

This produces the following output:

    name         dob                   address        mobile  telephone number       state  Telephone Number
0   john  10-06-1960     4853 Radio Park Drive           NaN               NaN         NaN               NaN
1   jane  07-10-1973     1537 Timbercrest Road  706-289-6746               NaN         NaN               NaN
2   liam  12-08-1986  4853 498 Fairmont Avenue           NaN      706-687-5021         NaN               NaN
3  chris  09-12-1965          485 Green Avenue           NaN               NaN  California      510-855-5213
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading