Convert a text file with different key value pairs to a csv file

February 17, 2022

I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don’t know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.

Sample file:

name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213

Desired output:

Name | dob        | address                  | mobile       | telephone number | state      |
-----+------------+--------------------------+--------------+------------------+------------+
john | 10-06-1960 | 4853 Radio Park Drive    |              |                  |            |
jane | 07-10-1973 | 1537 Timbercrest Road    | 706-289-6746 |                  |            |
liam | 12-08-1986 | 4853 498 Fairmont Avenue |              | 706-687-5021     |            |
chris| 09-12-1965 | 485 Green Avenue         |              | 510-855-5213     | California |

My Code:

import pandas as pd

df = pd.DataFrame()

file = open('D:\sample.log', 'r')
        
lines = file.readlines()

for line in lines:

    pairs = line.split('|')

    my_dict = {}

    for pair in pairs:
       key = pair.split(': ')[0].strip()
       value = pair.split(': ')[1].strip()
            
       my_dict[key] = value

       df.append(my_dict, ignore_index=True)

This way of appending is very slow. How can I make this faster.

Or is there a much better solution (for example via a json string)?

>Solution :

Assuming you already split your data into lines you then need to process them into records such as:

{' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}

Each line needs to be split into fields:

>>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
>>> line.split('|')
['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']

Then each field needs to be split into the name of the column and the value itself:

>>> field = 'name: John'
>>> field.split(': ')
['name', 'john']

Produce these using a generator so they are passed to the DataFrame constructor as soon as they are processed:

pd.DataFrame.from_records(
    dict(field.split(': ') for field in line.split('|'))
    for line in lines
)

This produces the following output:

    name         dob                   address        mobile  telephone number       state  Telephone Number
0   john  10-06-1960     4853 Radio Park Drive           NaN               NaN         NaN               NaN
1   jane  07-10-1973     1537 Timbercrest Road  706-289-6746               NaN         NaN               NaN
2   liam  12-08-1986  4853 498 Fairmont Avenue           NaN      706-687-5021         NaN               NaN
3  chris  09-12-1965          485 Green Avenue           NaN               NaN  California      510-855-5213