I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don’t know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.
Sample file:
name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213
Desired output:
Name | dob | address | mobile | telephone number | state |
-----+------------+--------------------------+--------------+------------------+------------+
john | 10-06-1960 | 4853 Radio Park Drive | | | |
jane | 07-10-1973 | 1537 Timbercrest Road | 706-289-6746 | | |
liam | 12-08-1986 | 4853 498 Fairmont Avenue | | 706-687-5021 | |
chris| 09-12-1965 | 485 Green Avenue | | 510-855-5213 | California |
My Code:
import pandas as pd
df = pd.DataFrame()
file = open('D:\sample.log', 'r')
lines = file.readlines()
for line in lines:
pairs = line.split('|')
my_dict = {}
for pair in pairs:
key = pair.split(': ')[0].strip()
value = pair.split(': ')[1].strip()
my_dict[key] = value
df.append(my_dict, ignore_index=True)
This way of appending is very slow. How can I make this faster.
Or is there a much better solution (for example via a json string)?
>Solution :
Assuming you already split your data into lines you then need to process them into records such as:
{' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}
Each line needs to be split into fields:
>>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
>>> line.split('|')
['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']
Then each field needs to be split into the name of the column and the value itself:
>>> field = 'name: John'
>>> field.split(': ')
['name', 'john']
Produce these using a generator so they are passed to the DataFrame constructor as soon as they are processed:
pd.DataFrame.from_records(
dict(field.split(': ') for field in line.split('|'))
for line in lines
)
This produces the following output:
name dob address mobile telephone number state Telephone Number
0 john 10-06-1960 4853 Radio Park Drive NaN NaN NaN NaN
1 jane 07-10-1973 1537 Timbercrest Road 706-289-6746 NaN NaN NaN
2 liam 12-08-1986 4853 498 Fairmont Avenue NaN 706-687-5021 NaN NaN
3 chris 09-12-1965 485 Green Avenue NaN NaN California 510-855-5213