Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Convert TSV file data to a dataframe, which can be pushed to database

We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like

  1. read the file
  2. Add new column names
  3. do transpose
  4. reindex

This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data

tsv file data looks as below

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1

Want to project as table like structure

col1 col2 col3 col4 col5
qweqweq 345435 01/01/2024 35:08:09 1 0
qweqweq 345435 01/01/2024 35:08:09 0 0
qweqweq 345435 01/01/2024 35:08:09 1 1
qweqweq 345435 01/01/2024 35:08:09 0 1

col4 and col5 can differ in each IOT file. How to achieve with python, pandas?

>Solution :

Assuming you can rely on "col1" to define the groups, you can use a pivot after de-duplicating the rows with cumsum and groupby.cumcount, and groupby.ffill:

df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
        .assign(index=lambda x: x[0].eq('col1').cumsum(),
                n=lambda x: x.groupby(['index', 0]).cumcount())
        .pivot(index=['index', 'n'], columns=0, values=1)
        .groupby(level='index').ffill()
        .reset_index(drop=True).rename_axis(columns=None)
     )

Output:

      col1    col2                 col3 col4 col5
0  qweqweq  345435  01/01/2024 35:08:09    1    0
1  qweqweq  345435  01/01/2024 35:08:09    0    0
2  qweqweq  345435  01/01/2024 35:08:09    1    1
3  qweqweq  345435  01/01/2024 35:08:09    0    1

Reproducible input:

import io

input_file = io.StringIO('''col1\tqweqweq
col2\t345435
col3\t01/01/2024 35:08:09
col4\t1
col5\t0
col4\t0
col5\t0
col4\t1
col5\t1
col4\t0
col5\t1''')

Intermediates:

# before pivot
       0                    1  index  n
0   col1              qweqweq      1  0
1   col2               345435      1  0
2   col3  01/01/2024 35:08:09      1  0
3   col4                    1      1  0
4   col5                    0      1  0
5   col4                    0      1  1
6   col5                    0      1  1
7   col4                    1      1  2
8   col5                    1      1  2
9   col4                    0      1  3
10  col5                    1      1  3

# before the cleanup-step:
0           col1    col2                 col3 col4 col5
index n                                                
1     0  qweqweq  345435  01/01/2024 35:08:09    1    0
      1  qweqweq  345435  01/01/2024 35:08:09    0    0
      2  qweqweq  345435  01/01/2024 35:08:09    1    1
      3  qweqweq  345435  01/01/2024 35:08:09    0    1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading