We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like
- read the file
- Add new column names
- do transpose
- reindex
This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data
tsv file data looks as below
col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1
Want to project as table like structure
| col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|
| qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 0 |
| qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 0 |
| qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 1 |
| qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 1 |
col4 and col5 can differ in each IOT file. How to achieve with python, pandas?
>Solution :
Assuming you can rely on "col1" to define the groups, you can use a pivot after de-duplicating the rows with cumsum and groupby.cumcount, and groupby.ffill:
df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
.assign(index=lambda x: x[0].eq('col1').cumsum(),
n=lambda x: x.groupby(['index', 0]).cumcount())
.pivot(index=['index', 'n'], columns=0, values=1)
.groupby(level='index').ffill()
.reset_index(drop=True).rename_axis(columns=None)
)
Output:
col1 col2 col3 col4 col5
0 qweqweq 345435 01/01/2024 35:08:09 1 0
1 qweqweq 345435 01/01/2024 35:08:09 0 0
2 qweqweq 345435 01/01/2024 35:08:09 1 1
3 qweqweq 345435 01/01/2024 35:08:09 0 1
Reproducible input:
import io
input_file = io.StringIO('''col1\tqweqweq
col2\t345435
col3\t01/01/2024 35:08:09
col4\t1
col5\t0
col4\t0
col5\t0
col4\t1
col5\t1
col4\t0
col5\t1''')
Intermediates:
# before pivot
0 1 index n
0 col1 qweqweq 1 0
1 col2 345435 1 0
2 col3 01/01/2024 35:08:09 1 0
3 col4 1 1 0
4 col5 0 1 0
5 col4 0 1 1
6 col5 0 1 1
7 col4 1 1 2
8 col5 1 1 2
9 col4 0 1 3
10 col5 1 1 3
# before the cleanup-step:
0 col1 col2 col3 col4 col5
index n
1 0 qweqweq 345435 01/01/2024 35:08:09 1 0
1 qweqweq 345435 01/01/2024 35:08:09 0 0
2 qweqweq 345435 01/01/2024 35:08:09 1 1
3 qweqweq 345435 01/01/2024 35:08:09 0 1