1

We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like

  1. read the file
  2. Add new column names
  3. do transpose
  4. reindex

This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data

tsv file data looks as below

col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1

Want to project as table like structure

col1 col2 col3 col4 col5
qweqweq 345435 01/01/2024 35:08:09 1 0
qweqweq 345435 01/01/2024 35:08:09 0 0
qweqweq 345435 01/01/2024 35:08:09 1 1
qweqweq 345435 01/01/2024 35:08:09 0 1

col4 and col5 can differ in each IOT file. How to achieve with python, pandas?

2
  • It is CSV or TSV? Commented Jul 9 at 11:19
  • TSV, Tab separated files Commented Jul 9 at 11:22

1 Answer 1

1

Assuming you can rely on "col1" to define the groups, you can use a pivot after de-duplicating the rows with cumsum and groupby.cumcount, and groupby.ffill:

df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
        .assign(index=lambda x: x[0].eq('col1').cumsum(),
                n=lambda x: x.groupby(['index', 0]).cumcount())
        .pivot(index=['index', 'n'], columns=0, values=1)
        .groupby(level='index').ffill()
        .reset_index(drop=True).rename_axis(columns=None)
     )

Output:

      col1    col2                 col3 col4 col5
0  qweqweq  345435  01/01/2024 35:08:09    1    0
1  qweqweq  345435  01/01/2024 35:08:09    0    0
2  qweqweq  345435  01/01/2024 35:08:09    1    1
3  qweqweq  345435  01/01/2024 35:08:09    0    1

Reproducible input:

import io

input_file = io.StringIO('''col1\tqweqweq
col2\t345435
col3\t01/01/2024 35:08:09
col4\t1
col5\t0
col4\t0
col5\t0
col4\t1
col5\t1
col4\t0
col5\t1''')

Intermediates:

# before pivot
       0                    1  index  n
0   col1              qweqweq      1  0
1   col2               345435      1  0
2   col3  01/01/2024 35:08:09      1  0
3   col4                    1      1  0
4   col5                    0      1  0
5   col4                    0      1  1
6   col5                    0      1  1
7   col4                    1      1  2
8   col5                    1      1  2
9   col4                    0      1  3
10  col5                    1      1  3

# before the cleanup-step:
0           col1    col2                 col3 col4 col5
index n                                                
1     0  qweqweq  345435  01/01/2024 35:08:09    1    0
      1  qweqweq  345435  01/01/2024 35:08:09    0    0
      2  qweqweq  345435  01/01/2024 35:08:09    1    1
      3  qweqweq  345435  01/01/2024 35:08:09    0    1

Not the answer you're looking for? Browse other questions tagged or ask your own question.