Extracting columns and skipping certain rows in a file for data processing

June 5, 2023

I am trying to process input.txt with the test.py script to extract specific information, as shown in the expected output below. I have a basic stub, but the regex is not extracting the column details I expect.

In general, I am looking for a [XXXYY] {TAG} pattern. Once I find that pattern, if the first column of a line below it starts with J, I want to extract column 1, column 2 and the first 3 characters of column 3. I would also like to know how to skip the lines after [00033] GND (and after [00272] POS_3V3) until the next [XXXYY] {TAG} pattern. I am restricted to Python 2.7.5 with the re and csv libraries and cannot use pandas.

input.txt

<<< Test List >>>
Mounting Hole                   MH1            APBC_MH_3.2x7cm
Mounting Hole                   MH2            APBC_MH_3.2x7cm
Mounting Hole                   MH3            APBC_MH_3.2x7cm
Mounting Hole                   MH4            APBC_MH_3.2x7cm

[00001] DEBUG_SCAR_RX
        J1         B30     PIO37          PASSIVE     TRA6-70-01.7-R-4-7-F-UG
        R2         2       2              PASSIVE     4.7kR

[00002] DEBUG_SCAR_TX
        J1         B29     PIO36          PASSIVE     TRA6-70-01.7-R-4-7-F-UG

[00003] DYOR_DAT_0
        J2         B12     APB10_CC_P     PASSIVE     TRA6-70-01.7-R-4-7-F-UG

[00033] GND
        DP1        5       5              PASSIVE     MECH, DIP_SWITCH, FFFN-04F-V
        DP1        6       6              PASSIVE     MECH, DIP_SWITCH, FFFN-04F-V
        DP1        7       7              PASSIVE     MECH, DIP_SWITCH, FFFN-04F-V

[00271] POS_3.3V_INH
        Q2         3       DRAIN          PASSIVE     2N7002
        R34        2       2              PASSIVE     4.7kR

[00272] POS_3V3
        J1         B13     FETO_FAT       PASSIVE     TRA6-70-01.7-R-4-7-F-UG
        J1         B14     FETO_FAT       PASSIVE     TRA6-70-01.7-R-4-7-F-UG
        J2         B59     FETO_HDB       PASSIVE     TRA6-70-01.7-R-4-7-F-UG

test.py

import re

# Read the input file
with open('input.txt', 'r') as file:
    content = file.readlines()

# Process the data and extract the required information
result = []
component_name = ""
for line in content:
    line = line.strip()
    if line.startswith("["):
        s = re.sub(r"([\[0-9]+\]) (\w+)$", r"\2", line)
    elif line.startswith("J"):
        sp = re.sub(r"^(\w+)\s+(\w+)\s+(\w+)", r"\1\2", line)
        print("%s\t%s" % (s, sp))

output

DEBUG_SCAR_RX   J1B30          PASSIVE     TRA6-70-01.7-R-4-7-F-UG
DEBUG_SCAR_TX   J1B29          PASSIVE     TRA6-70-01.7-R-4-7-F-UG
DYOR_DAT_0  J2B12     PASSIVE     TRA6-70-01.7-R-4-7-F-UG
POS_3V3 J1B13       PASSIVE     TRA6-70-01.7-R-4-7-F-UG
POS_3V3 J1B14       PASSIVE     TRA6-70-01.7-R-4-7-F-UG
POS_3V3 J2B59       PASSIVE     TRA6-70-01.7-R-4-7-F-UG

expected

DEBUG_SCAR_RX   J1 B30 PIO
DEBUG_SCAR_TX   J1 B29 PIO
DYOR_DAT_0  J2 B12 APB

>Solution :

Maybe you can use:

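import re

# Only rows under these tags are kept; other blocks (e.g. GND, POS_3V3) are skipped.
TAGS = ['DEBUG_SCAR_RX', 'DEBUG_SCAR_TX', 'DYOR_DAT_0']

data = []
tag = None  # most recent [XXXYY] {TAG} header seen, if any
with open('input.txt') as file:
    for row in file:
        row = row.strip()
        if row.startswith('['):
            # Header line such as "[00001] DEBUG_SCAR_RX": keep the text after "]"
            tag = row.split(']')[1].strip()
        elif row == '':
            continue
        else:
            cols = re.split(r'\s+', row)
            if cols[0].startswith('J') and tag in TAGS:
                # Column 1, column 2 and the first 3 characters of column 3
                data.append([tag, cols[0], cols[1], cols[2][:3]])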

Output:

# '2.7.18 (default, Jan 23 2023, 08:22:06) \n[GCC 12.2.0]'
>>> data
[['DEBUG_SCAR_RX', 'J1', 'B30', 'PIO'],
 ['DEBUG_SCAR_TX', 'J1', 'B29', 'PIO'],
 ['DYOR_DAT_0', 'J2', 'B12', 'APB']]
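
The code above keeps a whitelist of tags. If you would rather name the blocks to drop (GND and POS_3V3 in the question) and keep everything else, something like the following might work. It is only a sketch: the names SKIP_TAGS, HEADER and extract_rows are hypothetical, and it assumes every header line looks like "[digits] TAG" and that data lines are whitespace-separated. It sticks to re and syntax that Python 2.7 accepts.

```python
import re

# Hypothetical names for illustration; not part of the original answer.
SKIP_TAGS = set(['GND', 'POS_3V3'])       # blocks to drop entirely
HEADER = re.compile(r'^\[(\d+)\]\s+(\S+)')  # assumed header shape: "[00001] TAG"


def extract_rows(lines, skip_tags=SKIP_TAGS):
    """Return [tag, col1, col2, first 3 chars of col3] for J* rows."""
    rows = []
    tag = None  # None means "no header yet" or "inside a skipped block"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            name = match.group(2)
            # Entering a new block: remember its tag unless it is skipped
            tag = None if name in skip_tags else name
            continue
        if tag is None:
            continue  # still before the first header or in a skipped block
        cols = line.split()
        if len(cols) >= 3 and cols[0].startswith('J'):
            rows.append([tag, cols[0], cols[1], cols[2][:3]])
    return rows


sample = [
    "<<< Test List >>>",
    "[00001] DEBUG_SCAR_RX",
    "        J1         B30     PIO37          PASSIVE     TRA6-70-01.7-R-4-7-F-UG",
    "        R2         2       2              PASSIVE     4.7kR",
    "",
    "[00033] GND",
    "        DP1        5       5              PASSIVE     MECH, DIP_SWITCH, FFFN-04F-V",
    "",
    "[00272] POS_3V3",
    "        J1         B13     FETO_FAT       PASSIVE     TRA6-70-01.7-R-4-7-F-UG",
]
rows = extract_rows(sample)
for r in rows:
    print("%s\t%s" % (r[0], " ".join(r[1:])))
```

To run it on the real file, pass the open file object instead of sample, e.g. extract_rows(open('input.txt')). If you need a file rather than printed lines, csv.writer with delimiter='\t' should do; note that on Python 2.7 the csv module expects the output file to be opened in binary mode ('wb').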