Regex parsing of text file

I have a large text file which contains information as follows:

0 / END OF ONE DATA, BEGIN SECOND DATA
361,315,0,'1 ',1,1,1,0,0,2,'NAT1    ',1,1115,1,0,0,0,0,0,0
0.0055501,0.12595,100
1,69,0,100,100,100,1,36,1.1,0.9,1.04283,1.001283,33,0,0,0,    /*[name1 ]*/
0.975,138
481,417,0,'1 ',1,1,1,0,0,2,'KAT1    ',1,115,1,0,0,0,0,0,0
0.00762817,0.14163,60
1,69,0,60,60,60,1,48,1.1,0.9,1.011735,0.917735,33,0,0,0,    /*[name2 ]*/
0 / END OF SECOND DATA, BEGIN THIRD DATA

I want to get the following in a dataframe:

name1
name2

I tried the following:

import os, pandas as pd
from io import StringIO
fn = r'C:\Users\asdert\Downloads\Network.RAW'
file = open(fn)
line = file.read()  # .replace("\n", "$$$$$")
file.close()
start = line.find('END OF ONE DATA, BEGIN SECOND DATA') + 1
end = line.find('END OF SECOND DATA, BEGIN THIRD DATA')
branchData = line[start:end]
df = pd.read_csv(StringIO(branchData), sep=r'\n')

I am not sure how to approach this. Basically, I need to extract the text between /* and */ and ignore lines that don't contain /* and */.

Solution:

You can do away with regex if you have a single name per line:

import pandas as pd
names = []
filepath = "<PATH_TO_YOUR_FILE>"
with open(filepath, 'r') as f:            # open file for reading line by line
    for line in f:                        # read line by line
        start = line.find('/*[')          # get index of /*[ substring
        end = line.find(']*/', start+3)   # get index of ]*/ substring after found start+3 index
        if start >= 0 and end >= 0:       # if indices found OK
            names.append(line[start+3:end].strip()) # Put the value in between into names list

df = pd.DataFrame({'names': names})       # init the dataframe
print(df)
#    names
# 0  name1
# 1  name2
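
Since the question asks about regex and about limiting the search to the block between the two marker lines, here is a minimal sketch of that variant. The helper name `extract_names`, its parameters, and the inline `sample` text are assumptions made for illustration; they are not part of the original answer, and the pattern assumes at most one name per `/*[ ... ]*/` comment with no nested brackets.

```python
import re

# Matches /*[ ... ]*/ and captures the text inside, without surrounding whitespace.
NAME_RE = re.compile(r"/\*\[\s*(.*?)\s*\]\*/")


def extract_names(text, begin_marker=None, end_marker=None):
    """Return every /*[name]*/ value in text, optionally limited to one section.

    Hypothetical helper: begin_marker and end_marker are plain substrings;
    a marker that is not found is simply ignored.
    """
    if begin_marker is not None:
        pos = text.find(begin_marker)
        if pos >= 0:
            # Keep only what follows the begin marker.
            text = text[pos + len(begin_marker):]
    if end_marker is not None:
        pos = text.find(end_marker)
        if pos >= 0:
            # Keep only what precedes the end marker.
            text = text[:pos]
    return NAME_RE.findall(text)


# Sample data copied from the question, standing in for the real file contents.
sample = """0 / END OF ONE DATA, BEGIN SECOND DATA
361,315,0,'1 ',1,1,1,0,0,2,'NAT1    ',1,1115,1,0,0,0,0,0,0
0.0055501,0.12595,100
1,69,0,100,100,100,1,36,1.1,0.9,1.04283,1.001283,33,0,0,0,    /*[name1 ]*/
0.975,138
481,417,0,'1 ',1,1,1,0,0,2,'KAT1    ',1,115,1,0,0,0,0,0,0
0.00762817,0.14163,60
1,69,0,60,60,60,1,48,1.1,0.9,1.011735,0.917735,33,0,0,0,    /*[name2 ]*/
0 / END OF SECOND DATA, BEGIN THIRD DATA
"""

names = extract_names(
    sample,
    begin_marker="END OF ONE DATA, BEGIN SECOND DATA",
    end_marker="END OF SECOND DATA, BEGIN THIRD DATA",
)
print(names)  # ['name1', 'name2']
```

Passing the resulting list to `pd.DataFrame({'names': names})`, as in the answer above, gives the requested dataframe. For a large file, reading the whole text into memory may be undesirable; the line-by-line loop above avoids that, and the same compiled pattern could be applied per line with `NAME_RE.search(line)`.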
