I’m trying to read through a large file in which I have marked the start and end lines of each segment. I’m extracting a component of each segment using regex.
What I don’t understand is that after the first inner loop, my code seems to have closed the file and I don’t get the desired output.
Simplified code below
with open("data_full", 'r') as file:
    for x in position:
        print(x)
        s = position[x]['start']
        e = position[x]['end']
        title = []
        abs = []
        mesh = []
        ti_prev = False
        for i,line in enumerate(file.readlines()[s:e]):
            print(i)
            print(s,e)
            if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
                title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
                ti_prev = True
                line_mark = i
            if re.search(r'(?<=\s{6}).*',line) is not None and ti_prev is True and i == (line_mark+1):
                title.append(re.search(r'(?<=\s{6}).*',line).group())
            else:
                pass
        data[x]['title']=title
What I think has happened is that, after the first inner loop, file.readlines() no longer works because the file has been closed. But I don’t understand why, since it’s still inside my with open block.
My alternative is to reopen and read the file for each segment (9k+ segments), which is not doing wonders for my performance.
Any suggestions are welcome, thanks!
>Solution :
The file is not closed. file.readlines() reads from the current position to the end of the file and returns a list of lines, which leaves the file position at the end. On the first pass through the outer loop you get every line. On every later pass there is nothing left to read, so file.readlines() returns an empty list, the slice [s:e] is empty, and the inner loop body never runs.
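A minimal sketch of that behaviour, using a throwaway temporary file with made-up contents (the file name and the three sample lines are assumptions for illustration, not your data):

```python
import os
import tempfile

# Create a small throwaway file (hypothetical sample data)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line 0\nline 1\nline 2\n")
    path = tmp.name

with open(path, "r") as file:
    first = file.readlines()       # reads everything; position is now at end of file
    second = file.readlines()      # nothing left to read, so this is an empty list
    still_open = not file.closed   # the file is still open inside the with block
    file.seek(0)                   # rewind to the start of the file
    third = file.readlines()       # the full contents again

os.remove(path)

print(first)
print(second)
print(still_open)
print(third)
```

Calling file.seek(0) before each file.readlines() would also make your original loop work, but it re-reads the whole file for every segment, which is likely to be slow with 9k+ segments.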
To fix this, you can move the call to file.readlines() outside of the outer for loop, so the file is read exactly once and its lines are kept in a list. Then, inside the for loop, slice that list with [s:e] and use enumerate on the slice to loop through the lines of the segment.
Here’s an example of how you could modify your code to fix the issue:
import re

# Read the entire file once and store the lines in a list
with open("data_full", 'r') as file:
    lines = file.readlines()

# Loop through the positions in the `position` dictionary
# (`position` and `data` are assumed to be defined as in your question)
for x in position:
    # Get the start and end indices of the current segment
    s = position[x]['start']
    e = position[x]['end']
    # Initialize variables to store the title, abstract, and mesh terms
    title = []
    abs = []  # note: this name shadows the built-in abs()
    mesh = []
    # Set a flag to track whether the title has been found
    ti_prev = False
    # Loop through the lines in the current segment
    for i, line in enumerate(lines[s:e]):
        # Check if the current line is a title line
        if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
            # If it is a title line, store it in the `title` list and set the flag
            title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
            ti_prev = True
            line_mark = i
        # Check if the current line is a continuation of the title
        if re.search(r'(?<=\s{6}).*',line) is not None and ti_prev is True and i == (line_mark+1):
            # If it is, store it in the `title` list
            title.append(re.search(r'(?<=\s{6}).*',line).group())
    # Store the title in the `data` dictionary
    data[x]['title'] = title
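If you want to check the approach without the real file, here is a self-contained sketch of the same idea. The sample lines, the segment boundaries, and the helper name extract_title are all made up for illustration; only the two regular expressions come from your code. It also compiles each pattern once and runs each search once per line, which may help a little across 9k+ segments.

```python
import re

# Hypothetical sample in a layout similar to the question's; the real file will differ
lines = [
    "PMID- 1\n",
    "TI  - First title that wraps\n",
    "      onto a second line\n",
    "AB  - Some abstract\n",
    "\n",
    "PMID- 2\n",
    "TI  - Second title\n",
    "AB  - Another abstract\n",
]
# Hypothetical segment boundaries, used as slice indices into `lines`
position = {"a": {"start": 0, "end": 4}, "b": {"start": 5, "end": 8}}

# The two patterns from the question, compiled once
TITLE_RE = re.compile(r'(?<=TI\s{2}-\s).*')
CONT_RE = re.compile(r'(?<=\s{6}).*')

def extract_title(segment):
    """Return the first title line plus at most one continuation line."""
    title = []
    line_mark = None  # index of the title line once it has been found
    for i, line in enumerate(segment):
        if line_mark is None:
            m = TITLE_RE.search(line)
            if m:
                title.append(m.group())
                line_mark = i
        elif i == line_mark + 1:
            # Only the line directly after the title can be a continuation
            m = CONT_RE.search(line)
            if m:
                title.append(m.group())
    return title

data = {
    x: {"title": extract_title(lines[p["start"]:p["end"]])}
    for x, p in position.items()
}
print(data)
```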
Hope this helps!