Unexpected result when comparing two files to report the differences between them

I'm working with two text files that are similar but not the same.

File 1:

GCF_000739415.1
GCF_001263815.1
GCF_001297745.1
...

File 2:

GCA_000739415.1
GCA_001263815.1
...

I'm looking for a specific pattern in each line to tell the entries apart; I'll call it the ID.
For example, the IDs in file 1: GCF_000739415.1, GCF_001263815.1, GCF_001297745.1
The IDs in file 2: GCA_000739415.1, GCA_001263815.1
The only difference between the IDs is GCF versus GCA. This prefix only reflects the database they come from; the numbers are the same.
However, file 2 has no counterpart of GCF_001297745.1 (that is, GCA_001297745.1), so my goal is to report which IDs are not in both files.
For example, "GC*_001297745.1 is not in file 2"

With this in mind, I'm using this code:

import re

# PART 1: Read both files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
#print(matches_1)
for match in matches_1:
    if match not in matches_2:
        print(f"{match} is not in both files")

My unexpected result is this:

GCF_000739415.1 is not in both files
GCF_001263815.1 is not in both files
GCF_001297745.1 is not in both files

But I need something like this:

GC*_001297745.1 is not in both files

I put * in place of the third character (F or A) because that difference doesn't matter. I'm looking for IDs that are not in both files. Any suggestion for fixing this unexpected result is welcome.

>Solution :

You could add a capture group to each regex so that GCF_ and GCA_ still take part in the search but are not part of the results. When a pattern contains a single group, re.findall returns only the text matched by that group.

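# Raw strings avoid invalid-escape warnings for \.
# [0-9]+ after the dot also captures multi-digit versions such as .10
matches_1 = set(re.findall(r"GCF_([0-9]+\.[0-9]+)", str(contents_1)))
matches_2 = set(re.findall(r"GCA_([0-9]+\.[0-9]+)", str(contents_2)))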
for match in matches_1:
    if match not in matches_2:
        print(f"GC*_{match} is not in both files")

Output

GC*_001297745.1 is not in both files
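
To see the effect of the capture group in isolation, here is a minimal sketch that uses the sample IDs from the question as inline strings instead of reading the files. The variable names `refseq_text` and `genbank_text` are made up for illustration, and the `[0-9]+` after the dot is an assumption that versions may have more than one digit.

```python
import re

# Sample IDs copied from the question; the variable names are illustrative only.
refseq_text = "GCF_000739415.1\nGCF_001263815.1\nGCF_001297745.1\n"
genbank_text = "GCA_000739415.1\nGCA_001263815.1\n"

# Without a group, findall returns the whole match, prefix included,
# so a GCF_ string can never equal a GCA_ string.
whole = re.findall(r"GCF_[0-9]+\.[0-9]+", refseq_text)

# With exactly one group, findall returns only the captured numeric part,
# which is comparable across the two files.
refseq_ids = set(re.findall(r"GCF_([0-9]+\.[0-9]+)", refseq_text))
genbank_ids = set(re.findall(r"GCA_([0-9]+\.[0-9]+)", genbank_text))

# IDs present in the first sample but absent from the second.
missing = sorted(refseq_ids - genbank_ids)
for num in missing:
    print(f"GC*_{num} is not in both files")
```

Because the prefix is outside the group, it still restricts what matches but no longer takes part in the comparison.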

I also converted the results to sets to avoid duplicates. Because they are sets, you can also write:

for match in matches_1.symmetric_difference(matches_2):
    print(f"GC*_{match} is not in both files")

I think this gives a better result, because your for loop only finds IDs that are in contents_1 but not in contents_2; it misses IDs that are in contents_2 but not in contents_1.
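
If you also want the message to say which file an ID is missing from, as in the "is not in file 2" example from the question, one possible sketch is to take the two one-sided set differences separately. The helper names `extract_ids` and `report_missing` are hypothetical, and `GCA_009999999.1` is an invented ID added only so that both directions produce a line.

```python
import re

def extract_ids(text, prefix):
    """Return the set of numeric parts of IDs such as GCF_000739415.1."""
    pattern = re.escape(prefix) + r"_([0-9]+\.[0-9]+)"
    return set(re.findall(pattern, text))

def report_missing(text_1, text_2):
    """Build one message per ID that appears in only one of the two texts."""
    ids_1 = extract_ids(text_1, "GCF")
    ids_2 = extract_ids(text_2, "GCA")
    lines = []
    # IDs found only in the first text are missing from file 2.
    for num in sorted(ids_1 - ids_2):
        lines.append(f"GC*_{num} is not in file 2")
    # IDs found only in the second text are missing from file 1.
    for num in sorted(ids_2 - ids_1):
        lines.append(f"GC*_{num} is not in file 1")
    return lines

# Sample data from the question, plus one invented GCA ID for illustration.
sample_1 = "GCF_000739415.1\nGCF_001263815.1\nGCF_001297745.1\n"
sample_2 = "GCA_000739415.1\nGCA_001263815.1\nGCA_009999999.1\n"

for line in report_missing(sample_1, sample_2):
    print(line)
```

With real data you would pass the contents of each file (for example from `f.read()`) instead of the inline samples.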
