Remove all occurrences of duplicate lines in bash or Python and keep only the unique lines

I have already tried the solution suggested here, but it gives me an empty file, even though the file does contain unique, non-duplicated lines.

I have a large text file (2GB) containing very long strings in each line.

AB02819380213.   : (( 00 99   -   MO:ASKDJIO*U* HIUGHUHAHUHHA AUCCGTCTTCTTTTTTA FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF
a01219f8b
NJSAJDH*)8888-   + 99 100.    -   NKJJABHASDGASGYUOISADIJIJA  TCTCTCTTTCTACACTAATCACAATACTACA FFFFFFFFFFF
a023129ab
NJSAJDH*)8888-   + 99 100.    -   NKJJABHASDGASGYUOISADIJIJA  TCTCTCTTTCTACACTAATCACAATACTACA FFFFFFFFFFF
000axa2381a
AB02819380213.   : (( 00 99   -   MO:ASKDJIO*U* HIUGHUHAHUHHA AUCCGTCTTCTTTTTTA FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF

The expected output here would be

a01219f8b
a023129ab
000axa2381a

How can I do this in bash or Python?

Solution:

If you are not worried about the ordering of the output:

$ awk '{a[$0]++}END{for (i in a) if (a[i] == 1) print i}' file
000axa2381a
a01219f8b
a023129ab

The array a holds the occurrence count of each line; at the end, only the lines whose count is exactly 1 are printed.
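If you want the same result in Python, or need to preserve the original line order, here is a minimal two-pass sketch (assuming the per-line counts fit in memory; the input file name "file" is just a placeholder):

from collections import Counter

# First pass: count how many times each line occurs.
with open("file") as f:
    counts = Counter(line.rstrip("\n") for line in f)

# Second pass: print only the lines that occur exactly once,
# in their original order.
with open("file") as f:
    for line in f:
        line = line.rstrip("\n")
        if counts[line] == 1:
            print(line)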
