How do I check if there are duplicate values across files at a specific position?

I have about 2000 files in a directory on a Linux server. In each file, positions x-y hold invoice numbers. What is the best way to check whether there are duplicates across these files and print the file names and values? A simplified version of the problem:

(base) jay:unq jayadevanmaymala$ cat a.txt 
xyz1234
xyz1234
pqr4567
(base) jay:unq jayadevanmaymala$ cat ba.txt 
lon9876
lon9876
lon4567

In the above 2 files, assuming that the invoice numbers are in positions 4-7, we have a duplicate: "4567" appears in both a.txt and ba.txt. Duplicates within the same file, such as 1234 in a.txt, are fine and need not be printed. I tried to cut the invoice numbers, but the output doesn't have file names. My plan was to cut, get the file names along with the invoice numbers, run a unique on the output, etc.

>Solution :

Perl to the rescue!

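perl -lne '
    # Record each file in which this invoice number (4 chars at offset 3) appears.
    $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
    END {
        # Iterate over the keys only; looping over %in_file would also visit the values.
        for $invoice (keys %in_file) {
            print join "\t", $invoice, keys %{ $in_file{$invoice} }
                if keys %{ $in_file{$invoice} } > 1;
        }
    }
' -- *txt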
  • -n reads the input files line by line, running the code for each line;
  • -l removes newlines from the input and adds them to printed lines;
  • $ARGV contains the name of the currently open file;
  • we build a hash of hashes: the first-level key is the invoice number, the second-level key is the file it was found in;
  • substr $_, 3, 4 extracts 4 characters starting at zero-based offset 3, i.e. positions 4-7 of the line (see the substr documentation for details);
  • at the end of all input, we print the keys (i.e. invoice numbers) that have more than one file associated with them.
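
For comparison, the same hash-of-files idea can be sketched in plain shell with awk, which is closer to the "cut, keep the file names, then unique" plan from the question. This is only a sketch under assumptions: the function name cross_file_dups and its arguments (1-based start position, field length, then the files) are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: print each value found at a fixed position in more than
# one file, followed by the names of the files that contain it.
# Usage: cross_file_dups START LENGTH FILE...   (START is 1-based)
cross_file_dups() {
    start=$1; len=$2; shift 2
    awk -v s="$start" -v n="$len" '
        { key = substr($0, s, n) }
        # Count each (value, file) pair only once, so repeats inside one file are ignored.
        !seen[key, FILENAME]++ { count[key]++; files[key] = files[key] "\t" FILENAME }
        # Report only the values that were seen in at least two different files.
        END { for (k in count) if (count[k] > 1) print k files[k] }
    ' "$@"
}

# Example call, assuming invoice numbers occupy characters 4-7 of each line:
# cross_file_dups 4 4 *.txt
```

Note that awk's substr counts from 1, whereas Perl's substr counts from 0, so the start argument here is 4 rather than 3. The order in which several duplicated values are printed is not guaranteed.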