I have about 2000 files in a directory on a Linux server. In each file, the positions x-y hold invoice numbers. What is the best way to check whether there are duplicates across these files and print the file names and values? A simplified version of the problem:
(base) jay:unq jayadevanmaymala$ cat a.txt
xyz1234
xyz1234
pqr4567
(base) jay:unq jayadevanmaymala$ cat ba.txt
lon9876
lon9876
lon4567
In the above 2 files, assuming that the invoice numbers are in positions 4-7, we have a duplicate: "4567" appears in both a.txt and ba.txt. Duplicates within the same file, such as 1234 in a.txt, are fine and need not be printed. I tried to cut the invoice numbers, but the output doesn't have file names. My plan was to cut the invoice numbers, get the file names along with them, deduplicate the output, and so on.
>Solution :
Perl to the rescue!
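perl -lne '
    # Record every file in which the invoice number (characters 4-7) appears.
    $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
    END {
        # Iterate over the keys only; a bare %in_file would also yield the values.
        for $invoice (keys %in_file) {
            # Print only the invoice numbers that were seen in more than one file.
            print join "\t", $invoice, keys %{ $in_file{$invoice} }
                if keys %{ $in_file{$invoice} } > 1;
        }
    }
' -- *txt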
- -n reads the input files line by line, running the code for each line;
- -l removes newlines from the input and adds them to printed lines;
- $ARGV contains the name of the currently open file;
- we build a hash of hashes, the first level key is the invoice number, the second level key is the file it was found in;
- see substr for the details on how to extract the invoice number;
- at the end of all input, we print the keys (i.e. invoice numbers) that have more than one file associated with them, followed by the names of those files.