I know how to use awk to remove duplicate lines in a file:
awk '!x[$0]++' myfile.txt
But how can I remove a line only if it occurs more than twice?
For example:
apple
apple
banana
apple
pear
banana
cherry
would become:
banana
pear
banana
cherry
Thanks in advance!
>Solution :
I would harness GNU AWK for this task in the following way. Let file.txt content be
apple
apple
banana
apple
pear
banana
cherry
then
awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt
gives output
banana
pear
banana
cherry
Explanation: This is a 2-pass approach. FNR==NR (current file's record number equals the total record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, keyed by the whole line ($0), by 1, and then instruct GNU AWK to go to the next line, as I do not want to do anything else with the 1st pass. In the 2nd pass, only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional.
(tested in gawk 4.2.1)
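If the input arrives on a pipe and cannot be read twice, a single-pass variant (a sketch, not part of the original answer) can buffer the lines in memory and print the qualifying ones at the end. Note this holds the whole file in the line array, so it trades memory for the second read:

```shell
# Count every line while buffering them in input order;
# in END, print only lines whose total count is at most 2.
printf 'apple\napple\nbanana\napple\npear\nbanana\ncherry\n' |
awk '{cnt[$0]++; line[NR]=$0}
     END{for(i=1;i<=NR;i++) if(cnt[line[i]]<=2) print line[i]}'
```

This prints banana, pear, banana, cherry, matching the 2-pass result.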