How can I remove duplicates only once an X number of occurrences is reached with awk?

I know how to use awk to remove duplicate lines in a file:

awk '!x[$0]++' myfile.txt
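This works because `x[$0]++` evaluates to 0 (falsy) the first time a line is seen, so awk's default print action fires only on first occurrences. A quick sketch with piped sample input:

```shell
# Each line is printed only the first time it appears:
# x[$0]++ is 0 (falsy) on first sight, so !x[$0]++ is true once per line.
printf 'apple\napple\nbanana\napple\n' | awk '!x[$0]++'
# prints:
# apple
# banana
```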

But how can I remove the duplicates only if there are more than two occurrences of this duplicate?


For example:

apple
apple
banana
apple
pear
banana
cherry

would become:

banana
pear
banana
cherry

Thanks in advance!

>Solution:

I would harness GNU AWK for this task in the following way. Let file.txt content be

apple
apple
banana
apple
pear
banana
cherry

then

awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt

gives output

banana
pear
banana
cherry

Explanation: This is a 2-pass approach. FNR==NR (current file's record number equal to total record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, under the key being the whole line ($0), by 1, then I instruct GNU AWK to go to the next line, as I do not want to do anything else during the first pass. In the second pass, only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional, as the same file is read twice.

(tested in gawk 4.2.1)
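If the input arrives on a pipe rather than in a file, the two-pass trick is not available. A hedged one-pass sketch instead buffers every line in memory, counts occurrences, and prints the surviving lines in order at END (this trades the second read for memory proportional to the input size):

```shell
# One-pass variant: line[] preserves input order, cnt[] counts occurrences.
# In the END rule, NR still holds the total number of records read.
printf 'apple\napple\nbanana\napple\npear\nbanana\ncherry\n' |
awk '{line[NR]=$0; cnt[$0]++}
     END{for(i=1;i<=NR;i++) if(cnt[line[i]]<=2) print line[i]}'
# prints:
# banana
# pear
# banana
# cherry
```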
