How can I remove duplicates only once an X number of occurrences is reached with awk?

I know how to use awk to remove duplicate lines in a file:

awk '!x[$0]++' myfile.txt
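This works because `x[$0]++` evaluates to 0 (falsy) the first time a line is seen, so awk's default print action fires only on first occurrences. A quick sketch with piped sample input:

```shell
# Each line is printed only the first time it appears:
# x[$0]++ is 0 (falsy) on first sight, so !x[$0]++ is true once per line.
printf 'apple\napple\nbanana\napple\n' | awk '!x[$0]++'
# prints:
# apple
# banana
```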

But how can I remove the duplicates only if there are more than two occurrences of this duplicate?


For example:

apple
apple
banana
apple
pear
banana
cherry

would become:

banana
pear
banana
cherry

Thanks in advance!

>Solution:

I would harness GNU AWK for this task in the following way. Let file.txt content be

apple
apple
banana
apple
pear
banana
cherry

then

awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt

gives output

banana
pear
banana
cherry

Explanation: This is a 2-pass approach. FNR==NR (current file's record number equal to total record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, under the key being the whole line ($0), by 1, then I instruct GNU AWK to go to the next line, as I do not want to do anything else during the first pass. In the second pass, only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional, as the same file is read twice.

(tested in gawk 4.2.1)
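If the input arrives on a pipe rather than in a file, the two-pass trick is not available. A hedged one-pass sketch instead buffers every line in memory, counts occurrences, and prints the surviving lines in order at END (this trades the second read for memory proportional to the input size):

```shell
# One-pass variant: line[] preserves input order, cnt[] counts occurrences.
# In the END rule, NR still holds the total number of records read.
printf 'apple\napple\nbanana\napple\npear\nbanana\ncherry\n' |
awk '{line[NR]=$0; cnt[$0]++}
     END{for(i=1;i<=NR;i++) if(cnt[line[i]]<=2) print line[i]}'
# prints:
# banana
# pear
# banana
# cherry
```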
