Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Replace duplicate values across columns of CSV file

I have a headerless CSV file that is sorted on the first column. When the 2nd and 3rd columns are identical, I want to "collapse" them into one – i.e. replace the last column with a comma, which would combine with the first comma to make ,,, indicating the third column is empty. In other words, this:

0000001,11111,66666
0000002,12121,22222
0000003,33333,33333
0000004,74747,44444
0000005,12345,12345

…becomes this:

0000001,11111,66666
0000002,12121,22222
0000003,33333,,
0000004,74747,44444
0000005,12345,,

I’ve tried various permutations of grep and cut but can’t get anything to work – the closest I’ve come is cut -c 8-19 file.csv, which just isolates the 2nd and 3rd columns. I have a feeling needing to do this across columns and needing to replace the value instead of just delete the whole line makes this complicated enough to require awk or sed, and I don’t know enough about either to know how to approach this.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Using sed

$ sed -E 's/([^,]*,([^,]*),)\2/\1,/' input_file
0000001,11111,66666
0000002,12121,22222
0000003,33333,,
0000004,74747,44444
0000005,12345,,
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading