How to drop duplicate columns in a Pandas DataFrame if the first x rows of values are identical?

I’m working with a large dataset (921600 rows, 23 columns) that contains the occasional duplicate column (same values, different column names). I would like to remove the columns with identical values. However, ‘df.T.drop_duplicates().T’ and similar solutions take too long, presumably because they check all 921600 rows. Is it possible to remove columns when just the first x rows have identical values?

E.g.: Identify that ‘channel2’ and ‘channel2-b’ are duplicates by comparing the first x (say 10) rows instead of inspecting every row.

           channel1 channel2 channel3 channel2-b
0                47       46       27         46
1                84       28       28         28
2                72       79       68         79
...             ...      ...      ...        ...
999997         4729     1957     2986       1957
999998         9918     1513     2957       1513
999999         1001     5883     7577       5883

>Solution :

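Take the first N rows with DataFrame.head, transpose them so that each column becomes a row, flag repeated rows with DataFrame.duplicated, and keep only the non-duplicated columns with DataFrame.loc: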

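N = 2  # number of leading rows to compare; a larger N lowers the risk of false matches
# Transpose the first N rows so columns become rows, then keep columns not flagged as repeats
df = df.loc[:, ~df.head(N).T.duplicated()]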
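
One caveat: two columns that agree on the first N rows may still differ further down, and the one-liner above would drop the later one anyway. The following is a minimal sketch, not part of the original answer, that uses the first n rows only to find candidates and then confirms each candidate with a full-column comparison. The function name drop_head_duplicate_columns, the sample values and the "decoy" column are hypothetical and exist only for illustration.

```python
import pandas as pd


def drop_head_duplicate_columns(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Drop columns that fully duplicate an earlier column.

    The first n rows act as a cheap pre-filter: a full comparison is run
    only against earlier columns that already match on those rows.
    (Hypothetical helper, not a pandas API.)
    """
    head = df.head(n)
    kept = []   # positions of the columns to keep
    seen = {}   # tuple of first-n values -> positions of kept columns with that prefix
    for pos in range(df.shape[1]):
        # Note: NaN never compares equal to itself, so columns with missing
        # values in the first n rows are kept by this sketch.
        key = tuple(head.iloc[:, pos])
        candidates = seen.setdefault(key, [])
        # Series.equals compares values and dtype over the whole column.
        if any(df.iloc[:, pos].equals(df.iloc[:, other]) for other in candidates):
            continue  # confirmed duplicate of an earlier column
        candidates.append(pos)
        kept.append(pos)
    return df.iloc[:, kept]


# Hypothetical sample: "channel2-b" is a true duplicate of "channel2",
# while "decoy" only matches "channel2" on the first three rows.
df = pd.DataFrame({
    "channel1": [47, 84, 72, 4729, 9918, 1001],
    "channel2": [46, 28, 79, 1957, 1513, 5883],
    "channel3": [27, 28, 68, 2986, 2957, 7577],
    "channel2-b": [46, 28, 79, 1957, 1513, 5883],
    "decoy": [46, 28, 79, 0, 0, 0],
})

result = drop_head_duplicate_columns(df, n=3)
head_only = df.loc[:, ~df.head(3).T.duplicated()]
print(list(result.columns))
print(list(head_only.columns))
```

Here the verified version keeps "decoy", whereas the head-only one-liner with N = 3 drops it together with "channel2-b". The full comparison runs only on columns whose first n rows already match, so on data with few duplicates the extra cost should stay small.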