How to drop identical columns in Pandas dataframe if first x rows of values are identical?

October 24, 2022

I’m working with a large dataset (921600 rows, 23 columns) that occasionally contains duplicate columns under different column names. I would like to remove the columns with identical values. However, ‘df.T.drop_duplicates().T’ and similar solutions take too long, presumably because they check all 921600 rows. Is it possible to remove columns when just the first x rows have identical values?

For example, identify that ‘channel2’ and ‘channel2-b’ are duplicates by comparing only the first x (say 10) rows instead of inspecting all million rows.

           channel1 channel2 channel3 channel2-b
0                47       46       27         46
1                84       28       28         28
2                72       79       68         79
...             ...      ...      ...        ...
999997         4729     1957     2986       1957
999998         9918     1513     2957       1513
999999         1001     5883     7577       5883

Solution:

Take the top N rows with DataFrame.head, transpose them so that each column becomes a row, flag the repeats with DataFrame.duplicated, and keep only the unflagged columns with DataFrame.loc:

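N = 2  # number of leading rows to compare
# keep only the columns whose first N values do not repeat an earlier column
df = df.loc[:, ~df.head(N).T.duplicated()]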
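
For anyone who wants to try this on a small frame first, here is a minimal sketch of the same approach wrapped in a function. The helper name drop_columns_duplicated_in_head and the six-row sample frame are assumptions made for illustration (the values are copied from the rows shown in the question), not part of the original answer.

```python
import pandas as pd


def drop_columns_duplicated_in_head(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Drop columns whose first n values repeat those of an earlier column."""
    # Transpose only the first n rows, so each original column becomes a row,
    # then flag every row that repeats an earlier one.
    is_repeat = df.head(n).T.duplicated()
    # Use a plain boolean array so selection is positional and does not
    # depend on the column labels being unique.
    return df.loc[:, ~is_repeat.to_numpy()]


# Hypothetical sample frame built from the rows visible in the question.
df = pd.DataFrame({
    "channel1": [47, 84, 72, 4729, 9918, 1001],
    "channel2": [46, 28, 79, 1957, 1513, 5883],
    "channel3": [27, 28, 68, 2986, 2957, 7577],
    "channel2-b": [46, 28, 79, 1957, 1513, 5883],
})

result = drop_columns_duplicated_in_head(df, n=3)
print(list(result.columns))  # ['channel1', 'channel2', 'channel3']
```

Note that two columns which agree on the first n rows but differ later would also be treated as duplicates, so n should be large enough for the data at hand. If in doubt, a flagged column can be compared in full against its match with Series.equals before it is dropped.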

byMR
