How to reduce the size of my dataframe in Python?

May 15, 2022

I am working on an NLP problem and ended up with a large feature dataset:

dfMethod
Out[2]: 
      c0000167  c0000294  c0000545  ...  c4721555  c4759703  c4759772
0            0         0         0  ...         0         0         0
1            0         0         0  ...         0         0         0
2            0         0         0  ...         0         0         0
3            0         0         0  ...         0         0         0
4            0         0         0  ...         0         0         0
       ...       ...       ...  ...       ...       ...       ...
3995         0         0         0  ...         0         0         0
3996         0         0         0  ...         0         0         0
3997         0         0         0  ...         0         0         0
3998         0         0         0  ...         0         0         0
3999         0         0         0  ...         0         0         0

[4000 rows x 14317 columns]

I want to remove the columns with the fewest occurrences (i.e. the columns with the smallest sum over all records).

So if my column sums looked like this:

Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628

then in the end I want to keep only the top 5000 columns, ranked by the sum of each column.

How can I do that?

>Solution :

Let’s say you have a big dataframe called big_df. You can get the N columns with the largest sums like this:

N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]

Breaking this down:

top_columns = (
    big_df.sum()  # gives the column sums you mentioned
    .sort_values(ascending=False)  # sort the sums in descending order
    .index  # .sum() defaults to axis=0, so the index holds your column names
    [:N]  # grab the first N labels
)
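
Note that selecting with the sorted index returns the columns ordered by descending sum, not in their original order. If the original order matters, a possible variant is sketched below. It is only an illustration: the tiny dataframe is made-up data reusing a few column names from the question, N is set to 3 instead of 5000, and the names top_cols and reduced are hypothetical. It uses Series.nlargest to pick the labels and a boolean mask to keep the original column order.

```python
import pandas as pd

# Made-up toy data standing in for the real 4000 x 14317 feature matrix.
big_df = pd.DataFrame(
    {
        "c0000167": [3, 4, 0],  # sum 7
        "c0000294": [5, 3, 0],  # sum 8
        "c0000545": [0, 1, 1],  # sum 2
        "c4721555": [0, 0, 1],  # sum 1
        "c4759703": [4, 5, 0],  # sum 9
    }
)

N = 3  # number of columns to keep (5000 in the question)

# Labels of the N columns with the largest sums.
top_cols = big_df.sum().nlargest(N).index

# Boolean mask over the columns, so the original left-to-right order is kept.
reduced = big_df.loc[:, big_df.columns.isin(top_cols)]

print(list(reduced.columns))
```

If several columns tie at the cutoff, nlargest keeps the ones that appear first by default, so the result still has exactly N columns.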