How to reduce the size of my dataframe in Python?

I'm working on an NLP problem and ended up with a large feature dataset:

dfMethod
Out[2]: 
      c0000167  c0000294  c0000545  ...  c4721555  c4759703  c4759772
0            0         0         0  ...         0         0         0
1            0         0         0  ...         0         0         0
2            0         0         0  ...         0         0         0
3            0         0         0  ...         0         0         0
4            0         0         0  ...         0         0         0
       ...       ...       ...  ...       ...       ...       ...
3995         0         0         0  ...         0         0         0
3996         0         0         0  ...         0         0         0
3997         0         0         0  ...         0         0         0
3998         0         0         0  ...         0         0         0
3999         0         0         0  ...         0         0         0

[4000 rows x 14317 columns]

I want to remove the columns that occur least often (i.e. the columns with the smallest sum over all records).

For example, suppose my column sums look like this:

Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628

In the end, I want to keep only the top 5000 columns, ranked by the sum of each column.

How can I do that?

>Solution :

Let's say you have a big dataframe called big_df. You can select its top N columns by sum with the following:

N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]

Breaking this down:

big_df.sum()  # Gives the sums you mentioned
.sort_values(ascending=False)  # Sort the sums in descending order
.index  # .sum() defaults to axis=0, so the index of the result holds your column names
[:N]  # grab first N items
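
To make the steps above concrete, here is a minimal sketch on a tiny stand-in dataframe. The values and the small N are made up for illustration and are not taken from the question. It also swaps sort_values(...)[:N] for Series.nlargest(N), which gives the same top-N labels, and it keeps the surviving columns in their original order, which is an optional choice the answer above does not make.

```python
import pandas as pd

# Toy stand-in for the real 4000 x 14317 feature matrix (hypothetical values).
big_df = pd.DataFrame({
    "c0000167": [3, 4, 0],
    "c0000294": [5, 5, 1],
    "c0000545": [0, 1, 0],
    "c4721555": [0, 0, 0],
    "c4759703": [9, 2, 6],
})

N = 3  # would be 5000 for the real data

# Per-column sums, then the labels of the N largest sums.
top_cols = big_df.sum().nlargest(N).index

# Boolean mask over the columns keeps them in their original order.
reduced = big_df.loc[:, big_df.columns.isin(top_cols)]

print(reduced.columns.tolist())
```

If the order of the kept columns does not matter, big_df[top_cols] works as well and returns them sorted by descending sum.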