I'm working on an NLP problem.

I ended up with a big features dataset:

```
dfMethod
Out[2]:
c0000167 c0000294 c0000545 ... c4721555 c4759703 c4759772
0 0 0 0 ... 0 0 0
1 0 0 0 ... 0 0 0
2 0 0 0 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
... ... ... ... ... ... ...
3995 0 0 0 ... 0 0 0
3996 0 0 0 ... 0 0 0
3997 0 0 0 ... 0 0 0
3998 0 0 0 ... 0 0 0
3999 0 0 0 ... 0 0 0
[4000 rows x 14317 columns]
```

I want to remove the columns with the smallest repetition (i.e. the columns with the smallest sum over all records).

So if my column sums look like this:

```
Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628
```

In the end, I want to keep only the top 5000 columns based on the sum of each column.

How can I do that?

### >Solution :

Let’s say you have a big dataframe `big_df`.

You can get the top columns with the following:

```
N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]
```

Breaking this down:

```
(
    big_df.sum()                   # gives the sums you mentioned
    .sort_values(ascending=False)  # sorts the sums in descending order
    .index                         # .sum() defaults to axis=0, so this index holds your column names
    [:N]                           # grabs the first N labels
)
```
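
To see the idea on something small, here is a minimal sketch on a toy dataframe. The helper name `keep_top_columns`, the frame `toy_df`, and its values are made up for illustration. It swaps the sort-and-slice step for `Series.nlargest`, which should pick the same columns, and it uses a boolean mask so the kept columns stay in their original left-to-right order instead of being reordered by sum.

```python
import pandas as pd


def keep_top_columns(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n columns with the largest sums, in their original order."""
    # Labels of the n largest column sums (hypothetical helper, not a pandas API)
    top = df.sum().nlargest(n).index
    # The boolean mask preserves the original column order
    return df.loc[:, df.columns.isin(top)]


# Toy data: the column sums are 7, 9, 1, 1 and 10
toy_df = pd.DataFrame({
    "c0000167": [3, 4, 0],
    "c0000294": [5, 3, 1],
    "c0000545": [0, 1, 0],
    "c4721555": [0, 0, 1],
    "c4759703": [6, 2, 2],
})

result = keep_top_columns(toy_df, 3)
print(list(result.columns))
```

With real data you would call it as `keep_top_columns(big_df, 5000)`. If several columns tie at the cutoff, `nlargest` keeps the ones that appear first by default, so it is worth checking whether that matters for your features.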