pandas.DataFrame.groupby loses index and messes up the data

I have a pandas.DataFrame (named df) with the following data:

          labels               texts
0         labelA  Some Text 12345678
1         labelA  Some Text 12345678
2         labelA  Some Text 12345678
3         labelA  Some Text 12345678
4         labelB  Some Text 12345678
5         labelB  Some Text 12345678
6         labelB  Some Text 12345678
7         labelC  Some Text 12345678
8         labelC  Some Text 12345678
9         labelC  Some Text 12345678
10        labelC  Some Text 12345678
11        labelC  Some Text 12345678
12        labelC  Some Text 12345678

When I group by label and sample two rows from each group with the following code, the original index is lost:

grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2))
print(result)

The output becomes:

                    labels               texts
labels
labelA    0         labelA  Some Text 12345678
          0         labelA  Some Text 12345678
          0         labelB  Some Text 12345678
          0         labelB  Some Text 12345678
          0         labelC  Some Text 12345678
          0         labelC  Some Text 12345678

I would like the output to be:

          labels               texts
0         labelA  Some Text 12345678
1         labelA  Some Text 12345678
2         labelB  Some Text 12345678
3         labelB  Some Text 12345678
4         labelC  Some Text 12345678
5         labelC  Some Text 12345678

How should I change the code to get this?

I tried result.droplevel(0).reset_index() as suggested in this answer, but the result becomes:

     index         labels               texts
0        0         labelA  Some Text 12345678
1        0         labelA  Some Text 12345678
2        0         labelB  Some Text 12345678
3        0         labelB  Some Text 12345678
4        0         labelC  Some Text 12345678
5        0         labelC  Some Text 12345678

Solution:

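Pass group_keys=False to DataFrame.groupby so that apply does not add the group labels as an extra index level and the original row index is kept: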

grouped = df.groupby('labels', group_keys=False)
result = grouped.apply(lambda x: x.sample(n=2))
print(result)

   labels               texts
0  labelA  Some Text 12345678
1  labelA  Some Text 12345678
4  labelB  Some Text 12345678
6  labelB  Some Text 12345678
9  labelC  Some Text 12345678
8  labelC  Some Text 12345678
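To make this easy to try, here is a minimal self-contained sketch that rebuilds the frame from the question and checks that the original row labels survive. The random_state argument and the names sampled and picked_labels are assumptions added for repeatability and illustration; they are not part of the original answer. Recent pandas versions may warn about, or leave out, the grouping column inside apply, so the sketch looks the labels up through the surviving index instead of relying on that column.

```python
import pandas as pd

# Rebuild the example frame from the question: 4 x labelA, 3 x labelB, 6 x labelC.
df = pd.DataFrame({
    "labels": ["labelA"] * 4 + ["labelB"] * 3 + ["labelC"] * 6,
    "texts": ["Some Text 12345678"] * 13,
})

# group_keys=False stops apply() from prepending the group label as an index level.
# random_state is an assumption added here so the sample is repeatable.
sampled = df.groupby("labels", group_keys=False).apply(
    lambda g: g.sample(n=2, random_state=0)
)

# The index is flat and still holds row labels from the original frame,
# so the sampled rows can be traced back to df.
picked_labels = df.loc[sampled.index, "labels"]
print(sampled.index.nlevels)                   # a single index level
print(picked_labels.value_counts().to_dict())  # two rows per label
```

Because the index values are the original ones, df.loc[sampled.index] returns the full sampled rows, which is handy when other columns are needed later.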

Another option is to discard the resulting index entirely and replace it with a default RangeIndex:

grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2)).reset_index(drop=True)
print(result)

   labels               texts
0  labelA  Some Text 12345678
1  labelA  Some Text 12345678
2  labelB  Some Text 12345678
3  labelB  Some Text 12345678
4  labelC  Some Text 12345678
5  labelC  Some Text 12345678
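As a possible alternative that is not part of the answer above, pandas 1.1 and later provide a sample method directly on the groupby object, which avoids apply altogether. The sketch below is only an illustration under that version assumption: random_state and the names sampled and renumbered are added here, not taken from the question. It draws two rows per label and then renumbers them with reset_index(drop=True).

```python
import pandas as pd

# Rebuild the example frame from the question: 4 x labelA, 3 x labelB, 6 x labelC.
df = pd.DataFrame({
    "labels": ["labelA"] * 4 + ["labelB"] * 3 + ["labelC"] * 6,
    "texts": ["Some Text 12345678"] * 13,
})

# DataFrameGroupBy.sample (pandas >= 1.1) draws n rows from every group and
# keeps the original row labels; random_state is an assumption for repeatability.
sampled = df.groupby("labels").sample(n=2, random_state=0)

# drop=True throws the old row labels away instead of turning them into an
# "index" column, leaving a fresh 0..5 RangeIndex.
renumbered = sampled.reset_index(drop=True)
print(renumbered)
```

Without drop=True, reset_index moves the old row labels into a new column named index, which is the extra column seen in the question's attempt.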