Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

groupby on pandas dataframe created with pandas.read_csv

How do you create a dataframe from a csv file and be able to execute the same groupby command that simply works when creating the dataframe by hand?

Step-by-step i tried to do:

I create a pandas dataframe with multiindexed columns:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df = pd.DataFrame({
    ('A', 'x'): [1, 2, 3, 4],
    ('A', 'y'): [10, 15, 20, 25],
    ('B', 'x'): [7, 8, 9, 10],
    ('B', 'y'): [1, 1, 2, 2],
})

I then add a new column with some "metadata" that has only one column identifier:

df["C"] = ["F1", "F1", "F2", "F2"]

Results in:

   A       B      C
   x   y   x  y    
0  1  10   7  1  F1
1  2  15   8  1  F1
2  3  20   9  2  F2
3  4  25  10  2  F2

And i group my data on this column "C" – this works:

grouped = df.groupby("C").sum()

    A       B   
    x   y   x  y
C               
F1  3  25  15  2
F2  7  45  19  4

now i store the dataframe as csv file and read it again:

df.to_csv("test.csv")
df2 = pd.read_csv("test.csv", header=[0, 1], index_col=0)

   A       B                     C
   x   y   x  y Unnamed: 5_level_1
0  1  10   7  1                 F1
1  2  15   8  1                 F1
2  3  20   9  2                 F2
3  4  25  10  2                 F2

executing the same groupby command on the read-in dataframe results in an error:

df2.groupby("C").sum()

  File "xxx\py311\Lib\site-packages\pandas\core\groupby\grouper.py", line 980, in get_grouper
    raise ValueError(f"Grouper for '{name}' not 1-dimensional")
ValueError: Grouper for 'C' not 1-dimensional

I just cant seem to work out what i need to do to be able to groupby the column after reading in the data from csv?

>Solution :

Pandas ignores the trailing empty string levels* during the groupby.

grouped.index
# Index(['F1', 'F2'], dtype='object', name='C') <- not a tuple

You can rename your "Unnamed" columns to empty string when importing the data to get the previous behavior:

df2 = (pd.read_csv("test.csv", header=[0, 1], index_col=0)
         .rename(columns=lambda x: '' if x.startswith('Unnamed:') else x)
       )

df2.groupby('C').sum()

Output:

    A       B   
    x   y   x  y
C               
F1  3  25  15  2
F2  7  45  19  4

*This means that a ('C', '', 'X', '') name could be matched by ('C', '', 'X'), but not by 'C'.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading