How do you create a dataframe from a csv file and be able to execute the same groupby command that simply works when creating the dataframe by hand?
Step-by-step i tried to do:
I create a pandas dataframe with multiindexed columns:
df = pd.DataFrame({
('A', 'x'): [1, 2, 3, 4],
('A', 'y'): [10, 15, 20, 25],
('B', 'x'): [7, 8, 9, 10],
('B', 'y'): [1, 1, 2, 2],
})
I then add a new column with some "metadata" that has only one column identifier:
df["C"] = ["F1", "F1", "F2", "F2"]
Results in:
A B C
x y x y
0 1 10 7 1 F1
1 2 15 8 1 F1
2 3 20 9 2 F2
3 4 25 10 2 F2
And i group my data on this column "C" – this works:
grouped = df.groupby("C").sum()
A B
x y x y
C
F1 3 25 15 2
F2 7 45 19 4
now i store the dataframe as csv file and read it again:
df.to_csv("test.csv")
df2 = pd.read_csv("test.csv", header=[0, 1], index_col=0)
A B C
x y x y Unnamed: 5_level_1
0 1 10 7 1 F1
1 2 15 8 1 F1
2 3 20 9 2 F2
3 4 25 10 2 F2
executing the same groupby command on the read-in dataframe results in an error:
df2.groupby("C").sum()
File "xxx\py311\Lib\site-packages\pandas\core\groupby\grouper.py", line 980, in get_grouper
raise ValueError(f"Grouper for '{name}' not 1-dimensional")
ValueError: Grouper for 'C' not 1-dimensional
I just cant seem to work out what i need to do to be able to groupby the column after reading in the data from csv?
>Solution :
Pandas ignores the trailing empty string levels* during the groupby.
grouped.index
# Index(['F1', 'F2'], dtype='object', name='C') <- not a tuple
You can rename your "Unnamed" columns to empty string when importing the data to get the previous behavior:
df2 = (pd.read_csv("test.csv", header=[0, 1], index_col=0)
.rename(columns=lambda x: '' if x.startswith('Unnamed:') else x)
)
df2.groupby('C').sum()
Output:
A B
x y x y
C
F1 3 25 15 2
F2 7 45 19 4
*This means that a ('C', '', 'X', '') name could be matched by ('C', '', 'X'), but not by 'C'.