I have a Dataset downloaded from Kaggle for my project, I would like to seek help in creating new columns and assigning its values based on an existing column.
My actual Dataset is complicated, I will give a similar but simpler dataset for easy discussion.
Input:
Month | Fruit | Weight
------- -------- --------
1-2020 | Orange | 0.2
1-2020 | Kiwi | 0.9
2-2020 | Orange | 2.1
2-2020 | Kiwi | 1.4
...... | ..... | ...
To be able to create a required line chart, I need to change this Dataset structure, making "Orange", "Kiwi" new columns with Weight values, so that "Month' is not repeated.
Desired output:
Month | Orange| Kiwi
------- -------- --------
1-2020 | 0.2 | 0.9
2-2020 | 2.1 | 1.4
Any help would be greatly appreciated
>Solution :
The basic thing you’re looking for is a pivot:
import pandas as pd
data = {
'Month': ['1-2020', '1-2020', '2-2020', '2-2020'],
'Fruit': ['Orange', 'Kiwi', 'Orange', 'Kiwi'],
'Weight': [0.2, 0.9, 2.1, 1.4]
}
df = pd.DataFrame(data)
pivot_df = df.pivot_table(index='Month', columns='Fruit', values='Weight')
print(pivot_df)
Output:
Fruit Kiwi Orange
Month
1-2020 0.9 0.2
2-2020 1.4 2.1
If you want Month as a column instead of the index:
pivot_df.reset_index(inplace=True)
pivot_df.columns.name = None
print(pivot_df)
Output:
Month Kiwi Orange
0 1-2020 0.9 0.2
1 2-2020 1.4 2.1
If there are duplicate combinations of Month and Fruit in the original data, and you want to sum them (for example):
data = {
'Month': ['1-2020', '1-2020', '1-2020', '2-2020', '2-2020'],
'Fruit': ['Orange', 'Kiwi', 'Kiwi', 'Orange', 'Kiwi'],
'Weight': [0.2, 0.9, 1.1, 2.1, 1.4]
}
df = pd.DataFrame(data)
pivot_df = df.pivot_table(index='Month', columns='Fruit', values='Weight', aggfunc='sum')
print(pivot_df)
Output:
Fruit Kiwi Orange
Month
1-2020 2.0 0.2
2-2020 1.4 2.1