I am performing analysis on this dataset.
After using the code below, I am left with the cleaned version of the data.
covid_df.drop(columns = ["Sno", "Time"], inplace = True)
covid_df["State/UnionTerritory"] = covid_df["State/UnionTerritory"].replace({
"Bihar****": "Bihar",
"Maharashtra***": "Maharashtra",
"Madhya Pradesh***": "Madhya Pradesh",
"Karanataka": "Karnataka",
"Telangana": "Telengana",
"Himanchal Pradesh": "Himachal Pradesh",
"Dadra and Nagar Haveli": "Dadra and Nagar Haveli and Daman and Diu",
"Daman & Diu": "Dadra and Nagar Haveli and Daman and Diu"
})
invalid_states = ["Cases being reassigned to states", "Unassigned"]
for invalid_state in invalid_states:
invalid_state_index = covid_df.loc[covid_df["State/UnionTerritory"] == invalid_state, :].index
covid_df.drop(index = invalid_state_index, inplace = True)
covid_df = covid_df.groupby(["State/UnionTerritory", "Date"], as_index = False).sum()
covid_df["Date"] = pd.to_datetime(covid_df["Date"])
covid_df.sort_values(by = ["State/UnionTerritory", "Date"], inplace = True)
This cleaned data has the cumulative cases for each State/UnionTerritory for each Date. How can I extract the daily new cases for each State/UnionTerritory?
One way I have in mind is by subtracting the previous row from the current row if both have the same State/UnionTerritory value. What would be the most efficient way to do this?
Would be very appreciated if you could suggest a better way of cleaning the data.
>Solution :
You could use shift.
For example:
df = pd.DataFrame({'cumul': [0, 2, 3, 5, 7]})
df['quantity'] = df - df.shift(1)
quantity will be:
quantity
0 NaN
1 2.0
2 1.0
3 2.0
4 2.0
You can then fillna or just change the zeroth value in quantity for the zeroth value in cumul.
Edit: prepare the dataframe by applying your conditions first 🙂