I have a dataframe like as below
df = pd.DataFrame({
"Name": ["Tim", "Tim", "Tim", "Tim", "Tim",'Jack','Jack','Jack'],
"Status": ["A1", "E1", "B3", "D4", "C90","A1","C90","B3"]
})
The actual order of my status variable is B3 < A1 < D4 < C90 < E1.
So the last value is E1 and 1st value is B3.
I would like to do the below
a) groupby Name
a) order the values based on categorical ordering shown above
c) Keep only the last value (after dropping duplicates based on Name column)
So, I tried the below
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["B3", "A1", "D4", "C90", "E90"], ordered=True)
df = df.sort_values(['Status'])
df_cleaned = df.drop_duplicates(['Status'],keep='last')
But this results in incorrect output.
I expect my output to be like as below (one row for each Name and their latest/last Status value)
Name Status
Tim E1
Jack C90
>Solution :
Add existing categories to list and sorting with remove duplicates by Name column:
df["Status"] = pd.Categorical(df["Status"],
categories=["B3", "A1", "D4", "C90", "E90","E1"],
ordered=True)
df_cleaned = (df.sort_values(['Status'])
.drop_duplicates(['Name'],keep='last')
print (df_cleaned)
Name Status
6 Jack C90
1 Tim E1
If possible some values are not in list of categories also remove missing values:
df_cleaned = (df.dropna(subset=['Status'])
.sort_values(['Status'])
.drop_duplicates(['Name'],keep='last')