replace values in a category column based on the order they appear

I have a DataFrame with 1k records similar to the below one. Column B is a group column in sorted order.

import pandas as pd
df = pd.DataFrame([['red', 0], ['green', 0], ['blue', 16],['white', 58],['yellow', 59], ['purple', 71], ['violet', 82],['grey', 82]], columns=['A','B'])
df
      A     B
0   red     0
1   green   0
2   blue    16
3   white   58
4   yellow  59
5   purple  71
6   violet  82
7   grey    82

How could I update column B to get an output like the below one (to use the column as a category later)?

output_df = pd.DataFrame([['red', 'group1'], ['green', 'group1'], ['blue', 'group2'],['white', 'group3'],['yellow', 'group4'], ['purple', 'group5'], ['violet', 'group6'],['grey', 'group6']], columns=['A','B'])
output_df


          A     B
0   red     group1
1   green   group1
2   blue    group2
3   white   group3
4   yellow  group4
5   purple  group5
6   violet  group6
7   grey    group6

>Solution :

You can use .ne(), .shift(), and .cumsum() to get the group numbers, and then use string manipulation to get the desired result:

df["B"] = "group" + df["B"].ne(df["B"].shift(1)).cumsum().astype(str)

This outputs:

        A       B
0     red  group1
1   green  group1
2    blue  group2
3   white  group3
4  yellow  group4
5  purple  group5
6  violet  group6
7    grey  group6

Leave a Reply