How to identify groups based on two columns and ignore duplicates?

December 6, 2023

My dataframe is this :

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
 'col2': ['a1', 'a1', 'a2', 'a2', 'b1', 'c1', 'c2']})

print(df)

  col1 col2
0    A   a1
1    A   a1
2    A   a2
3    A   a2
4    B   b1
5    C   c1
6    C   c2

I want to add a column that will tell the rank of the column col2 in each value of col1.

I tried three codes but none of them gives the correct result :

df['col3_1'] = df.groupby(['col1', 'col2']).cumcount() + 1
df['col3_2'] = df.groupby(['col1'])['col2'].cumcount() + 1
df['col3_3'] = df.groupby(['col1', 'col2']).ngroup() + 1

print(df)

  col1 col2  col3_1  col3_2  col3_3
0    A   a1       1       1       1
1    A   a1       2       2       1
2    A   a2       1       3       2
3    A   a2       2       4       2
4    B   b1       1       1       3
5    C   c1       1       1       4
6    C   c2       1       2       5

PS : the real values of the column col2 doesn’t have a pattern letter-number.

My expected output is this :

  col1 col2  col3
0    A   a1     1
1    A   a1     1
2    A   a2     2
3    A   a2     2
4    B   b1     1
5    C   c1     1
6    C   c2     2

Can you guys show me how to do it ?

>Solution :

I think this does what you want:

df['col3'] = df.groupby(['col1'])['col2'].transform(lambda x: pd.factorize(x)[0] + 1)

pd.factorize assigns an integer to each unique value, it returns a tuple of the integer codes and the values. We add 1 because it starts indexing from 0.