Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Add suffixes to duplicate (column) cell values in Pandas depending on another column value (category)

I have a structure like this:

   Data_group Data Value
    Group_x    A     12
    Group_x    A     13
    Group_x    B     3
    Group_x    C     3
    Group_x    C     32
    Group_x    C     23
    Group_y    A     8
    Group_y    A     7
    Group_y    B     13
    Group_y    C     12 
    Group_y    C     13
    Group_y    C     66

I would like to manipulate the Data column in order to have this as the output:

   Data_group Data  Value
    Group_x    A[0]   12
    Group_x    A[1]   13
    Group_x    B      3
    Group_x    C[0]   3
    Group_x    C[1]   32
    Group_x    C[2]   23
    Group_y    A[0]   8
    Group_y    A[1]   7
    Group_y    B      13
    Group_y    C[0]   12 
    Group_y    C[1]   13
    Group_y    C[2]   66    

Note that for every Group the index is reset and when there is just one entry of that Data there is no suffix. The Value column is not involved in the restructure.
I have tried to exploit this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

mask = df['Data'].duplicated(keep=False)
df['Data'] += mask.cumcount().add(1).astype(str).radd('_').mask(df['Data'].transform('count')==1,'') 

It will number the Data duplicate but the suffix will increment regardless of the Data Group.
I started to do it with nested if and for but it appears cumbersome and not functional.
Is there a clean way to use Pandas method have that result?

>Solution :

You can use a single groupby with both Data_group/Data columns. Compute the cumcount as string (with brackets) and add only to groups that have more than one element (transform('size').gt(1)):

g = df.groupby(['Data_group', 'Data'])
df.loc[g['Data'].transform('size').gt(1),
       'Data'] += '['+g.cumcount().astype(str)+']'

output:

   Data_group  Data  Value
0     Group_x  A[0]     12
1     Group_x  A[1]     13
2     Group_x     B      3
3     Group_x  C[0]      3
4     Group_x  C[1]     32
5     Group_x  C[2]     23
6     Group_y  A[0]      8
7     Group_y  A[1]      7
8     Group_y     B     13
9     Group_y  C[0]     12
10    Group_y  C[1]     13
11    Group_y  C[2]     66

intermediates:

'['+g.cumcount().astype(str)+']'

0     [0]
1     [1]
2     [0]
3     [0]
4     [1]
5     [2]
6     [0]
7     [1]
8     [0]
9     [0]
10    [1]
11    [2]
dtype: object

g['Data'].transform('size').gt(1)

0      True
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11     True
Name: Data, dtype: bool
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading