I Have a data frame with the column name and I need to create the column seq, which allows me identify the different times that a name appears in the data frame, it’s important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>Solution :
Create a boolean mask to know if the name has changed from the previous row. Then filter out the second, third, … names of a sequence before grouping by name. cumcount increment the sequence number and finally concatenate name and sequence number.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
.reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool