I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": each distinct pair of values in columns "a" and "b" gets a label "uniqueN", and every other row with exactly the same "a" and "b" values gets that same "uniqueN" label.
In this case:
- "1", "3" (the first row) from "a" and "b" is the first unique pair, so I give it the value "unique1"; the seventh row is also "1", "3", so it gets "unique1" as well.
- "2", "2" (the second row) is the next unique "a", "b" pair, so I give it "unique2"; the eighth row is also "2", "2", so it gets "unique2" too.
- "3", "1" (the third row) is the next unique pair, so it gets "unique3"; no other row in the df is "3", "1", so that value won't repeat.
- and so on
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected Output (my code works, but it's not using pandas methods):
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
c = 1
seen = {}
# First pass: assign a label to each (a, b) pair the first time it appears
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
# Second pass: write each label to every row that matches its pair
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Solution:
Let's use groupby with sort=False followed by ngroup, so that groups are numbered in order of first appearance rather than in sorted key order. ngroup counts from zero, so add 1 to start at one, then convert to string with astype so the prefix "unique" can be concatenated onto the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
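To see why sort=False matters here, the following minimal sketch compares the raw group numbers with and without it on the question's data. The variable names sorted_ids and appearance_ids are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2],
    'b': [3, 2, 1, 2, 3, 2, 3, 2],
})

# Default (sort=True): groups are numbered in sorted key order,
# so (3, 3) is numbered before (4, 2) even though it appears later.
sorted_ids = df.groupby(['a', 'b']).ngroup()

# sort=False: groups are numbered in order of first appearance.
appearance_ids = df.groupby(['a', 'b'], sort=False).ngroup()

print(sorted_ids.tolist())      # [0, 1, 2, 4, 3, 4, 0, 1]
print(appearance_ids.tolist())  # [0, 1, 2, 3, 4, 3, 0, 1]
```

With the default sorting, rows 3 and 4 would end up labelled "unique5" and "unique4", which does not match the expected output.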
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
      .add(1)
      .map('unique{}'.format)
)
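The map variant works because 'unique{}'.format is an ordinary callable that map applies to each element. A small sketch on a stand-alone Series (the ids values are made up for illustration) suggests that both variants produce the same strings:

```python
import pandas as pd

# Hypothetical zero-based group numbers, shifted to start at one
ids = pd.Series([0, 1, 2, 1]).add(1)

# map calls 'unique{}'.format(value) for every element
via_map = ids.map('unique{}'.format)

# Equivalent: convert to string first, then concatenate the prefix
via_concat = 'unique' + ids.astype(str)

print(via_map.tolist())  # ['unique1', 'unique2', 'unique3', 'unique2']
print(via_map.equals(via_concat))  # True
```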
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2],
    'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
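As a quick sanity check, the sketch below rebuilds the labels with a plain Python loop (a simplified stand-in for the question's iterrows code, not the asker's exact implementation) and compares them with the groupby result. The names loop_labels and groupby_labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2],
    'b': [3, 2, 1, 2, 3, 2, 3, 2],
})

# Loop-based labels: number each (a, b) pair the first time it is seen
pairs = list(df[['a', 'b']].itertuples(index=False, name=None))
seen = {}
for pair in pairs:
    if pair not in seen:
        seen[pair] = 'unique' + str(len(seen) + 1)
loop_labels = [seen[pair] for pair in pairs]

# Vectorised labels from the groupby/ngroup approach
groupby_labels = (
    'unique'
    + df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
).tolist()

print(loop_labels == groupby_labels)  # True
```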