Proper way to do this in pandas without using for loop

I would like to avoid iterrows here.

From my dataframe I want to create a new column "unique". Each distinct pair of values in columns "a" and "b" should get a label "uniqueN", and every other row with exactly the same "a" and "b" values should get that same "uniqueN" label.

In this case

  • "1", "3" (the first row) from "a" and "b" is the first unique pair, so I give it the value "unique1"; the seventh row is also "1", "3", so it gets the same value "unique1".

  • "2", "2" (the second row) is the next unique "a", "b" pair, so I give it "unique2"; the eighth row is also "2", "2", so it gets "unique2" as well.

  • "3", "1" (the third row) is the next unique pair, so it gets "unique3"; no other row in the df is "3", "1", so that value won't repeat.

  • and so on

I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?

Expected output (my code works, but it's not using pandas methods):

   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2

Code

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

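# Assign a label to each (a, b) pair the first time it is seen
c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1

# Write each label to every row that matches its (a, b) pair
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value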

Solution:

Let’s use groupby with ngroup and sort=False so that groups are numbered in order of appearance, add 1 so the group numbers start at one, then convert to string with astype so the prefix "unique" can be prepended to the number:

df['unique'] = 'unique' + \
               df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
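
To see what each step in the chain contributes, the expression can be split into intermediate results. This is only a sketch on the same sample data; the variable names in_order, sorted_ids and labels are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# ngroup numbers each row's group starting from 0; with sort=False the
# groups are numbered in order of first appearance.
in_order = df.groupby(['a', 'b'], sort=False).ngroup()
print(in_order.tolist())    # [0, 1, 2, 3, 4, 3, 0, 1]

# With the default sort=True the groups are numbered by sorted key instead,
# so (3, 3) gets a lower number than (4, 2) even though it appears later.
sorted_ids = df.groupby(['a', 'b']).ngroup()
print(sorted_ids.tolist())  # [0, 1, 2, 4, 3, 4, 0, 1]

# Shift to 1-based numbering, convert to strings, and prepend the prefix.
labels = 'unique' + in_order.add(1).astype(str)
print(labels.tolist())
```

The difference between the two lists at rows 3 to 5 is why sort=False matters when the labels should follow order of appearance.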

Or with map and format instead of converting and concatenating:

df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
        .add(1)
        .map('unique{}'.format)
)

df:

   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
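
If the same labelling is needed for other column sets or prefixes, the approach could be wrapped in a small helper. The function name label_groups and its parameters are hypothetical, not a pandas API; the sketch also compares the result against a dictionary-based loop similar to the one in the question.

```python
import pandas as pd


def label_groups(frame, cols, prefix='unique'):
    """Label each distinct combination of `cols` in order of first appearance.

    Hypothetical helper for illustration; not part of pandas.
    """
    ids = frame.groupby(cols, sort=False).ngroup().add(1)
    return ids.map((prefix + '{}').format)


df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
df['unique'] = label_groups(df, ['a', 'b'])

# Reference result built with a plain Python loop, for comparison only
seen = {}
expected = []
for pair in zip(df['a'], df['b']):
    if pair not in seen:
        seen[pair] = 'unique' + str(len(seen) + 1)
    expected.append(seen[pair])

assert df['unique'].tolist() == expected
```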

Setup:

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})