Vectorization of indexing into df rows

I’d like to vectorize my code and tried

df['results'] = coord.loc[df['a'],'x_coord'] * coord.loc[df['b'],'y_coord']

but it raises the error "ValueError: cannot reindex on an axis with duplicate labels" because df['a'] and df['b'] both contain duplicate values. These cannot be removed because they are the whole point (the df contains coordinates, so there are pairs like (1,0), (1,1), (0,1), etc.).

This version using apply works well but is too slow (the actual dfs have closer to a million rows and there are thousands of them to be processed):

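def calc(a, b):
    # Look up one x and one y value by label and multiply them
    result = coord.loc[a, 'x_coord'] * coord.loc[b, 'y_coord']
    return result

# Row-wise: calc runs once per row, which is what makes this slow
df['results'] = df.apply(lambda row: calc(row['a'], row['b']), axis=1)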

Any tips on how to fix the error or other approaches for vectorizing/speeding this bit up are welcome!

>Solution :

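This happens because pandas aligns Series by index label, both in arithmetic and when assigning a Series to a column. The Series returned by coord.loc[df['a'], 'x_coord'] is indexed by the values of df['a'], which repeat, so pandas cannot align it unambiguously with df's own index. Converting both lookups to NumPy arrays drops the index, so the multiplication and the assignment are done by position: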

df['results'] = coord.loc[df['a'],'x_coord'].to_numpy() * coord.loc[df['b'],'y_coord'].to_numpy()
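
To make this concrete, here is a minimal sketch on made-up data. The contents of coord and df are assumptions for illustration only (the question does not show them), and it assumes coord has a unique index. It compares the positional version with a plain row-by-row calculation. It also shows Series.map as an alternative lookup that keeps df's index. That variant is not part of the original answer.

```python
import pandas as pd

# Hypothetical lookup table (assumed data): one row per id, unique index
coord = pd.DataFrame(
    {"x_coord": [10.0, 20.0, 30.0], "y_coord": [1.0, 2.0, 3.0]},
    index=[0, 1, 2],
)

# Hypothetical pairs of ids (assumed data); ids repeat on purpose
df = pd.DataFrame({"a": [1, 1, 0, 2], "b": [0, 1, 1, 1]})

# Label-based lookup, then drop the index so the product is positional
x = coord.loc[df["a"], "x_coord"].to_numpy()
y = coord.loc[df["b"], "y_coord"].to_numpy()
df["results"] = x * y

# Reference: the same calculation done one row at a time
row_wise = [
    coord.at[a, "x_coord"] * coord.at[b, "y_coord"]
    for a, b in zip(df["a"], df["b"])
]

# Alternative (not from the answer above): map keeps df's index,
# so the two Series align without conversion to NumPy
via_map = df["a"].map(coord["x_coord"]) * df["b"].map(coord["y_coord"])
```

All three should agree on the same values. If coord itself had duplicate index labels, coord.loc would return more rows than df has, and the positional assignment would fail with a length mismatch, so it is worth checking coord.index.is_unique first.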