Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Most efficient way to search over a DataFrame in Python

I have a DataFrame having these kind of data :

df = pd.DataFrame({
    'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
    'alias' : ['value'+str(i) for i in range(6)],
    'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})

print(df)

output :

  id   alias source
0  a  value0   src1
1  a  value1   src2
2  b  value2   src1
3  b  value3   src2
4  c  value4   src1
5  c  value5   src3

And i want to change the structure of the dataframe to get somthing like this :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

     src1    src2    src3
a  value0  value1    None
b  value2  value3    None
c  value4    None  value5

So to process this task i made this function :

def process_aliases(df):
    sources = set(df['source'])
    indexes = list(set(df['id']))

    cols = {source:[] for source in sources}

    for index in indexes:
        subdf = df[df['id']==index]
        for source in sources:
            alias = subdf[df['source'] == source]['alias']
            cols[source].append(alias.iloc[0].strip() if len(alias) == 1 else None)

    return pd.DataFrame(cols, index=indexes)

My problem is that my dataframe contains more than 1 000 000 lines. So this function take too much time to be processed. More than 1 hour of execution time using Google Colab.

I don’t know if my function is the best way to do this. So, i’m asking for help.

What is the best way to accomplish this task ?
Or, what is the fastest way to search for elements over a DataFrame ?

>Solution :

I believe you want to pivot:

import pandas as pd

df = pd.DataFrame({
    'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
    'alias' : ['value'+str(i) for i in range(6)],
    'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})

print(df.pivot(index='id', columns='source'))
         alias                
source    src1    src2    src3
id                            
a       value0  value1     NaN
b       value2  value3     NaN
c       value4     NaN  value5

If you are strictly requiring None instead of NaN for some reason, you can do:

import numpy as np 

print(df.pivot(index='id', columns='source').replace([np.nan], [None]))
         alias                
source    src1    src2    src3
id                            
a       value0  value1    None
b       value2  value3    None
c       value4    None  value5
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading