Most efficient way to search over a DataFrame in Python

May 13, 2022

I have a DataFrame having these kind of data :

df = pd.DataFrame({
    'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
    'alias' : ['value'+str(i) for i in range(6)],
    'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})

print(df)

output :

  id   alias source
0  a  value0   src1
1  a  value1   src2
2  b  value2   src1
3  b  value3   src2
4  c  value4   src1
5  c  value5   src3

And i want to change the structure of the dataframe to get somthing like this :

     src1    src2    src3
a  value0  value1    None
b  value2  value3    None
c  value4    None  value5

So to process this task i made this function :

def process_aliases(df):
    sources = set(df['source'])
    indexes = list(set(df['id']))

    cols = {source:[] for source in sources}

    for index in indexes:
        subdf = df[df['id']==index]
        for source in sources:
            alias = subdf[df['source'] == source]['alias']
            cols[source].append(alias.iloc[0].strip() if len(alias) == 1 else None)

    return pd.DataFrame(cols, index=indexes)

My problem is that my dataframe contains more than 1 000 000 lines. So this function take too much time to be processed. More than 1 hour of execution time using Google Colab.

I don’t know if my function is the best way to do this. So, i’m asking for help.

What is the best way to accomplish this task ?
Or, what is the fastest way to search for elements over a DataFrame ?

>Solution :

I believe you want to pivot:

import pandas as pd

df = pd.DataFrame({
    'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
    'alias' : ['value'+str(i) for i in range(6)],
    'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})

print(df.pivot(index='id', columns='source'))

         alias                
source    src1    src2    src3
id                            
a       value0  value1     NaN
b       value2  value3     NaN
c       value4     NaN  value5

If you are strictly requiring None instead of NaN for some reason, you can do:

import numpy as np 

print(df.pivot(index='id', columns='source').replace([np.nan], [None]))

         alias                
source    src1    src2    src3
id                            
a       value0  value1    None
b       value2  value3    None
c       value4    None  value5