Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Most efficient way to split up a dataframe into smaller dataframes

I am writing a python program that will parse a large dataframe (tens of thousands of lines) into smaller dataframes based on a column value, and it needs to be fairly efficient, because the user can change the ways they break up the dataframe, and I would like the output to update dynamically.

Example input:

id Column_1 Column_2
1 Oct 10000$
1 Dec 9000$
2 Oct 3400$
3 Dec 20000$
2 Nov 9000$
1 Nov 15000$

Example Output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

id Column_1 Column_2
1 Oct 10000$
1 Nov 15000$
1 Dec 9000$
id Column_1 Column_2
2 Oct 3400$
2 Nov 9000$
id Column_1 Column_2
3 Dec 20000$

The naïve way, in my mind, is to do something like this:

for id in list(df['id'].unique()):
    filtered_df = df[df['id'] == id]  

But I believe this would be looping over the same data more times than is necessary, which is inefficient. Is there a fast way of doing this?


Update

Did a little software drag racing. Here are the results:

%%timeit
[df.loc[df.id.eq(i)] for i in df.id.unique()]

9.96 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
dflist=[]
dflist2=[]
for  k,v  in df.groupby(['id']):
    var='id'+str(k)
    dflist.append(var)
    globals()[var] = v
    dflist2.append(v)

1.28 ms ± 92.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
d = {id:df[df.id==id] for id in df.id.unique()}

9.19 ms ± 885 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Does anyone know why the second solution would be so much faster?

>Solution :

here is one way to do it

# create lists to hold name and the trimmed data from df
dflist=[]
dflist2=[]
for  k,v  in df.groupby(['id']):
    var='id'+str(k)

# append dataframe name in a list
    dflist.append(var)

# create a DF with a variable name
    globals()[var] = v

# append dataframe in a list
    dflist2.append(v)
# to list all the dataframe created 
>>> dflist

['id1', 'id2', 'id3']
# accessing a dataframe
>>> id1


    id  Column_1    Column_2
0   1   Oct     10000$
1   1   Dec     9000$
5   1   Nov     15000$
>> dflist2

[   id Column_1 Column_2
 0   1     Oct    10000$
 1   1     Dec     9000$
 5   1     Nov    15000$,
    id Column_1 Column_2
 2   2     Oct     3400$
 4   2     Nov     9000$,
    id Column_1 Column_2
 3   3     Dec    20000$]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading