Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Selectively Combine several dataframes based on conditions in each dataframe

I have three different time based dataframes with 10s of thousands of data points.

df1['time']  = 1, 2, 3, 4, 5
df1['data1'] = 1, 0, 0, 1, 0

df2['time']  = 1, 3, 5, 7, 9 
df2['data2'] = a, b, c, d, e

df3['time']  = 3, 4, 5, 6, 7
df3['data3'] = z, y, x, w, v

I want to combine these dataframes into 1 dataframe only where they have the same time point existing. In the above dataframes only time 3 and 5 exist concurrently across all three so merge those data points into the final dataframe.

df4['time']  = 3, 5
df4['data1'] = 0, 0
df4['data2'] = b, c
df4['data3'] = z, x

I’ve been trying to avoid iterating over the dataframes with if statements because of the numerous data points and the answer in How to iterate over rows in a DataFrame in Pandas from cs95 basically saying to avoid iterating if possible.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Am I stuck iterating through the dataframes or is there vectorization/list comprehension method I can follow?

>Solution :

You want to use pd.merge. I think of it like a SQL join, it works very similarly. In the example below I’m doing 2 merges (a.k.a. joins), first with df1 to df2 as an inner join on the time column, then that merged dataframe is merged with df3 using inner join on time again.

import pandas as pd

df1 = pd.DataFrame({'time': [1,2,3,4,5], 'data1': [1,0,0,1,0]})
df2 = pd.DataFrame({'time': [1,3,5,7,9], 'data2': ['a','b','c','d','e']})
df3 = pd.DataFrame({'time': [3,4,5,6,7], 'data3': ['z','y','x','w','v']})
df4 = df1.merge(df2, how='inner', on='time').merge(df3, how='inner', on='time')
df4
Out[211]: 
   time  data1 data2 data3
0     3      0     b     z
1     5      0     c     x

inner join only keeps values that exist in both dataframes. the on field indicates which column(s) need to match to join.

Notice that the final result will include data1, data2, and data3 from each of the columns, if you have columns that are the same name in the datasets and are not included in on then they will be renamed with _df1, _df2, _df3 respectively.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading