Say I have the following data frames (suppose they are Dask DataFrames):
df A =
1
1
2
2
2
2
3
4
5
5
5
5
5
5
df B =
1
2
2
3
3
3
4
5
5
5
and I would like to merge the two so that the resulting DataFrame keeps the most information from the two (for instance, for observation 1 I would like to preserve the info of df A, for observation 3 I would like to preserve the info of df B, and so on).
In other words the resulting DataFrame should be like this:
df C=
1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5
Is there a way to do that in Dask?
Thank you
>Solution :
It looks like the OP wants to use dask.dataframe.DataFrame.merge. Import dask.dataframe, then pick the kind of merge with the how parameter (here an outer merge on the sample_id column):
import dask.dataframe as dd
df_c = dd.merge(df_a, df_b, how='outer', on='sample_id')
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 3
11 3
12 3
13 4
14 5
15 5
16 5
17 5
18 5
19 5
20 5
21 5
22 5
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
Note:
- This thread has really valuable information on merges. Even though its focus is on
Pandas, it will allow one to understand left, right, outer, and inner merges.