Merge two dataframes based on new column(s) values

January 13, 2022

I have two Dataframes

df1 = pd.DataFrame(
    {
        "A": ["1", "3", "22", "43"],
        "B": ["6", "19", "4", "31"],
        "C": ["47", "15", "8", "19"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["65", "47", "6", "13"],
        "B": ["29", "5", "2", "21"],
        "C": ["69", "9", "11", "80"],
    },
    index=[4, 5, 6, 7],
)

By using pandas, the final result should be:

    A   B   C   Ti  ID
0   1   6   47  am  01
1   3   19  15  am  01
2   22  4   8   am  01
3   43  31  19  am  01
4   65  29  69  pm  01
5   47  5   9   pm  01
6   6   2   11  pm  01
7   13  21  80  pm  01

I went through Pandas Documentation, and I am trying to merge these two Dataframes by using pd.concat. The code is:

new_df = pd.concat([df1, df2], keys=['am', 'pm']).reset_index()

However, the new Dataframe came out with an extra column level_1 that I don’t want it to be there:

    level_0  level_1  A    B    C
0   am       0        1    6    47
1   am       1        3    19   15
2   am       2        22   4    8
3   am       3        43   31   19
4   pm       4        65   29   69
5   pm       5        47   5    9
6   pm       6        6    2    11
7   pm       7        13   21   80

I know reset_index() created that unwanted column. But why?

What else to do to get the same final Dataframe using pandas?

>Solution :

After you concatenate the DataFrames,

new_df = pd.concat([df1, df2], keys=['am', 'pm'])

new_df looks like

       A   B   C
am 0   1   6  47
   1   3  19  15
   2  22   4   8
   3  43  31  19
pm 4  65  29  69
   5  47   5   9
   6   6   2  11
   7  13  21  80

If we look at the index new_df.index, it’s a MultiIndex where the first level is the keys and second level is the old index:

MultiIndex([('am', 0),
            ('am', 1),
            ('am', 2),
            ('am', 3),
            ('pm', 4),
            ('pm', 5),
            ('pm', 6),
            ('pm', 7)],
           )

Then first we can rename the MultiIndex levels using rename_axis, and reset_index but only remove the first level from the index (which then becomes a column with its name). Note that by default, reset_index removes all levels from the index. That’s why you see level_0 and level_1 columns added after reset_index.

new_df = new_df.rename_axis(['Ti', None]).reset_index(level=0)

You can rearrange the columns by reassigning the DataFrame with a list of the columns with the desired order.

cols = new_df.columns.tolist()
new_df = new_df[cols[1:]+[cols[0]]]
new_df['ID'] = '01'

Output:

    A   B   C  Ti  ID
0   1   6  47  am  01
1   3  19  15  am  01
2  22   4   8  am  01
3  43  31  19  am  01
4  65  29  69  pm  01
5  47   5   9  pm  01
6   6   2  11  pm  01
7  13  21  80  pm  01