Merge 2 DataFrames so that DataFrame 1 contains a column with the joined values from a column in DataFrame 2

My source data (simplified, contains more columns in reality):

eig_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50103],
    'EZ': [17,18,19,19,1],
    'Name': ['Name1','Name2','Name3','Name4','Name5']
eig_df.set_index(['KG-EZ', 'EZ'], inplace=True)

gst_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50101,50103,50103],
    'EZ': [17,17,18,19,19,1,1],
    'GST-NR': ['GST1','GST2','GST3','GST4','GST5','GST6','GST7']
gst_df.set_index(['KG-EZ', 'EZ'], inplace=True)

I have 2 DataFrames, the first represents people, the second represents plots they own.
The data is provided by a municipality, I have no control over it.
KG-EZ and KG together are the columns for joining, but are not unique in either DataFrame.

I want to merge them so that the first DataFrame (of people) gets an additional column GST-NR that contains all matching values from the second DataFrame, comma-separated.
So for the test data, this should be the result:

result_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50103],
    'EZ': [17,18,19,19,1],
    'Name': ['Name1','Name2','Name3','Name4','Name5'],
    'GST-NR': ['GST1,GST2','GST3','GST4,GST5','GST4,GST5','GST6,GST7']
result_df.set_index(['KG-EZ', 'EZ'], inplace=True)

I have tried merge(), but that gives me multiple rows per person, and I tried to generate a new Series using apply(), but I can’t find the right lambda that will filter the other DataFrame by the current index and give me the comma-separated values.

Please help!

>Solution :

Use DataFrame.join with Series created by aggregate join:

result_df = eig_df.join(gst_df.groupby(['KG-EZ', 'EZ'])['GST-NR'].agg(','.join))
print (result_df)
           Name     GST-NR
KG-EZ EZ                  
50101 17  Name1  GST1,GST2
      18  Name2       GST3
      19  Name3  GST4,GST5
      19  Name4  GST4,GST5
50103 1   Name5  GST6,GST7

Leave a Reply