Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Merge 2 DataFrames so that DataFrame 1 contains a column with the joined values from a column in DataFrame 2

My source data (simplified, contains more columns in reality):

eig_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50103],
    'EZ': [17,18,19,19,1],
    'Name': ['Name1','Name2','Name3','Name4','Name5']
})
eig_df.set_index(['KG-EZ', 'EZ'], inplace=True)

gst_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50101,50103,50103],
    'EZ': [17,17,18,19,19,1,1],
    'GST-NR': ['GST1','GST2','GST3','GST4','GST5','GST6','GST7']
})
gst_df.set_index(['KG-EZ', 'EZ'], inplace=True)

I have 2 DataFrames, the first represents people, the second represents plots they own.
The data is provided by a municipality, I have no control over it.
KG-EZ and KG together are the columns for joining, but are not unique in either DataFrame.

I want to merge them so that the first DataFrame (of people) gets an additional column GST-NR that contains all matching values from the second DataFrame, comma-separated.
So for the test data, this should be the result:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

result_df = pandas.DataFrame({
    'KG-EZ': [50101,50101,50101,50101,50103],
    'EZ': [17,18,19,19,1],
    'Name': ['Name1','Name2','Name3','Name4','Name5'],
    'GST-NR': ['GST1,GST2','GST3','GST4,GST5','GST4,GST5','GST6,GST7']
})
result_df.set_index(['KG-EZ', 'EZ'], inplace=True)

I have tried merge(), but that gives me multiple rows per person, and I tried to generate a new Series using apply(), but I can’t find the right lambda that will filter the other DataFrame by the current index and give me the comma-separated values.

Please help!

>Solution :

Use DataFrame.join with Series created by aggregate join:

result_df = eig_df.join(gst_df.groupby(['KG-EZ', 'EZ'])['GST-NR'].agg(','.join))
print (result_df)
           Name     GST-NR
KG-EZ EZ                  
50101 17  Name1  GST1,GST2
      18  Name2       GST3
      19  Name3  GST4,GST5
      19  Name4  GST4,GST5
50103 1   Name5  GST6,GST7
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading