My source data (simplified, contains more columns in reality):
eig_df = pandas.DataFrame({
'KG-EZ': [50101,50101,50101,50101,50103],
'EZ': [17,18,19,19,1],
'Name': ['Name1','Name2','Name3','Name4','Name5']
})
eig_df.set_index(['KG-EZ', 'EZ'], inplace=True)
gst_df = pandas.DataFrame({
'KG-EZ': [50101,50101,50101,50101,50101,50103,50103],
'EZ': [17,17,18,19,19,1,1],
'GST-NR': ['GST1','GST2','GST3','GST4','GST5','GST6','GST7']
})
gst_df.set_index(['KG-EZ', 'EZ'], inplace=True)
I have 2 DataFrames, the first represents people, the second represents plots they own.
The data is provided by a municipality, I have no control over it.
KG-EZ
and KG
together are the columns for joining, but are not unique in either DataFrame.
I want to merge them so that the first DataFrame (of people) gets an additional column GST-NR
that contains all matching values from the second DataFrame, comma-separated.
So for the test data, this should be the result:
result_df = pandas.DataFrame({
'KG-EZ': [50101,50101,50101,50101,50103],
'EZ': [17,18,19,19,1],
'Name': ['Name1','Name2','Name3','Name4','Name5'],
'GST-NR': ['GST1,GST2','GST3','GST4,GST5','GST4,GST5','GST6,GST7']
})
result_df.set_index(['KG-EZ', 'EZ'], inplace=True)
I have tried merge()
, but that gives me multiple rows per person, and I tried to generate a new Series using apply()
, but I can’t find the right lambda that will filter the other DataFrame by the current index and give me the comma-separated values.
Please help!
>Solution :
Use DataFrame.join
with Series
created by aggregate join
:
result_df = eig_df.join(gst_df.groupby(['KG-EZ', 'EZ'])['GST-NR'].agg(','.join))
print (result_df)
Name GST-NR
KG-EZ EZ
50101 17 Name1 GST1,GST2
18 Name2 GST3
19 Name3 GST4,GST5
19 Name4 GST4,GST5
50103 1 Name5 GST6,GST7