I have this dataframe below for airline flights:
df = pd.DataFrame({
'ORIGIN_x':["LGA","ORD"],
'DESTINATION_x':["ORD","LGA"],
'number_of_flights_x':[3576,3580],
'ORIGIN_y':["ORD","LGA"],
'DESTINATION_y':["LGA","ORD"],
'number_of_flights_y':[3580,3576]
})
I am trying to get below output:
df_final = pd.DataFrame({
'ORIGIN_x':["LGA"],
'DESTINATION_x':["ORD"],
'ORIGIN_y':["ORD"],
'DESTINATION_y':["LGA"],
'number_of_flights':[7156], #sum of 3576+3580
})
So two sample rows should match and I should see only one row because ORIGIN_x=DESTINATION_y & ORIGIN_y=DESTINATION_x. And I need to sum the number of flights(i.e. sum of number_of_flights_x column)
PS: Sorry if the title of my question is not clear
I tried to do a self-inner join but it is still multiplying the rows.
Thank you!
>Solution :
Given your format with duplicated information, I would drop the _y
and groupby.agg
:
import numpy as np
# only keep _x
tmp = df.filter(like='_x').copy()
# sort ORIGIN/DESTINATION always in the same order
tmp[['ORIGIN_x', 'DESTINATION_x']] = np.sort(tmp[['ORIGIN_x', 'DESTINATION_x']])
# aggregate
out = tmp.groupby(['ORIGIN_x', 'DESTINATION_x'], as_index=False).sum()
print(out)
Output:
ORIGIN_x DESTINATION_x number_of_flights_x
0 LGA ORD 7156
Then, if needed, you can recreate the _y
columns by copying the existing ones (swapping ORIGIN and DESTINATION), but IMO this is not necessary.