Pandas: How to self-join cross-functionally

I have this dataframe below for airline flights:

df = pd.DataFrame({
    
    'ORIGIN_x':["LGA","ORD"],
    'DESTINATION_x':["ORD","LGA"],
    'number_of_flights_x':[3576,3580],
    'ORIGIN_y':["ORD","LGA"],
    'DESTINATION_y':["LGA","ORD"],
    'number_of_flights_y':[3580,3576]
    
})

I am trying to get below output:

df_final = pd.DataFrame({
    
    'ORIGIN_x':["LGA"],
    'DESTINATION_x':["ORD"],
    'ORIGIN_y':["ORD"],
    'DESTINATION_y':["LGA"],
    'number_of_flights':[7156], #sum of 3576+3580
    
})

So two sample rows should match and I should see only one row because ORIGIN_x=DESTINATION_y & ORIGIN_y=DESTINATION_x. And I need to sum the number of flights(i.e. sum of number_of_flights_x column)

PS: Sorry if the title of my question is not clear

I tried to do a self-inner join but it is still multiplying the rows.

Thank you!

>Solution :

Given your format with duplicated information, I would drop the _y and groupby.agg:

import numpy as np

# only keep _x
tmp = df.filter(like='_x').copy()

# sort ORIGIN/DESTINATION always in the same order
tmp[['ORIGIN_x', 'DESTINATION_x']] = np.sort(tmp[['ORIGIN_x', 'DESTINATION_x']])

# aggregate
out = tmp.groupby(['ORIGIN_x', 'DESTINATION_x'], as_index=False).sum()

print(out)

Output:

  ORIGIN_x DESTINATION_x  number_of_flights_x
0      LGA           ORD                 7156

Then, if needed, you can recreate the _y columns by copying the existing ones (swapping ORIGIN and DESTINATION), but IMO this is not necessary.

Leave a Reply