Drop duplicates, but sum values into one

I have a dataframe that contains values for corn production values (maiz_a) across the U.S. The data is organized by latitude and longitude (y and x), but also by state and county. Many of the specific latitudes and longitudes fall within the same county, which gives many duplicate entries for the same state and county, but with different values. Here is an example:

         x          y           maiz_a  maiz_pct    name_adm1   name_adm2
62748   -89.875000  41.625000   49642.6 0.000155    Illinois    Whiteside
29499   -98.041667  40.791667   49621.1 0.000309    Nebraska    Hamilton
28101   -98.208333  40.791667   48866.9 0.000462    Nebraska    Hamilton
24910   -98.041667  40.958333   48741.3 0.000614    Nebraska    Hamilton
28615   -97.958333  40.958333   48538.8 0.000765    Nebraska    Hamilton
24697   -97.875000  40.791667   48536.3 0.000916    Nebraska    Hamilton
64127   -89.791667  41.625000   48448.2 0.001067    Illinois    Whiteside

Notice how there are many name_adm1 and name_adm2 of the combination "Nebraska" and "Hamilton". I would like to drop the duplicates, but sum all of said duplicates into one line based on maiz_a and maiz_pct. Additionally, the newly formatted dataframe should only contain the first instance of x and y and obviously should retain name_adm1 and name_adm2.

>Solution :

use groupby and agg
This allows you to specify how you want to aggregate each specific column.

df.groupby(['name_adm1', 'name_adm2'], as_index=False).agg(
    {'x': 'first', 'y': 'first', 'maiz_a': 'sum', 'maiz_pct': 'sum'}

  name_adm1  name_adm2          x          y    maiz_a  maiz_pct
0  Illinois  Whiteside -89.875000  41.625000   98090.8  0.001222
1  Nebraska   Hamilton -98.041667  40.791667  244304.4  0.003066

Leave a Reply