This is the dataset I’m using:
| country_or_area | year | comm_code | commodity | flow | trade_usd | weight_kg | quantity_name | quantity | category |
|---|---|---|---|---|---|---|---|---|---|
| Belgium | 2016 | 920510 | Brass-wind instru… | Export | 571297 | 3966.0 | Number of items | 4135.0 | 92_musical_instru… |
| Guatemala | 2008 | 660200 | Walking-sticks, s… | Export | 35022 | 5575.0 | Number of items | 10089.0 | 66_umbrellas_walk… |
| Barbados | 2006 | 220210 | Beverage waters, … | Re-Export | 81058 | 44458.0 | Volume in litres | 24113.0 | 22_beverages_spir… |
| Tunisia | 2016 | 780411 | Lead foil of a th… | Import | 4658 | 121.0 | Weight in kilograms | 121.0 | 78_lead_and_artic… |
| Lithuania | 1996 | 560110 | Sanitary towels, … | Export | 76499 | 5419.0 | Weight in kilograms | 5419.0 | 56_wadding_felt_n… |
This is the question I need to answer:
The most commercialized commodity (by number of occurrences) in 2016, per flow type.
I need to group by flow only, but I don’t know how to do it.
query = '''
SELECT flow, commodity, MAX(quantity) AS quantity
FROM (
    SELECT flow, commodity, COUNT(*) AS quantity
    FROM transactions
    WHERE year = 2016
    GROUP BY flow, commodity
)
GROUP BY flow
'''
spark.sql(query).show(10)
The result I’m expecting is something like this:
[('Export', ('Sweet biscuits, waffles and wafers', 24)),
('Import', ('Baking powders, prepared', 27)),
('Re-Export', ('Glues or adhesives, prepared nes, package > 1kg', 8)),
('Re-Import', ('Footwear,sole rubber/plastic,upper textile, not sport', 5))]
> Solution:
Use window functions.
This gives you, for each flow, the commodity with the highest count (`first_value` over the flow partition, ordered by the count descending) and the maximum count itself (`MAX(quantity)` over the same partition):
SELECT DISTINCT
    flow,
    first_value(commodity) OVER (PARTITION BY flow ORDER BY quantity DESC) AS commodity,
    MAX(quantity) OVER (PARTITION BY flow) AS quantity_max
FROM (
    -- count occurrences per (flow, commodity) in 2016
    SELECT flow, commodity, COUNT(*) AS quantity
    FROM transactions
    WHERE year = 2016
    GROUP BY flow, commodity
)