This is the dataset I’m using:
| country_or_area | year | comm_code | commodity | flow | trade_usd | weight_kg | quantity_name | quantity | category |
|---|---|---|---|---|---|---|---|---|---|
| Belgium | 2016 | 920510 | Brass-wind instru… | Export | 571297 | 3966.0 | Number of items | 4135.0 | 92_musical_instru… |
| Guatemala | 2008 | 660200 | Walking-sticks, s… | Export | 35022 | 5575.0 | Number of items | 10089.0 | 66_umbrellas_walk… |
| Barbados | 2006 | 220210 | Beverage waters, … | Re-Export | 81058 | 44458.0 | Volume in litres | 24113.0 | 22_beverages_spir… |
| Tunisia | 2016 | 780411 | Lead foil of a th… | Import | 4658 | 121.0 | Weight in kilograms | 121.0 | 78_lead_and_artic… |
| Lithuania | 1996 | 560110 | Sanitary towels, … | Export | 76499 | 5419.0 | Weight in kilograms | 5419.0 | 56_wadding_felt_n… |
This is the question I need to answer:
The most commercialized commodity (by number of occurrences) in 2016, per flow type.
I need to group by flow only, but I don’t know how to do it.
query = '''
SELECT flow, commodity, MAX(quantity) AS quantity
FROM (
    SELECT flow, commodity, COUNT(*) AS quantity
    FROM transactions
    WHERE year = 2016
    GROUP BY flow, commodity
)
GROUP BY flow
'''
spark.sql(query).show(10)
The result I’m expecting is something like this:
[('Export', ('Sweet biscuits, waffles and wafers', 24)),
('Import', ('Baking powders, prepared', 27)),
('Re-Export', ('Glues or adhesives, prepared nes, package > 1kg', 8)),
('Re-Import', ('Footwear,sole rubber/plastic,upper textile, not sport', 5))]
> Solution:
Use window functions.
This gives you, for each flow, the commodity with the highest count (`first_value` over the flow partition, ordered by the count descending) and the maximum count itself (`MAX(quantity)` over the same partition):
SELECT DISTINCT
    flow,
    first_value(commodity) OVER (PARTITION BY flow ORDER BY quantity DESC) AS commodity,
    MAX(quantity) OVER (PARTITION BY flow) AS quantity_max
FROM (
    -- count occurrences per (flow, commodity) in 2016
    SELECT flow, commodity, COUNT(*) AS quantity
    FROM transactions
    WHERE year = 2016
    GROUP BY flow, commodity
)