Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

SQL – Need to GROUP BY only one column of SELECT

This is the dataset I’m using:

country_or_area year comm_code commodity flow trade_usd weight_kg quantity_name quantity category
Belgium 2016 920510 Brass-wind instru… Export 571297 3966.0 Number of items 4135.0 92_musical_instru…
Guatemala 2008 660200 Walking-sticks, s… Export 35022 5575.0 Number of items 10089.0 66_umbrellas_walk…
Barbados 2006 220210 Beverage waters, … Re-Export 81058 44458.0 Volume in litres 24113.0 22_beverages_spir…
Tunisia 2016 780411 Lead foil of a th… Import 4658 121.0 Weight in kilograms 121.0 78_lead_and_artic…
Lithuania 1996 560110 Sanitary towels, … Export 76499 5419.0 Weight in kilograms 5419.0 56_wadding_felt_n…

This is the question I need to answer:

The most commercialized commodity (by occurrencies) in 2016, per flow type

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I need to group by flow only, but I don’t know how can I do it.

query = '''
        SELECT flow, commodity, MAX(quantity) quantity
        FROM (
          SELECT flow, commodity, COUNT(*) quantity
          FROM transactions
          WHERE year = 2016
          GROUP BY flow, commodity
        )
        GROUP BY flow
        '''
spark.sql(query).show(10)

The result I’m expecting is something like this:

[('Export', ('Sweet biscuits, waffles and wafers', 24)),
 ('Import', ('Baking powders, prepared', 27)),
 ('Re-Export', ('Glues or adhesives, prepared nes, package > 1kg', 8)),
 ('Re-Import', ('Footwear,sole rubber/plastic,upper textile, not sport', 5))]

>Solution :

Use window function
This will give you flow, commodity where quantity is maximum, and max of quantity over each distinct commdity

SELECT distinct 
flow, 
first_value(commodity) over(partition by flow order by quantity desc), 
MAX(quantity) over(partition by flow) as quantity_max
FROM (
  SELECT flow, commodity, COUNT(*) quantity
  FROM transactions
  WHERE year = 2016
  GROUP BY flow, commodity
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading