I have a dataset, df, that looks like this:
| Date | Code | City | State | Population | Quantity | QTDPERCAPITA |
|---|---|---|---|---|---|---|
| 2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 |
| 2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 |
| 2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 |
| 2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 |
| 2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 |
| 2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 |
| … | … | … | … | … | … | … |
| 2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 |
| 2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 |
WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).
The quantities change according to every month, and so does the population.
I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:
| Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
|---|---|---|---|---|---|---|---|
| 2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
| 2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
| 2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
| 2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
| 2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
| 2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
| … | … | … | … | … | … | … | … |
| 2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 | 0 |
| 2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 1 |
How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.
>Solution :
You can use:
# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)
# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)
# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')
Output:
| Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
|---|---|---|---|---|---|---|---|
| 2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
| 2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
| 2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
| 2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
| 2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
| 2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
| 2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.618 | 1 |
| 2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 0 |