Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to create a ranking variable/function for different periods in a panel data?

I have a dataset, df, that looks like this:

Date Code City State Population Quantity QTDPERCAPITA
2020-01 11001 Los Angeles CA 5000000 100000 0.02
2020-02 11001 Los Angeles CA 5000000 125000 0.025
2020-03 11001 Los Angeles CA 5000000 135000 0.027
2020-01 12002 Houston TX 3000000 150000 0.05
2020-02 12002 Houston TX 3000000 100000 0.033
2020-03 12002 Houston TX 3000000 200000 0.066
2021-07 11001 Los Angeles CA 5500499 340000 0.062
2021-07 12002 Houston TX 3250012 211000 0.065

WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).

The quantities change according to every month, and so does the population.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
2021-07 11001 Los Angeles CA 5500499 340000 0.062 0
2021-07 12002 Houston TX 3250012 211000 0.065 1

How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.

>Solution :

You can use:

# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')

Output:

Date Code City State Population Quantity QTDPERCAPITA RANKING
2020-01 11001 Los Angeles CA 5000000 100000 0.02 0
2020-02 11001 Los Angeles CA 5000000 125000 0.025 0
2020-03 11001 Los Angeles CA 5000000 135000 0.027 0
2020-01 12002 Houston TX 3000000 150000 0.05 1
2020-02 12002 Houston TX 3000000 100000 0.033 1
2020-03 12002 Houston TX 3000000 200000 0.066 1
2021-07 11001 Los Angeles CA 5500499 340000 0.618 1
2021-07 12002 Houston TX 3250012 211000 0.065 0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading