Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Converting a List into a formated pandas Dataframe

I am trying to scrape the top 100 cities in the US from the following website:

**https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people
**

I have gotten the data from the website and converted into a list:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

#Note this is not the full list because the full list is too long
List = ['City   Population', '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population', '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population', '3   Chicago, IL   2,695,598   \xa0Top 3 cities are 4.7% of U.S. population]

I am trying to convert that into an organized dataframe.

I have already tried this:

df = pd.DataFrame(List)

print(df)

I wish this could have been that simple but it returns the following:

    0                                    City   Population
    1    1   New York, NY   8,175,133    Biggest city i...
    2    2   Los Angeles, CA   3,792,621    Top 2 citie...
    3    3   Chicago, IL   2,695,598    Top 3 cities ar...
    4    4   Houston, TX   2,099,451    Top 4 cities ar...
    ..                                                 ...
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916   Top 100 cities a...
    101                                 Total   59,849,899
[102 rows x 1 columns]

The problem is, its not actually ‘orginized’; I cant do print(df['City']). I want this:

    0   RANK CITY           POPULATION                    
    1    1   New York, NY   8,175,133    
    2    2   Los Angeles, CA   3,792,621    
    3    3   Chicago, IL   2,695,598    
    4    4   Houston, TX   2,099,451    
    ...............................                                    
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916  

[101 rows x 3 columns]

Can someone help me with this?

>Solution :

You can do this in a one-liner:

  • Use pd.read_html with the header and index_col parameters both set to 0. The result will be a list of dfs. In this case with only one df, so we select the first element ([0]).
  • Use df.reset_index to reset the index, since the original index values will have been turned into floats on account of the NaN value in the final row.
import pandas as pd

url = 'https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people'

df = pd.read_html(url, header=0, index_col=0)[0].reset_index(drop=True)

df.head()

               City  Population                                Unnamed: 3
0      New York, NY     8175133   Biggest city is 2.6% of U.S. population
1   Los Angeles, CA     3792621  Top 2 cities are 3.8% of U.S. population
2       Chicago, IL     2695598  Top 3 cities are 4.7% of U.S. population
3       Houston, TX     2099451  Top 4 cities are 5.4% of U.S. population
4  Philadelphia, PA     1526006  Top 5 cities are 5.9% of U.S. population

# use `df.rename` to change the name of the 3rd (nameless) column)
df = df.rename(columns={'Unnamed: 3': 'Comment'})
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading