I am trying to scrape the top 100 cities in the US from the following website:
I have gotten the data from the website and converted into a list:
#Note this is not the full list because the full list is too long
List = ['City Population', '1 New York, NY 8,175,133 \xa0Biggest city is 2.6% of U.S. population', '2 Los Angeles, CA 3,792,621 \xa0Top 2 cities are 3.8% of U.S. population', '3 Chicago, IL 2,695,598 \xa0Top 3 cities are 4.7% of U.S. population]
I am trying to convert that into an organized dataframe.
I have already tried this:
df = pd.DataFrame(List)
print(df)
I wish this could have been that simple but it returns the following:
0 City Population
1 1 New York, NY 8,175,133 Biggest city i...
2 2 Los Angeles, CA 3,792,621 Top 2 citie...
3 3 Chicago, IL 2,695,598 Top 3 cities ar...
4 4 Houston, TX 2,099,451 Top 4 cities ar...
.. ...
97 97 Birmingham, AL 212,237
98 98 Rochester, NY 210,565
99 99 San Bernadino, CA 209,924
100 100 Spokane, WA 208,916 Top 100 cities a...
101 Total 59,849,899
[102 rows x 1 columns]
The problem is, its not actually ‘orginized’; I cant do print(df['City']). I want this:
0 RANK CITY POPULATION
1 1 New York, NY 8,175,133
2 2 Los Angeles, CA 3,792,621
3 3 Chicago, IL 2,695,598
4 4 Houston, TX 2,099,451
...............................
97 97 Birmingham, AL 212,237
98 98 Rochester, NY 210,565
99 99 San Bernadino, CA 209,924
100 100 Spokane, WA 208,916
[101 rows x 3 columns]
Can someone help me with this?
>Solution :
You can do this in a one-liner:
- Use
pd.read_htmlwith theheaderandindex_colparameters both set to0. The result will be a list ofdfs. In this case with only onedf, so we select the first element ([0]). - Use
df.reset_indexto reset the index, since the original index values will have been turned into floats on account of theNaN valuein the final row.
import pandas as pd
url = 'https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people'
df = pd.read_html(url, header=0, index_col=0)[0].reset_index(drop=True)
df.head()
City Population Unnamed: 3
0 New York, NY 8175133 Biggest city is 2.6% of U.S. population
1 Los Angeles, CA 3792621 Top 2 cities are 3.8% of U.S. population
2 Chicago, IL 2695598 Top 3 cities are 4.7% of U.S. population
3 Houston, TX 2099451 Top 4 cities are 5.4% of U.S. population
4 Philadelphia, PA 1526006 Top 5 cities are 5.9% of U.S. population
# use `df.rename` to change the name of the 3rd (nameless) column)
df = df.rename(columns={'Unnamed: 3': 'Comment'})