How to fix the missing order of ids in a pandas dataset?

I am trying to fix an issue with this dataset.
The link is here.
I loaded the dataset this way:

df = pd.read_csv('ratings.csv', sep='::', engine='python',  # multi-char sep requires the python engine
                 names=['user_id', 'movie_id', 'rating', 'timestamp'])
num_of_unique_users = len(df['user_id'].unique())

The number of unique users is 69878, but if we print the last rows of the dataset, we can see user ids above 69878: some user ids are missing from the range. The same applies to movie_id: the largest movie id exceeds the actual number of unique movie ids.

I want to remap the ids onto a contiguous range, so that the user ids do not exceed 69878. For example, the largest user id, 75167, would become 69878 (the number of unique user ids), and the movie id 65133 would become 10677 (the number of unique movie ids).

Actual

          user_id  movie_id  rating  timestamp
0               1       122     5.0  838985046
1               1       185     5.0  838983525
2               1       231     5.0  838983392
3               1       292     5.0  838983421
4               1       316     5.0  838983392
...           ...       ...     ...        ...
10000044    71567      1984     1.0  912580553
10000045    71567      1985     1.0  912580553
10000046    71567      1986     1.0  912580553
10000047    71567      2012     3.0  912580722
10000048    71567      2028     5.0  912580344

Desired


          user_id  movie_id  rating  timestamp
0               1       122     5.0  838985046
1               1       185     5.0  838983525
2               1       231     5.0  838983392
3               1       292     5.0  838983421
4               1       316     5.0  838983392
...           ...       ...     ...        ...
10000044    69878      1984     1.0  912580553
10000045    69878      1985     1.0  912580553
10000046    69878      1986     1.0  912580553
10000047    69878      2012     3.0  912580722
10000048    69878      2028     5.0  912580344

Is there any way to do this with pandas?

Solution:

Here’s a way to do this:

df2 = df.groupby('user_id').count().reset_index()
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')
df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})

df2 = df.groupby('movie_id').count().reset_index()
df2 = df2.assign(new_movie_id=df2.index + 1).set_index('movie_id')
df = df.join(df2['new_movie_id'], on='movie_id').drop(columns=['movie_id']).rename(columns={'new_movie_id':'movie_id'})

df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)

Sample input:

   user_id  movie_id  rating  timestamp
0        1         2     5.0  838985046
1        1         4     5.0  838983525
2        3         4     5.0  838983392
3        3         6     5.0  912580553
4        5         2     5.0  912580722
5        5         6     5.0  912580344

Sample output:

   user_id  movie_id  rating  timestamp
0        1         1     5.0  838985046
1        1         2     5.0  838983525
2        2         2     5.0  838983392
3        2         3     5.0  912580553
4        3         1     5.0  912580722
5        3         3     5.0  912580344

Here are intermediate results and explanations.

First we do this:

df2 = df.groupby('user_id').count().reset_index()

Output:

   user_id  movie_id  rating  timestamp
0        1         2       2          2
1        3         2       2          2
2        5         2       2          2

What we have done above is to use groupby to get one row per unique user_id. We call count just to convert the output (a groupby object) back to a dataframe. We call reset_index to create a new integer range index with no gaps. (NOTE: the only column we care about for future use is user_id.)
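As a sanity check (this is a side sketch on the sample input above, not part of the answer's pipeline), the user_id column of that intermediate result is simply the sorted unique ids, so it can be cross-checked with drop_duplicates:

```python
import pandas as pd

# Sample input from above
df = pd.DataFrame({
    'user_id':   [1, 1, 3, 3, 5, 5],
    'movie_id':  [2, 4, 4, 6, 2, 6],
    'rating':    [5.0] * 6,
    'timestamp': [838985046, 838983525, 838983392, 912580553, 912580722, 912580344],
})

# groupby sorts by the grouping key by default, so after reset_index the
# user_id column holds the sorted unique ids with a gap-free 0..n-1 index
df2 = df.groupby('user_id').count().reset_index()

# Equivalent way to get just that column, without the throwaway counts
unique_ids = df['user_id'].drop_duplicates().sort_values().reset_index(drop=True)

print(df2['user_id'].tolist())   # the sorted unique ids
print(unique_ids.tolist())       # same values
```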

Next we do this:

df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')

Output:

         movie_id  rating  timestamp  new_user_id
user_id
1               2       2          2            1
3               2       2          2            2
5               2       2          2            3

The assign call creates a new column named new_user_id, filled with the 0-based index plus 1 (so that we will not have id values < 1). The set_index call replaces our index with user_id in anticipation of using the index of this dataframe as the target for a later call to join.
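To make that concrete: at this point df2['new_user_id'], indexed by the old user_id, is nothing more than an old-to-new lookup table. A minimal sketch on the sample input:

```python
import pandas as pd

# Sample input (only the id columns matter for the lookup)
df = pd.DataFrame({
    'user_id':  [1, 1, 3, 3, 5, 5],
    'movie_id': [2, 4, 4, 6, 2, 6],
})

df2 = df.groupby('user_id').count().reset_index()
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')

# new_user_id maps each old id to its gap-free replacement
print(df2['new_user_id'].to_dict())   # {1: 1, 3: 2, 5: 3}
```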

The next step is:

df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})

Output:

   movie_id  rating  timestamp  user_id
0         2     5.0  838985046        1
1         4     5.0  838983525        1
2         4     5.0  838983392        2
3         6     5.0  912580553        2
4         2     5.0  912580722        3
5         6     5.0  912580344        3

Here we have taken just the new_user_id column of df2 and called join on the df object, directing the method to use the user_id column (the on argument) in df to join with the index (which was originally the user_id column in df2). This creates a df with the desired new-paradigm user_id values in the column named new_user_id. All that remains is to drop the old-paradigm user_id column and rename new_user_id to be user_id, which is what the calls to drop and rename do.
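For comparison (this is an alternative sketch, not the answer's method), the same join-drop-rename step can be expressed with Series.map and an explicit lookup, which avoids touching the other columns:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id':  [1, 1, 3, 3, 5, 5],
    'movie_id': [2, 4, 4, 6, 2, 6],
})

# Lookup from old user_id to new 1-based contiguous id, in sorted-id order
lookup = pd.Series(
    range(1, df['user_id'].nunique() + 1),
    index=sorted(df['user_id'].unique()),
)

# map replaces each old id with its entry in the lookup, in place
df['user_id'] = df['user_id'].map(lookup)

print(df['user_id'].tolist())   # [1, 1, 2, 2, 3, 3]
```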

The logic for changing the movie_id values to the new paradigm (i.e., eliminating gaps in the unique value set) is completely analogous. When we’re done, we have this output:

   rating  timestamp  user_id  movie_id
0     5.0  838985046        1         1
1     5.0  838983525        1         2
2     5.0  838983392        2         2
3     5.0  912580553        2         3
4     5.0  912580722        3         1
5     5.0  912580344        3         3

To finish up, we reorder the columns to look like the original using this code:

df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)

Output:

   user_id  movie_id  rating  timestamp
0        1         1     5.0  838985046
1        1         2     5.0  838983525
2        2         2     5.0  838983392
3        2         3     5.0  912580553
4        3         1     5.0  912580722
5        3         3     5.0  912580344
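As a final side note (separate from the answer above), pandas also has a built-in for exactly this kind of re-coding: pd.factorize returns 0-based integer codes for the unique values of a column, and with sort=True the codes follow sorted id order, matching the result above. A compact alternative sketch on the sample input:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id':   [1, 1, 3, 3, 5, 5],
    'movie_id':  [2, 4, 4, 6, 2, 6],
    'rating':    [5.0] * 6,
    'timestamp': [838985046, 838983525, 838983392, 912580553, 912580722, 912580344],
})

# factorize returns (codes, uniques); codes are 0-based, so add 1 to start at 1
df['user_id'] = pd.factorize(df['user_id'], sort=True)[0] + 1
df['movie_id'] = pd.factorize(df['movie_id'], sort=True)[0] + 1

print(df['user_id'].tolist())    # [1, 1, 2, 2, 3, 3]
print(df['movie_id'].tolist())   # [1, 2, 2, 3, 1, 3]
```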