Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

In a Pandas Id column, how can I insert an incremented value (max + 1) for each missing Id value

In a large data set (300K rows) with an Id column (primary key), I append new rows of data. New rows do not have an Id, which I am having difficulty adding. Each new Id should be an incremented value for each row – adding 1 to the maximum value in the Id column.

The data looks something like this:

import pandas as pd
# example data frame
inp = [{'Id': 0, 'Col1': 1, 'Col2': 7},
   {'Id': 1, 'Col1': 1, 'Col2': 8},
   {'Id': 2, 'Col1': 3, 'Col2': 9},
   {'Id': '', 'Col1': 1, 'Col2': 10}, 
   {'Id': 4, 'Col1': 5, 'Col2': 11},
   {'Id': '', 'Col1': 1, 'Col2': 12}
   ]
df = pd.DataFrame(inp)
# format to be like my real data
df["Id"] = pd.to_numeric(df["Id"], errors='coerce')
df["Id"] = df["Id"].astype("Int64")

print(df)
      Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3  <NA>     1    10
4     4     5    11
5  <NA>     1    12

Needed:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

      Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3     5     1    10
4     4     5    11
5     6     1    12

Fails:

df["Id"] = np.select([df["Id"].isna()], [df["Id"].max() + 1], default=df["Id"])
print(df
          Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3     5     1    10
4     4     5    11
5     5     1    12

df["Id"] = df.apply(lambda x: df["Id"].max() + 1 if ~isinstance(x["Id"], int) else x["Id"], axis=1)
   Id  Col1  Col2
0   5     1     7
1   5     1     8
2   5     3     9
3   5     1    10
4   5     5    11
5   5     1    12

df.sort_values(by=["Id"], inplace=True)
df.set_index("Id", inplace=True)
      Col1  Col2
Id              
0        1     7
1        1     8
2        3     9
4        5    11
<NA>     1    10
<NA>     1    12

I’m not sure if I’m even on the right track (this seems like such an obvious thing to do, but I can’t find it described from the perspective I’m taking). Lambda and looping techniques were also very slow. Am I missing some simple function to do exactly this?!?

>Solution :

Try:

  • add max() to cumsum on isna
  • fillna with the above

Code:

df['Id'] = df['Id'].fillna(df['Id'].isna().cumsum()+df['Id'].max())

Output:

   Id  Col1  Col2
0   0     1     7
1   1     1     8
2   2     3     9
3   5     1    10
4   4     5    11
5   6     1    12
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading