Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to check valid number in pandas dataframe column?

I am working on a project for university in which I received a dataframe:

      import pandas as pd

      df = pd.DataFrame({'id': [0,1,2,3,4,5,6,7,8],
                         'number': ['00005485', '9999', '11111', '458426', '9999999999',
                                    '11111111', '800000', '99999', '1111']})

      print(df)

      id    number
      0    00005485
      1    9999
      2    11111
      3    458426
      4    9999999999
      5    11111111
      6    800000
      7    99999
      8    1111

I would like to generate a dataframe with only the lines that have a valid ‘number’. For a ‘number’ to be considered valid, it cannot have more than four repeating digits that are the same.

I made a code using filter by ‘number’ column as follows:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

      df_result = df[(df['number'] != '9999') & (df['number'] != '99999') &
                     (df['number'] != '999999') & (df['number'] != '9999999') &
                     (df['number'] != '99999999') & (df['number'] != '999999999') &
                     (df['number'] != '9999999999') & (df['number'] != '9999999999') &
                     (df['number'] != '1111') & (df['number'] != '11111') &
                     (df['number'] != '111111') & (df['number'] != '1111111') &
                     (df['number'] != '111111111') & (df['number'] != '11111111')]

       print(df_result)

       id   number
       0    00005485
       3    458426
       6    800000
      

The output is correct. However, I would like to discover another way to do this operation on the dataframe.

>Solution :

You can use regular expression for this, in particular, \1, which matches the first group:

valid = df[~df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

So that basically says to select all rows that have the start of the string (^), a digit (\d), and then 3+ more ({3,}) of that same digit (\1), and then the end of the string ($).

Output:

>>> valid
     number
0  00005485
3    458426
6    800000

If you want to select invalid numbers instead of valid ones, remove the tilde ~ in the condition:

invalid = df[df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

Output:

>>> invalid
       number
1        9999
2       11111
4  9999999999
5    11111111
7       99999
8        1111
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading