How do I find irregular values in columns in a data-frame that have a huge number of unique values?

Below is a sample of two columns in a data-frame containing data about user-reviews for various Google Play Store apps.

Last Updated       current Version
January 7, 2018    1.0.0
1.0.19             1.2.1
March 17, 2018     Varies with device

In these columns I want to find any anomalies or irregular values during data cleaning, such as ‘1.0.19’ in the ‘Last Updated’ column and ‘Varies with device’ in the ‘current Version’ column, as seen in the table above. However, these columns have 1378 and 2832 unique values respectively. How do I scan through these values and find the anomalies as quickly and efficiently as possible, without having to go through each unique value in the list?

>Solution :

You can try something like this:

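import pandas as pd

df = pd.read_csv('my_file.csv')

def time_search(x):
    # Try to parse the value as a date; report anything that cannot be parsed.
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):
        print("found strange value:", x)
        return pd.NaT  # missing-value marker for datetimes

df['Last Updated'] = df['Last Updated'].apply(time_search)

Output:

found strange value: 1.0.19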

After that, it should be easy to handle the flagged rows, for example by dropping the missing values.

For the version column, it is easy to check whether each value is valid:
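As a vectorized alternative to the row-by-row function, the same check can be sketched with `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising. This is only a sketch: it assumes the valid dates follow the "Month day, year" layout shown in the question, and the small data frame is made up for illustration.

```python
import pandas as pd

# Toy data mirroring the sample in the question (illustrative only).
df = pd.DataFrame({
    "Last Updated": ["January 7, 2018", "1.0.19", "March 17, 2018"],
})

# Parse with an explicit format; anything that does not match becomes NaT.
parsed = pd.to_datetime(df["Last Updated"], format="%B %d, %Y", errors="coerce")

# Values that failed to parse (and were not missing to begin with) are the anomalies.
bad_mask = parsed.isna() & df["Last Updated"].notna()
irregular_dates = df.loc[bad_mask, "Last Updated"].unique().tolist()
print(irregular_dates)

# Keep the parsed dates and drop the rows that could not be parsed.
cleaned = df.assign(**{"Last Updated": parsed}).dropna(subset=["Last Updated"])
```

Listing only the unique failing values keeps the output short even when the column has thousands of distinct entries.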

df["Current Ver"].str.match(r'^[0-9]+(?:\.[0-9]+)*$')
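To see which version strings fail such a check without reading through every unique value, the boolean mask can be inverted and applied to the distinct values only. A minimal sketch, assuming a valid version is a run of dot-separated integers such as `1.2.1`; the sample values are made up, and a real dataset may need a looser pattern.

```python
import pandas as pd

# Toy data (illustrative only); None stands in for a missing entry.
versions = pd.Series(
    ["1.0.0", "1.2.1", "Varies with device", "2.10", None, "1.2.1"],
    name="Current Ver",
)

# Look at each distinct non-missing value only once.
unique_versions = pd.Series(versions.dropna().unique())

# Valid = one or more digits, optionally followed by ".digits" groups.
is_valid = unique_versions.str.fullmatch(r"[0-9]+(?:\.[0-9]+)*")

# Whatever does not match the pattern is reported as irregular.
irregular_versions = unique_versions[~is_valid].tolist()
print(irregular_versions)
```

The pattern escapes the dots and must match the whole string, so a value like `1a0` is reported as irregular rather than accepted.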

I suggest exploring these ideas for the rest of the columns.
