Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Enforcing dataypes using pandas .astype not working as expected when followed by .replace

I have a dataframe with the following columns:

Index(['tiername', 'month', 'network', 'specnewsmarket', 'year', 'adjeng',
       'hev', 'subs', 'SN Market Grp', 'Region', 'State', 'CleanStnNm',
       'Quarter', 'StnGrp', 'StnGrpOrder', 'CleanStnNm_AllStns',
       '2021 YTD HEV', '2020 YTD HEV', 'yymm']

I’m trying to enforce these columns to be of a particular data type, as well as clean up some ‘false’ inputs before I do some manipulations. To do this, I read in a mapping from a CSV (pipe delimited) and then apply that mapping in a function like so:

import pandas as pd 
def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt',delimiter ='|',index_col=0)['0']
    df['month'] = df['month'].astype(float) #Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE',False,'False'],'False')
    return df

My mapping file can be found here.
Along with some sample data

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

What I’ve discovered is that the .replace call somehow messes with my datatypes (for my last column in particular). For example if I exclude the .replace it works as expected with df.dtypes resulting in: yymm object

But after the .replace call it somehow reverts it to: yymm int64

I could probably just hardcode the dtype in there, but if someone can explain to me why this is happening that would be great!

>Solution :

This is probably due to the fact that the .replace() method tries to infer the best data type for the column after the replacement operation. Since you’re replacing ‘FALSE’, False, and ‘False’ with ‘False’, pandas might infer that the column should be of integer data type if the remaining values in the column can be represented as integers.

To solve this issue, you can enforce the desired data types again after the .replace() call. Here’s an updated version of your force_columnDtypes() function:

import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    df = df.astype(dtypesMapping)  # Enforce data types again after replace
    return df

Let me know if it works or not because this could be not the only reason.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading