I have a dataframe with the following columns:
Index(['tiername', 'month', 'network', 'specnewsmarket', 'year', 'adjeng',
'hev', 'subs', 'SN Market Grp', 'Region', 'State', 'CleanStnNm',
'Quarter', 'StnGrp', 'StnGrpOrder', 'CleanStnNm_AllStns',
'2021 YTD HEV', '2020 YTD HEV', 'yymm']
I’m trying to enforce these columns to be of a particular data type, as well as clean up some ‘false’ inputs before I do some manipulations. To do this, I read in a mapping from a CSV (pipe delimited) and then apply that mapping in a function like so:
import pandas as pd
def force_columnDtypes(df) -> pd.DataFrame:
"""Ensures datatypes are correct before pivoting."""
dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt',delimiter ='|',index_col=0)['0']
df['month'] = df['month'].astype(float) #Change month to float before converting to int
df = df.astype(dtypesMapping)
df = df.replace(['FALSE',False,'False'],'False')
return df
My mapping file can be found here.
Along with some sample data
What I’ve discovered is that the .replace call somehow messes with my datatypes (for my last column in particular). For example if I exclude the .replace it works as expected with df.dtypes resulting in: yymm object
But after the .replace call it somehow reverts it to: yymm int64
I could probably just hardcode the dtype in there, but if someone can explain to me why this is happening that would be great!
>Solution :
This is probably due to the fact that the .replace() method tries to infer the best data type for the column after the replacement operation. Since you’re replacing ‘FALSE’, False, and ‘False’ with ‘False’, pandas might infer that the column should be of integer data type if the remaining values in the column can be represented as integers.
To solve this issue, you can enforce the desired data types again after the .replace() call. Here’s an updated version of your force_columnDtypes() function:
import pandas as pd
def force_columnDtypes(df) -> pd.DataFrame:
"""Ensures datatypes are correct before pivoting."""
dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
df['month'] = df['month'].astype(float) # Change month to float before converting to int
df = df.astype(dtypesMapping)
df = df.replace(['FALSE', False, 'False'], 'False')
df = df.astype(dtypesMapping) # Enforce data types again after replace
return df
Let me know if it works or not because this could be not the only reason.