Confusing conversion of types in pandas DataFrame

Suppose I have a list of lists of numbers that happen to be encoded as strings.

import pandas as pd
pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

Now I want a dataframe where these numbers are integers.
I can see from pandas documentation that I can force a data type via dtype, but when I run the following:

pyframe_1 = pd.DataFrame(pylist, dtype=int)

I get the following warning:

FutureWarning: Could not cast to int32, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.

and by inspection via dtypes:

pytypes_1 = pyframe_1.dtypes.to_list() # dtype[object_] of numpy module

my columns still have the NumPy object dtype.

But I can cast my columns to integer in two ways.

The first is column by column:

pyframe_2 = pd.DataFrame(pylist)
pyframe_2[0] = pyframe_2[0].astype(int)
pyframe_2[1] = pyframe_2[1].astype(int)

The second is on the entire dataframe in a one-liner:

pyframe_3 = pd.DataFrame(pylist).astype(int)

Both give me a dataframe of integer columns from a list of lists of strings.
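As a quick sanity check on the one-liner, the sketch below confirms that every column ends up with an integer dtype and that arithmetic works on the result. The names converted, all_integer and column_sums are only illustrative, and the exact integer width (int32 or int64) depends on the platform, so the check uses pd.api.types.is_integer_dtype rather than comparing against one specific dtype.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Build the frame from strings, then cast every column in one call.
converted = pd.DataFrame(pylist).astype(int)

# True when each column holds some integer dtype (width is platform dependent).
all_integer = all(pd.api.types.is_integer_dtype(dtype) for dtype in converted.dtypes)

# Summing only makes numeric sense once the strings have been converted.
column_sums = converted.sum().to_list()

print(all_integer)
print(column_sums)
```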

My question is: why does the first case, where I explicitly pass dtype when creating the dataframe, raise a warning (or an error) and leave the types unconverted? Why even offer dtype as an option in the first place?

EDIT:
Pandas version I’m running is 1.4.1.

EDIT:
As suggested by @mozway, one workaround is to use
pyframe_1 = pd.DataFrame(pylist, dtype='Int32')

This does convert to integer. To me it feels unnatural to force the cast with a string alias ('Int32') instead of the much more intuitive int. Inspecting dtypes for each method, I get different integer types.
Casting with dtype='Int32' at instantiation gives me an Int32Dtype object from the pandas.core.arrays.integer module. (On closer inspection, it has a numpy_dtype attribute, which is a dtype[int32] object from the numpy module.)
Casting with .astype(int) gives me a dtype[int32] object from the numpy module. So there is not much difference, I guess? I'm not sure.
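To make the comparison concrete, here is a minimal sketch that puts the two dtypes side by side. It goes through .astype rather than the constructor so that it does not depend on version-specific constructor behaviour, and the names numpy_frame and nullable_frame are only illustrative. The practical difference it shows is that 'Int32' is the pandas nullable integer type (the string is an alias for pd.Int32Dtype()), which can hold missing values, while its numpy_dtype attribute points at the ordinary NumPy int32.

```python
import numpy as np
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Plain NumPy-backed 32-bit integers.
numpy_frame = pd.DataFrame(pylist).astype("int32")

# pandas nullable integers; "Int32" is the string alias of pd.Int32Dtype().
nullable_frame = numpy_frame.astype("Int32")

numpy_dtype = numpy_frame.dtypes.iloc[0]
nullable_dtype = nullable_frame.dtypes.iloc[0]

print(numpy_dtype)                  # NumPy dtype
print(nullable_dtype)               # pandas extension dtype
print(nullable_dtype.numpy_dtype)   # the NumPy dtype behind the extension dtype

# The nullable type can store a missing value without changing dtype.
nullable_frame.loc[0, 0] = pd.NA
still_nullable_int = nullable_frame.dtypes.iloc[0] == pd.Int32Dtype()
print(still_nullable_int)
```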

Solution:

I would consider it a bug. During instantiation, the input goes through several internal checks (_homogenize / sanitize_array / _try_cast). I believe an intermediate float dtype is created along the way, which triggers this error on pandas 2.2:

ValueError: Trying to coerce float values to integers
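Because the outcome differs between versions (a FutureWarning on 1.4.1 according to the question, a ValueError on 2.2 according to this answer), a small probe can report what the installed pandas actually does. This is only a sketch: probe_constructor_cast is a hypothetical helper, not part of pandas, and it assumes a failure surfaces as a TypeError or ValueError.

```python
import warnings

import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]


def probe_constructor_cast(rows):
    """Describe how pd.DataFrame(rows, dtype=int) behaves on this install."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        try:
            frame = pd.DataFrame(rows, dtype=int)
        except (TypeError, ValueError) as exc:
            return f"raised {type(exc).__name__}: {exc}"
    # No exception: report the resulting dtypes and any warning categories.
    dtypes = sorted({str(dtype) for dtype in frame.dtypes})
    warned = sorted({w.category.__name__ for w in caught})
    return f"dtypes={dtypes}, warnings={warned}"


outcome = probe_constructor_cast(pylist)
print(pd.__version__, outcome)
```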

A workaround would be to use:

pd.DataFrame(pylist, dtype='Int32')

   0   1
0  1  43
1  2  42
2  3  41
3  4  40
4  5  39
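If plain NumPy integers are preferred over the nullable 'Int32' type, another option (not mentioned above, so treat it as an alternative sketch rather than the recommended fix) is to convert after construction with pd.to_numeric applied to each column. The name numeric_frame is only illustrative.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Build with the default object dtype, then let pandas infer a numeric
# dtype for each column separately.
numeric_frame = pd.DataFrame(pylist).apply(pd.to_numeric)

first_row = numeric_frame.iloc[0].to_list()
print(numeric_frame.dtypes)
print(first_row)
```

pd.to_numeric also accepts errors="coerce", which turns unparseable strings into NaN instead of raising; in that case the affected column becomes float rather than integer.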