Confusing conversion of types in pandas DataFrame


Suppose I have a list of lists of numbers that happen to be encoded as strings.

import pandas as pd
pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

Now I want a dataframe where these numbers are integers.
I can see from the pandas documentation that I can force a data type via dtype, but when I run the following:

pyframe_1 = pd.DataFrame(pylist, dtype=int)

I get the following warning:

FutureWarning: Could not cast to int32, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.

and by inspection via dtypes:

pytypes_1 = pyframe_1.dtypes.to_list() # dtype[object_] of numpy module

my columns have the NumPy object dtype.

But I can cast my columns to integers in two ways:

First one is column by column:

pyframe_2 = pd.DataFrame(pylist)
pyframe_2[0] = pyframe_2[0].astype(int)
pyframe_2[1] = pyframe_2[1].astype(int)

Second one is on the entire dataframe in a one-liner:

pyframe_3 = pd.DataFrame(pylist).astype(int)

Both give me a dataframe of integer columns from a list of lists of strings.
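As a quick sanity check, the two approaches can be compared directly. This is only a sketch: the names frame_a, frame_b, same and all_int are made up for illustration, and the loop simply generalizes the column-by-column version to any number of columns.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Column by column: replace each object column with its integer version.
frame_a = pd.DataFrame(pylist)
for col in frame_a.columns:
    frame_a[col] = frame_a[col].astype(int)

# Whole frame at once.
frame_b = pd.DataFrame(pylist).astype(int)

# DataFrame.equals compares both the values and the column dtypes.
same = frame_a.equals(frame_b)

# True when every column ended up with an integer dtype.
all_int = all(pd.api.types.is_integer_dtype(t) for t in frame_b.dtypes)

print(same, all_int)
```

The exact integer width (int32 or int64) depends on the platform and the NumPy version, so the check above only asks whether the dtype is some integer type.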

My question is: why does the first case, where I explicitly pass dtype when creating the dataframe, raise a warning (or error) and perform no conversion? Why even offer it as an option in the first place?

EDIT:
Pandas version I’m running is 1.4.1.

EDIT:
As per suggestions of @mozway one workaround is using
pyframe_1 = pd.DataFrame(pylist,dtype='Int32')

This does convert to integers. To me it feels unnatural to pass a string ('Int32') to force a cast instead of the more intuitive int. Inspecting the dtypes produced by the two methods, I get different integer types.
Casting with dtype='Int32' at instantiation gives me an Int32Dtype object from the pandas.core.arrays.integer module. (On closer inspection, it has a numpy_dtype attribute, which is a dtype[int32] object from the numpy module.)
Casting with .astype(int) gives me a dtype[int32] object from the numpy module. So there’s not much difference, I guess? I don't know.
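One practical difference between the two types is that 'Int32' is a pandas nullable extension dtype, while int32 is a plain NumPy dtype. The sketch below illustrates this on small hand-made Series rather than on the frames above; the variable names are hypothetical. Note also that .astype(int) uses the platform default integer, so it may give int64 rather than int32 on other systems.

```python
import numpy as np
import pandas as pd

# Plain NumPy-backed integers: no room for missing values.
s_numpy = pd.Series([1, 2, 3], dtype='int32')

# Nullable extension dtype: a missing entry is stored as pd.NA
# and the column stays integer.
s_nullable = pd.Series([1, 2, None], dtype='Int32')

# Without an explicit dtype, the same data is upcast to float64
# so that the missing entry can be represented as NaN.
s_default = pd.Series([1, 2, None])

print(s_numpy.dtype)                  # NumPy dtype
print(s_nullable.dtype)               # pandas Int32Dtype
print(s_nullable.dtype.numpy_dtype)   # the underlying NumPy dtype
print(s_default.dtype)
```

So the two are close for clean data like the list in the question, and they diverge once missing values are involved.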

>Solution :

I would consider it a bug. During instantiation, the input goes through several checks (_homogenize / sanitize_array / _try_cast). I believe an intermediate float dtype is created, which triggers the error (on pandas 2.2):

ValueError: Trying to coerce float values to integers

A workaround would be to use:

pd.DataFrame(pylist, dtype='Int32')

   0   1
0  1  43
1  2  42
2  3  41
3  4  40
4  5  39
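Two other routes sidestep the constructor's dtype argument entirely. They are not part of the answer above, just a sketch using standard calls: pd.to_numeric applied per column, or parsing the strings in plain Python before building the frame. The names converted and pre_parsed are made up for illustration.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Build an object-dtype frame first, then let pandas parse each column.
converted = pd.DataFrame(pylist).apply(pd.to_numeric)

# Or parse the strings in Python so the constructor only ever sees ints.
pre_parsed = pd.DataFrame([[int(x) for x in row] for row in pylist])

print(converted.dtypes)
print(pre_parsed.dtypes)
```

Both give NumPy-backed integer columns rather than the nullable Int32 type.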
