Suppose I have a list of lists of numbers that happen to be encoded as strings.
import pandas as pd
pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]
Now I want a dataframe where these numbers are integers.
I can see from the pandas documentation that I can force a data type via dtype, but when I run the following:
pyframe_1 = pd.DataFrame(pylist,dtype=int)
I get the following warning:
FutureWarning: Could not cast to int32, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.
and by inspection via dtypes:
pytypes_1 = pyframe_1.dtypes.to_list() # dtype[object_] of numpy module
my columns are np.object types.
But I can cast my columns to integers in two ways:
The first is column by column:
pyframe_2 = pd.DataFrame(pylist)
pyframe_2[0] = pyframe_2[0].astype(int)
pyframe_2[1] = pyframe_2[1].astype(int)
The second is on the entire dataframe in a one-liner:
pyframe_3 = pd.DataFrame(pylist).astype(int)
Both give me a dataframe of integer columns from a list of lists of strings.
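For reference, the two casts above can be checked side by side. The sketch below also shows pd.to_numeric applied per column as a third route; that function is not used in the question and is included only as one more option that may be convenient. The resulting integer width (int32 or int64) depends on the platform.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Route 1: cast column by column.
by_column = pd.DataFrame(pylist)
for col in by_column.columns:
    by_column[col] = by_column[col].astype(int)

# Route 2: cast the whole frame at once.
whole_frame = pd.DataFrame(pylist).astype(int)

# Route 3 (an addition, not from the question): parse each column with pd.to_numeric.
parsed = pd.DataFrame(pylist).apply(pd.to_numeric)

print(by_column.dtypes.to_list())
print(whole_frame.equals(by_column))
print(parsed.dtypes.to_list())
```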
My question is: why does the first case, where I explicitly pass dtype when creating the dataframe, raise a warning (or error) and leave the types unconverted? Why even have it as an option in the first place?
EDIT:
The pandas version I’m running is 1.4.1.
EDIT:
As per the suggestion of @mozway, one workaround is using
pyframe_1 = pd.DataFrame(pylist,dtype='Int32')
This does convert to integers. To me it feels kind of unnatural to use a string (which 'Int32' is) to force a cast instead of the much more intuitive int. Inspecting dtypes for each method, I get different integer types.
Casting with dtype='Int32' at instantiation gives me an Int32Dtype object from the pandas.core.arrays.integer module. (On closer inspection, it has a numpy_dtype attribute, which is a dtype[int32] object from the numpy module.)
Casting with .astype(int) gives me a dtype[int32] object from the numpy module. So there’s not much difference, I guess? IDK.
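One practical difference between the two results is that 'Int32' is a pandas nullable extension dtype, while int maps to a plain NumPy integer dtype. A minimal sketch of what that means for missing values; the sample values are made up for illustration:

```python
import pandas as pd

# Nullable extension dtype: the missing value is kept and the dtype stays integer.
nullable = pd.Series([1, 2, None], dtype='Int32')
print(nullable.dtype)
print(nullable.isna().sum())

# NumPy-backed integers cannot represent a missing value, so this cast fails.
try:
    pd.Series([1, 2, None]).astype(int)
    numpy_cast_ok = True
except (ValueError, TypeError):
    numpy_cast_ok = False
print(numpy_cast_ok)

# Without missing values, both dtypes hold the same numbers.
same_values = pd.Series([1, 2, 3], dtype='Int32').astype(int).tolist() == [1, 2, 3]
print(same_values)
```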
>Solution :
I would consider it a bug. During instantiation, the input goes through several checks (_homogenize / sanitize_array / _try_cast). I believe an intermediate float dtype is created, which triggers the error (on pandas 2.2):
ValueError: Trying to coerce float values to integers
A workaround would be to use:
pd.DataFrame(pylist, dtype='Int32')
   0   1
0  1  43
1  2  42
2  3  41
3  4  40
4  5  39
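If downstream code expects a NumPy-backed dtype rather than the nullable one, the workaround can be followed by a second cast. This is only a sketch: it assumes the data contains no missing values, and the 'int32' target is just one possible choice.

```python
import pandas as pd

pylist = [['1', '43'], ['2', '42'], ['3', '41'], ['4', '40'], ['5', '39']]

# Workaround from the answer: nullable integer dtype at construction time.
nullable_frame = pd.DataFrame(pylist, dtype='Int32')

# Optional second step (assumes no missing values): switch to NumPy int32.
numpy_frame = nullable_frame.astype('int32')

print(nullable_frame.dtypes.to_list())
print(numpy_frame.dtypes.to_list())
```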