I am trying to do a pretty simple task, but I am unable to understand pandas' behavior.
I am reading a UCI dataset in Python using pandas:
import pandas as pd
data = pd.read_csv('UCI/breast-cancer-wisconsin.data', header=None)
Printing out the first row:
data.values[0]
# array([1000025, 5, 1, 1, 1, 2, '1', 3, 1, 1, 2], dtype=object)
Why does it read the 7th column (index 6) as a string?
I tried the following:
print(pd.api.types.infer_dtype(data[6]))  # returns 'string'
It is a pretty simple dataset, downloaded directly from this link, and all values appear to be integers to me. Then why is column 6 (the 7th column) interpreted as a string?
>Solution :
Check the unique values: the column contains the character ? as a placeholder for missing values, so the whole column is parsed as strings:
data = pd.read_csv('breast-cancer-wisconsin.data', header=None)[6]
print(data.unique())
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
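When a column has too many distinct values to scan by eye, a rough way to isolate the offending entries is to coerce the column to numbers and keep whatever fails to convert. This is only a sketch: the Series below is a made-up stand-in for column 6, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for column 6 as it is read without na_values.
col = pd.Series(['1', '10', '2', '?', '5', '?'])

# errors='coerce' turns anything that is not a number into NaN,
# so the NaN positions mark the values that blocked numeric parsing.
not_numeric = col[pd.to_numeric(col, errors='coerce').isna()]

print(not_numeric.unique())
```

Note that this check also flags entries that were already missing, so it works best on a column that was read as strings.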
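The solution is to pass na_values='?' so that ? is converted to missing values (NaN). The column is then parsed as float64 rather than int64, because NaN is a floating-point value: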
data = pd.read_csv('breast-cancer-wisconsin.data', header=None, na_values='?')
print(data.dtypes)
0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 float64
7 int64
8 int64
9 int64
10 int64
dtype: object
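The effect of na_values can be reproduced without the file. The sketch below uses a small in-memory sample in the same layout as the dataset (the rows are illustrative only). It also shows an optional conversion to the nullable "Int64" dtype, which is an addition beyond the answer above and is useful if you want integers that still allow missing values.

```python
import io
import pandas as pd

# Illustrative sample; the second row has '?' in column 6.
text = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
)

# Without na_values, the '?' forces column 6 to be read as strings.
raw = pd.read_csv(io.StringIO(text), header=None)
print(pd.api.types.is_numeric_dtype(raw[6]))

# With na_values='?', the placeholder becomes NaN and the column is numeric.
clean = pd.read_csv(io.StringIO(text), header=None, na_values='?')
print(clean[6].dtype)
print(clean[6].isna().sum())

# Optional: nullable integer dtype keeps integers and marks the gap as <NA>.
as_int = clean[6].astype('Int64')
print(as_int.dtype)
```

Whether to keep float64 or switch to Int64 depends on what the downstream code expects; many numeric libraries handle NaN in float columns more readily than pandas' nullable integers.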