Reading a dataset in pandas

August 30, 2022

I am trying to do a pretty simple task, but I cannot understand pandas' behavior.
I am reading a UCI dataset in Python using pandas:

import pandas as pd

data = pd.read_csv('UCI/breast-cancer-wisconsin.data', header=None)

Printing out the first row:

data.values[0]
# array([1000025, 5, 1, 1, 1, 2, '1', 3, 1, 1, 2], dtype=object)

Why does it read the 7th column as a string?
I tried the following:

print(pd.api.types.infer_dtype(data[6]))  # returns 'string'

It is a pretty simple dataset, downloaded directly from this link, and all values appear to be integers to me. So why is the 7th column (index 6) interpreted as a string?

>Solution :

Check the unique values: the file uses the character ? to mark missing values, so the whole column is parsed as strings:

data = pd.read_csv('breast-cancer-wisconsin.data', header=None)[6]
print(data.unique())
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
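
When the column has many distinct values, scanning unique() by eye gets tedious. A minimal sketch of an alternative, assuming a small inline sample in the same 11-column layout instead of the real file (the sample rows and the helper name non_numeric_values are only for illustration): coerce each non-numeric column with pd.to_numeric and list whatever fails to convert.

```python
import io
import pandas as pd

# Illustrative three-row sample in the same layout; not the real file.
sample = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
)
df = pd.read_csv(io.StringIO(sample), header=None)


def non_numeric_values(series):
    """Return the distinct entries of a column that cannot be parsed as numbers."""
    converted = pd.to_numeric(series, errors="coerce")
    # Entries that were present but turned into NaN are the offenders.
    mask = converted.isna() & series.notna()
    return sorted(series[mask].unique())


# Only inspect columns that pandas did not already parse as numeric.
bad = {
    col: non_numeric_values(df[col])
    for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
}
print(bad)
```

On this sample the only offending entry is the ? in column 6.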

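The solution is to pass the parameter na_values='?' so that ? is read as a missing value (NaN). Column 6 then becomes float64, because NaN cannot be stored in an int64 column: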

data = pd.read_csv('breast-cancer-wisconsin.data', header=None, na_values='?')
print(data.dtypes)
0       int64
1       int64
2       int64
3       int64
4       int64
5       int64
6     float64
7       int64
8       int64
9       int64
10      int64
dtype: object
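
If an integer column is preferred over float64, one option is pandas' nullable Int64 dtype, which can hold missing values. A short sketch under the same assumption as before (an illustrative inline sample rather than the real file):

```python
import io
import pandas as pd

# Illustrative three-row sample in the same layout; not the real file.
sample = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
)

# '?' is treated as missing, so column 6 is parsed as float64 with one NaN.
df = pd.read_csv(io.StringIO(sample), header=None, na_values="?")
dtype_after_read = str(df[6].dtype)

# Convert to the nullable integer dtype; the missing entry becomes <NA>.
df[6] = df[6].astype("Int64")
missing_count = int(df[6].isna().sum())

print(dtype_after_read, df[6].dtype, missing_count)
```

The conversion works here because every non-missing value in the column is a whole number; rows with a missing value can then be dropped with dropna() or filled, depending on the analysis.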