I am following along with my first machine learning project and am currently despairing over the following problem.
I have the mock test dataframe below; all columns have the object dtype except the column 'Defect', which is int and is the target feature.
I proceed with the following steps:
- create the dataframe
- split into X and y
- make a pipeline to one-hot encode the categorical columns
- use cross-validation to measure the accuracy of the model
import pandas as pd
data = {1 : ['test', '2222', '1111', '3333', '1111'],
2 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
3 : ['x', 'y', 'z', 't', 'x'],
'Defect': [0, 1, 0, 1, 0]
}
data = pd.DataFrame(data)
X = data.drop('Defect', axis = 'columns')
y = data['Defect']
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
ohe = OneHotEncoder(handle_unknown='ignore')
cat_cols = make_column_selector(dtype_include = 'object')
preprocessor = make_column_transformer((make_pipeline(ohe), cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)
Unfortunately my output from scores is [nan nan nan], and below the output I get the error message:
… The above exception was the direct cause of the following exception: … ValueError: all features must be in [0, 2] or [-3, 0]…
Do you have an idea why this happens? If I change the dtype of one column, the code seems to work…
> Solution:
The problem is the integer column names starting at 1. ColumnTransformer treats integer column specifiers as positional indices, not labels: after dropping 'Defect', X has three columns at positions 0–2, but the selector returns the labels [1, 2, 3], and position 3 does not exist — hence the "all features must be in [0, 2]" error. Rename the columns so the labels match their positions. Try this:
# V...look here: column keys now start at 0, matching their positions
data = {0 : ['test', '2222', '1111', '3333', '1111'],
1 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
2 : ['x', 'y', 'z', 't', 'x'],
'Defect': [0, 1, 0, 1, 0]
}
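Alternatively, you can sidestep the label-vs-position ambiguity entirely by giving the columns string names (the names `col1`/`col2`/`col3` below are just illustrative). A minimal runnable sketch of the full pipeline with that change:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same mock data, but with string column names so ColumnTransformer
# never interprets them as positional indices.
data = pd.DataFrame({
    'col1': ['test', '2222', '1111', '3333', '1111'],
    'col2': ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
    'col3': ['x', 'y', 'z', 't', 'x'],
    'Defect': [0, 1, 0, 1, 0],
})
X = data.drop('Defect', axis='columns')
y = data['Defect']

# One-hot encode every object-dtype column, then fit a logistic regression.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_include='object')),
)
pipe = make_pipeline(preprocessor, LogisticRegression())

# With this tiny dataset sklearn will warn that one class has fewer
# members than n_splits, but the scores come back as finite numbers
# rather than [nan nan nan].
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)
```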