Python index is out of bounds for axis 0 with size 3

I am working on my first machine learning project and am stuck on the following problem.

I have the mock test dataframe shown below. All columns have the object dtype except 'Defect', which is an int column and the target feature.

I take the following steps:

  1. Create the dataframe
  2. Split it into X and y
  3. Build a pipeline that one-hot encodes the categorical columns
  4. Use cross-validation to measure the accuracy of the model

import pandas as pd

data = {1 : ['test', '2222', '1111', '3333', '1111'],
        2 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        3 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }

data = pd.DataFrame(data)

X = data.drop('Defect', axis = 'columns')
y = data['Defect']


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

ohe = OneHotEncoder(handle_unknown='ignore')
cat_cols = make_column_selector(dtype_include = 'object')

preprocessor = make_column_transformer((make_pipeline(ohe), cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)

Unfortunately, scores comes out as [nan nan nan], and below the output I get this error message:

… The above exception was the direct cause of the following exception: … ValueError: all features must be in [0, 2] or [-3, 0]…

Do you have any idea why this happens? If I change the dtype of one column, the code seems to work…

>Solution :

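The integer column names are the problem. ColumnTransformer interprets integer column specifiers as positions rather than labels, so the names 1, 2, 3 returned by the selector are read as positions, and position 3 does not exist in a three-column X (valid positions are 0 to 2, which is what the error message reports). Number the columns from 0 so that the names match the positions: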

#       V...look here
data = {0 : ['test', '2222', '1111', '3333', '1111'],
        1 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        2 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }
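
Another option, if you would rather not rely on names and positions lining up, is to turn the column labels into strings. The following is only a sketch under a few assumptions of mine: it renames the columns with rename(columns=str), passes the categorical columns as an explicit list of names instead of using make_column_selector, and sets error_score='raise' so that any failure shows up as an exception rather than as nan. None of these choices come from the question itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    1: ['test', '2222', '1111', '3333', '1111'],
    2: ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
    3: ['x', 'y', 'z', 't', 'x'],
    'Defect': [0, 1, 0, 1, 0],
})

# Assumption: string labels are acceptable for this dataset.
# A string label cannot be mistaken for a positional index.
data = data.rename(columns=str)

X = data.drop('Defect', axis='columns')
y = data['Defect']

# Select the categorical columns explicitly by their (string) names.
cat_cols = list(X.columns)  # ['1', '2', '3']
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), cat_cols)
)
pipe = make_pipeline(preprocessor, LogisticRegression())

# error_score='raise' re-raises the underlying exception instead of
# silently recording nan for a failed fold.
scores = cross_val_score(
    pipe, X, y, cv=3, scoring='accuracy', error_score='raise'
)
print(scores)
```

With only five rows, scikit-learn may warn that the smallest class has fewer members than the number of folds; that is a property of the mock data and not of the fix.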