I am following along with my first machine learning project and am currently despairing over the following problem.
I have the mock test dataframe below; all columns have the object dtype except the column 'Defect', which is int and is the target feature.
I proceed with the following steps:
- create the dataframe
- split into X and y
- make a pipeline to one-hot encode the categorical columns
- use cross-validation to measure the accuracy of the model
import pandas as pd
data = {1 : ['test', '2222', '1111', '3333', '1111'],
2 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
3 : ['x', 'y', 'z', 't', 'x'],
'Defect': [0, 1, 0, 1, 0]
}
data = pd.DataFrame(data)
X = data.drop('Defect', axis = 'columns')
y = data['Defect']
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
ohe = OneHotEncoder(handle_unknown='ignore')
cat_cols = make_column_selector(dtype_include = 'object')
preprocessor = make_column_transformer((make_pipeline(ohe), cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)
Unfortunately my output from scores is [nan nan nan], and below the output I get the error message:
… The above exception was the direct cause of the following exception: … ValueError: all features must be in [0, 2] or [-3, 0]…
Do you have an idea why this happens? If I change the dtype of one column, the code seems to work…
> Solution:
The problem is the integer column names starting at 1. ColumnTransformer treats integer column specifiers as positional indices, not labels: after dropping 'Defect', X has three columns at positions 0–2, but the selector returns the labels [1, 2, 3], and position 3 does not exist — hence the "all features must be in [0, 2]" error. Rename the columns so the labels match their positions. Try this:
# V...look here: column keys now start at 0, matching their positions
data = {0 : ['test', '2222', '1111', '3333', '1111'],
1 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
2 : ['x', 'y', 'z', 't', 'x'],
'Defect': [0, 1, 0, 1, 0]
}
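Alternatively, you can sidestep the label-vs-position ambiguity entirely by giving the columns string names (the names `col1`/`col2`/`col3` below are just illustrative). A minimal runnable sketch of the full pipeline with that change:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same mock data, but with string column names so ColumnTransformer
# never interprets them as positional indices.
data = pd.DataFrame({
    'col1': ['test', '2222', '1111', '3333', '1111'],
    'col2': ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
    'col3': ['x', 'y', 'z', 't', 'x'],
    'Defect': [0, 1, 0, 1, 0],
})
X = data.drop('Defect', axis='columns')
y = data['Defect']

# One-hot encode every object-dtype column, then fit a logistic regression.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_include='object')),
)
pipe = make_pipeline(preprocessor, LogisticRegression())

# With this tiny dataset sklearn will warn that one class has fewer
# members than n_splits, but the scores come back as finite numbers
# rather than [nan nan nan].
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)
```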