From my Machine Learning course, I know that 100% accuracy is usually a sign of overfitting, but I don't understand why I am getting it here.
# Gaussian Naive Bayes
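from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
# Fit on the training split only
gnb = GaussianNB().fit(X_train, y_train)
# Hard class predictions and class probabilities for the held-out split
gnb_predictions = gnb.predict(X_test)
gnb_probs = gnb.predict_proba(X_test)
# Fraction of test rows classified correctly
gnb_acc = accuracy_score(y_test, gnb_predictions)
print(gnb_acc)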
I am testing on unseen data. The data was artificially generated, and I don't think it contains any noise. It has 10k entries, but around half of them are NaNs. Can that be the problem? Or is it due to how I preprocess the data? During preprocessing I use LabelEncoder() because the features are a mix of strings and floats. Apart from that, I am splitting the usual way:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=10)
>Solution :
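There may be three reasons:
- A very small data set (5-10 rows), where a perfect score can happen by chance
- All targets are equal, so always predicting that single class is always right
- Very high correlation between the features and the target, for example a feature that directly encodes the label
Please show the dataset.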
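If it helps, the three causes above can be checked with something like the following. This is only a rough sketch in plain Python, not a definitive test: the helper name `diagnose_perfect_accuracy` and the toy data are made up for illustration, and it assumes each row is a list of discrete (for example label-encoded) feature values. Continuous floats would probably need to be binned first, since unseen values just fall back to the majority class.

```python
from collections import Counter, defaultdict


def diagnose_perfect_accuracy(X_train, y_train, X_test, y_test):
    """Hypothetical helper: quick checks for the three causes listed above."""
    report = {}

    # 1. Tiny data set: with only a handful of rows, 100% can be luck.
    report["n_train"] = len(y_train)
    report["n_test"] = len(y_test)

    # 2. All (or almost all) targets equal: compare against a baseline
    #    that always predicts the most frequent training label.
    majority_label = Counter(y_train).most_common(1)[0][0]
    report["test_class_counts"] = dict(Counter(y_test))
    report["baseline_accuracy"] = (
        sum(1 for label in y_test if label == majority_label) / len(y_test)
    )

    # 3. A feature that gives the target away: for each column, build a
    #    value -> most common label lookup on the training rows and score
    #    it on the test rows. A column scoring ~1.0 on its own is suspicious.
    single_feature_accuracy = []
    for j in range(len(X_train[0])):
        votes = defaultdict(Counter)
        for row, label in zip(X_train, y_train):
            votes[row[j]][label] += 1
        lookup = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        hits = sum(
            1
            for row, label in zip(X_test, y_test)
            if lookup.get(row[j], majority_label) == label
        )
        single_feature_accuracy.append(hits / len(y_test))
    report["single_feature_accuracy"] = single_feature_accuracy
    return report


# Toy data (made up): column 0 is an exact copy of the label.
X_train = [[0, 5], [1, 5], [0, 6], [1, 6], [1, 5]]
y_train = [0, 1, 0, 1, 1]
X_test = [[0, 6], [1, 5], [1, 6]]
y_test = [0, 1, 1]

report = diagnose_perfect_accuracy(X_train, y_train, X_test, y_test)
for key, value in report.items():
    print(key, value)
```

If `baseline_accuracy` is already close to 1, the classes are too unbalanced for accuracy to mean much. If a single column reaches about 1.0 by itself, that column is probably leaking the target.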