Identifying the categorical columns of a dataframe

February 26, 2022

I am trying to identify the categorical columns of a dataset so that I can convert them to numerical columns. I have looked at several related questions, but I still seem to be doing something wrong.

EDITED

My code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


# Read the Churn data into a dataset (pandas) from the csv file
dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
print(dataset.head())

# Remove missing values (NaN's) from the dataset
ds = dataset.dropna()
columns = ds.columns.tolist()
# print(ds.dtypes())
print("\nColumns: {}".format(columns))

# Numerical columns
numericCols = ds._get_numeric_data().columns
print("Numerical: {}".format(numericCols))                  # 'SeniorCitizen', 'tenure', 'MonthlyCharges'

# Categorical columns
categorical = ds.select_dtypes(include=['category'])
print("Categorical: {}".format(categorical))

y = ds['Churn']          # Target
X = ds.drop('Churn', 1)  # Features ( all other than target column 'Churn' )

# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)  # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))

And it gives me this output:

   customerID  gender  SeniorCitizen  ... MonthlyCharges TotalCharges  Churn
0  7590-VHVEG  Female              0  ...          29.85        29.85     No
1  5575-GNVDE    Male              0  ...          56.95       1889.5     No
2  3668-QPYBK    Male              0  ...          53.85       108.15    Yes
3  7795-CFOCW    Male              0  ...          42.30      1840.75     No
4  9237-HQITU  Female              0  ...          70.70       151.65    Yes

[5 rows x 21 columns]
C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py:26: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
  X = ds.drop('Churn', 1)  # Features ( all other than target column 'Churn' )

Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
Numerical: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')
Categorical: Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[7043 rows x 0 columns]
Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
    logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
    estimator=estimator,
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'

Process finished with exit code 1

This means I get an empty dataframe from the line categorical = ds.select_dtypes(include=['category']), but I know that there are categorical columns because I get an error when I try to use the fit() method to do logistic regression.
Like so:

# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)  # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)

The error I get:

Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 30, in <module>
    logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
    estimator=estimator,
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: '3428-MMGUB'

If I try to include print(ds.dtypes()) in line 14, I get this output:

   customerID  gender  SeniorCitizen  ... MonthlyCharges TotalCharges  Churn
0  7590-VHVEG  Female              0  ...          29.85        29.85     No
1  5575-GNVDE    Male              0  ...          56.95       1889.5     No
2  3668-QPYBK    Male              0  ...          53.85       108.15    Yes
3  7795-CFOCW    Male              0  ...          42.30      1840.75     No
4  9237-HQITU  Female              0  ...          70.70       151.65    Yes

[5 rows x 21 columns]
Traceback (most recent call last):
  File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 14, in <module>
    print(ds.dtypes())
TypeError: 'Series' object is not callable

Process finished with exit code 1

How do I fix this? What am I doing wrong? All I am trying to do is logistic regression, but I seem to be stuck at the first step of organizing the data.

>Solution :

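Your independent features include categorical data stored as strings. The error is raised because scikit-learn cannot convert a string value such as '3428-MMGUB' (apparently a customerID) to float when training the model. The select_dtypes(include=['category']) call returns an empty dataframe because string columns read from a CSV are typically given the object dtype (or a dedicated string dtype in newer pandas), not the pandas category dtype. Separately, dtypes is an attribute rather than a method, so write ds.dtypes without parentheses.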
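To see which columns are non-numeric, here is a minimal sketch on a small made-up frame (the rows below only mimic a few columns from the question; they are not the real Lab2.csv). It uses exclude=["number"] instead of naming a specific string dtype, on the assumption that everything non-numeric should be treated as categorical:

```python
import pandas as pd

# Small made-up frame mimicking a few columns of the churn data (illustrative values only)
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# dtypes is an attribute, so no parentheses
print(df.dtypes)

# Plain strings are not the pandas 'category' dtype, so this selects nothing
as_category = df.select_dtypes(include=["category"]).columns.tolist()

# Selecting everything that is not numeric picks up the string columns
categorical_cols = df.select_dtypes(exclude=["number"]).columns.tolist()
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numeric_cols)
```

select_dtypes(include="number") is also a public replacement for the private _get_numeric_data() used in the question.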

My suggestion is to use get_dummies.

This example might help you:

import pandas as pd

r = pd.DataFrame(['France','Japan','Spain','France','USA'],columns= ['Country'])
r['gender'] = ['male','female','female','female','male']
r = pd.get_dummies(r)
r.head()

   Country_France  Country_Japan  ...  gender_female  gender_male
0               1              0  ...              0            1
1               0              1  ...              1            0
2               0              0  ...              1            0
3               1              0  ...              1            0
4               0              0  ...              0            1
[5 rows x 6 columns]

get_dummies automatically converts every string or category column using one-hot encoding; numeric columns pass through unchanged.
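On the churn data in the question, two columns probably need attention before encoding. customerID looks like a unique identifier, so encoding it would create one dummy column per customer. TotalCharges is missing from the numeric list in the output, which suggests it was read as text. The sketch below uses made-up rows and assumes blank strings are what keep TotalCharges from parsing; that is a guess, so check your own file:

```python
import pandas as pd

# Made-up rows in the shape of the churn data; the blank TotalCharges is an assumption
df = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC", "0004-DDDDD"],
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Drop the identifier: it would otherwise produce one dummy column per row
df = df.drop(columns=["customerID"])

# Coerce text to numbers; entries that cannot be parsed become NaN and are dropped
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna()

y = df["Churn"]                                             # Target, left as labels
X = pd.get_dummies(df.drop(columns=["Churn"]), dtype=int)   # One-hot encoded features

print(X.columns.tolist())
```

Passing dtype=int keeps the dummy columns as 0/1 integers; without it, recent pandas versions return booleans, which scikit-learn also accepts.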

Once you convert your categorical data you can fit the LogisticRegression.
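Putting it together, here is a sketch of the whole flow on a synthetic stand-in for the churn data. The values are random and purely illustrative, so the scores mean nothing; only the column names and the LogisticRegression settings are borrowed from the question:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (random values, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gender": rng.choice(["Female", "Male"], size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
    "tenure": rng.integers(1, 72, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n).round(2),
    "Churn": rng.choice(["No", "Yes"], size=n),
})

y = df["Churn"]                                   # Target
X = pd.get_dummies(df.drop(columns=["Churn"]))    # Features, one-hot encoded

# Split into test/training sets and fit, as in the question
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))
```

Writing drop(columns=["Churn"]) instead of drop('Churn', 1) also avoids the FutureWarning shown in the question's output.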