I’m deploying a machine learning model using Python scripts (.py files) within an automated server workflow. The core of the model training process resides in model_training.py, which contains functions for data preprocessing, model training with hyperparameter optimization using Optuna, and model evaluation.
The deployment flow is orchestrated through main.py, where I execute the entire pipeline. Up until the stage where I retrieve best_params for model training, everything runs smoothly. However, at the best_params stage, the script appears to get stuck indefinitely, similar to what’s illustrated in the provided image (even when I test with n_trials=1 and early_stopping_rounds=1).
Here is model_training.py:
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import numpy as np
import optuna
from sklearn.model_selection import train_test_split
from optuna.integration import LightGBMPruningCallback
import warnings
warnings.filterwarnings("ignore", message="Found `n_estimators` in params. Will use it instead of argument")
optuna.logging.set_verbosity(optuna.logging.INFO)
seed = 42
np.random.seed(42)
def train_validation_test_split(X, y, test_size=0.2, random_state=seed):
    """
    Split input data into training and test sets.
    Parameters:
    X (array-like): The input features.
    y (array-like): The target variable.
    test_size (float): The proportion of the dataset to include in the test split.
    random_state (int): Controls the randomness of the training and testing indices.
    Returns:
    X_train (array-like): Training data for input features.
    X_test (array-like): Testing data for input features.
    y_train (array-like): Training data for target variable.
    y_test (array-like): Testing data for target variable.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
def pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols):
    """
    Generate LightGBM Datasets for training and testing data.
    Parameters:
    - X_train: training data features
    - X_test: testing data features
    - y_train: training data labels
    - y_test: testing data labels
    - cat_cols: list of categorical columns
    Returns:
    - train_data: LightGBM Dataset for training data
    - test_data: LightGBM Dataset for testing data
    """
    train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols, free_raw_data=False)
    test_data = lgb.Dataset(X_test, label=y_test, categorical_feature=cat_cols, free_raw_data=False)
    return train_data, test_data
def train_optuna_cv(train_data, n_folds=5, n_trials=1, logging_period=10, early_stopping_rounds=10):
    """
    Trains a LightGBM model using Optuna for hyperparameter optimization with cross-validation.
    Parameters:
    - train_data: LightGBM Dataset used for cross-validation.
    - n_folds: Number of folds for cross-validation (default is 5).
    - n_trials: Number of optimization trials to run (default is 1).
    - logging_period: Interval for logging evaluation metrics during training (default is 10).
    - early_stopping_rounds: Rounds to trigger early stopping if no improvement (default is 10).
    Returns:
    - best_params: Dictionary of the best hyperparameters found by Optuna.
    """
    def objective(trial):
        # Define the hyperparameter search space
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
            'learning_rate': trial.suggest_float('learning_rate', 1e-3, 5e-1, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 2, 256),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
            'num_threads': 4,
            'verbosity': -1  # Suppress internal LightGBM logging
        }
        # Perform cross-validation
        cv_results = lgb.cv(
            params,
            train_data,
            nfold=n_folds,
            stratified=False,  # Usually, stratification is not needed for regression
            shuffle=True,  # Shuffle data before splitting
            callbacks=[
                lgb.early_stopping(stopping_rounds=early_stopping_rounds),
                lgb.log_evaluation(period=logging_period),
                LightGBMPruningCallback(trial, 'rmse')
            ],
            seed=42,
        )
        # Get the best score from cross-validation
        best_score = cv_results['valid rmse-mean'][-1]
        return best_score
    # Create an Optuna study and optimize
    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=n_trials)
    # Return the best found hyperparameters
    best_params = study.best_params
    return best_params
def model_pred(best_params, train_data, val_data):
    """
    Train the LightGBM model with the best hyperparameters
    on the whole dataset and the lower and upper quantile models
    on the validation set.
    Args:
    best_params: The best hyperparameters found by Optuna.
    train_data: Training data for the LightGBM model.
    val_data: Validation data for the LightGBM model and lower/upper quantile models.
    Returns:
    best_model: The trained LightGBM model.
    """
    # Train the model
    best_model = lgb.train(best_params, train_data, valid_sets=[val_data])
    return best_model
Here’s a simplified structure of my workflow in main.py:
from model_training import train_validation_test_split, pre_lgb_dataset, train_optuna_cv, model_pred
import pandas as pd
import numpy as np
import optuna
seed = 42
np.random.seed(42)
def main():
    # Data preparation and feature engineering steps here...
    # Model Training
    X_train, X_test, y_train, y_test = train_validation_test_split(df_features, df_target)
    train_data, test_data = pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols)
    # Hyperparameter Optimization
    best_params = train_optuna_cv(train_data, n_trials=1, early_stopping_rounds=1)
    # Model Training with Best Parameters
    best_model = model_pred(best_params, train_data, test_data)
    # Further steps for model evaluation and deployment...
if __name__ == "__main__":
    main()
To debug, I tried a simplified sample_params, shown below, and it ran without any issues:
sample_params = { 'objective': 'regression', 'metric': 'rmse', 'num_leaves': 31, 'learning_rate': 0.05, 'num_threads': 4 }
- What could be causing the script to get stuck at the best_params step despite simpler configurations running fine?
- Any suggestions on how to troubleshoot or debug this issue further in an automated deployment environment?
Any insights or advice would be greatly appreciated. Thank you!
>Solution :
The issue you’re experiencing might be due to the cost of a single trial rather than to the number of trials. Even with n_trials=1 and early_stopping_rounds=1, Optuna still samples one set of hyperparameters and runs the full 5-fold lgb.cv call with it. Depending on the size of your dataset, a sampled configuration (for example, num_leaves close to 256) can take much longer than your sample_params run with num_leaves=31.
Here are some suggestions on how to troubleshoot or debug this issue:
- Logging: Add logging statements in your code to track the progress of the optimization process. This can help you identify where the process is getting stuck.
import logging
logging.basicConfig(level=logging.INFO)
- Simplify the Search Space: Reduce the complexity of the hyperparameter search space. For example, you can limit the number of leaves (num_leaves) or reduce the range of learning_rate.
- Use a Subset of Data: Try running the optimization process on a smaller subset of your data. This can help you determine if the issue is related to the size of your dataset.
- Check System Resources: Monitor the CPU and memory usage of your server during the optimization process. If your server is running out of resources, it could cause the process to hang.
- Timeout: Implement a timeout for the optimization process. This can prevent the process from running indefinitely. Optuna supports setting a timeout for the optimization process using the timeout argument in the optimize method.
study.optimize(objective, n_trials=n_trials, timeout=600) # 600 seconds = 10 minutes
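- Parallelization: If your server has multiple cores, you can use Optuna’s parallelization feature to speed up the optimization process. Each trial already uses num_threads=4 inside LightGBM, so keep n_jobs multiplied by num_threads within the number of available cores, and note that n_jobs makes no difference when n_trials=1.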
study.optimize(objective, n_trials=n_trials, n_jobs=-1) # Use all available cores
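On the timeout suggestion: as far as I can tell, Optuna checks the timeout value between trials, so it would not interrupt a single trial that never returns. A harder limit has to come from outside the Python process. The sketch below is one way to do that with the standard library only. run_with_hard_timeout is a hypothetical helper name, and the sleeping child command is just a stand-in for launching main.py.

```python
import subprocess
import sys


def run_with_hard_timeout(cmd, timeout_s):
    """Run cmd in a child process and kill it if it exceeds timeout_s seconds.

    Returns (finished, returncode); returncode is None when the child was killed.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising
        return False, None
    return True, completed.returncode


if __name__ == "__main__":
    # Stand-in for the real pipeline: a child that sleeps longer than the limit
    hung_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
    finished, returncode = run_with_hard_timeout(hung_cmd, timeout_s=1)
    print("finished:", finished, "returncode:", returncode)
```

In the automated workflow the command would presumably be something like [sys.executable, "main.py"], and an unfinished run can then be logged and reported as a failed job instead of blocking the server.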
Remember to test these changes in a controlled environment before deploying them to your production server.
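To find out where a hung run is actually stuck, Python's built-in faulthandler module can dump the traceback of every thread at a fixed interval, which works on a server with no debugger attached. This is a minimal sketch, not a drop-in fix: arm_watchdog, disarm_watchdog, and slow_step are hypothetical names, and slow_step stands in for the train_optuna_cv(...) call in main.py.

```python
import faulthandler
import sys
import time


def arm_watchdog(interval_s, file=sys.stderr):
    """Dump all thread tracebacks to `file` every interval_s seconds until disarmed."""
    # `file` must be backed by a real file descriptor (stderr or an open log file)
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def disarm_watchdog():
    """Cancel the periodic traceback dump."""
    faulthandler.cancel_dump_traceback_later()


def slow_step(seconds):
    # Hypothetical stand-in for the train_optuna_cv(...) call in main.py
    time.sleep(seconds)
    return "done"


if __name__ == "__main__":
    arm_watchdog(interval_s=1)
    try:
        slow_step(2.5)  # periodic dumps on stderr point at the time.sleep line
    finally:
        disarm_watchdog()
```

If the dumps keep pointing at the lgb.cv call, the time is being spent (or the hang is happening) inside LightGBM's native code, which narrows the search to thread settings, dataset size, or system resources rather than Optuna itself.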