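Hyperparameter Tuning Methods & Cross-Validation#
The examples below combine each search method directly with cross-validation.
The algorithm is Elastic Net, which was introduced in the previous part.
The dataset is again diabetes, one of scikit-learn's built-in toy datasets.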
import numpy as np
import pandas as pd
from hyperopt import hp, tpe, fmin, Trials, space_eval
from hyperopt.pyll.base import scope
from scipy.stats import uniform
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
Load the dataset:
# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
Split the dataset into train / test sets:
# 80% for training data, 20% for testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The first number is the number of rows, the second the number of columns
print(X.shape, X_train.shape, X_test.shape)
print(y.shape, y_train.shape, y_test.shape)
(442, 10) (353, 10) (89, 10)
(442,) (353,) (89,)
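The sizes above follow from how train_test_split rounds a fractional test_size: the test set takes the ceiling of test_size * n_samples and the training set takes the rest. A quick check of that arithmetic, as a sketch (the names n_samples, n_test and n_train are only for illustration):

```python
import math

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_samples = X.shape[0]                # number of rows in the diabetes dataset
n_test = math.ceil(0.2 * n_samples)   # a fractional test_size is rounded up
n_train = n_samples - n_test          # the remainder goes to training

print(n_test, n_train)                     # sizes implied by the arithmetic
print(X_test.shape[0], X_train.shape[0])   # sizes actually returned
```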
Grid search#
First, define the hyperparameters to tune by listing every candidate value explicitly. Note that the grid must be given as a dictionary:
# Define a range of values for alpha and l1_ratio
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1]
}
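To see exactly which combinations such a dictionary expands to, scikit-learn's ParameterGrid can enumerate them. The following is a small sketch; the variable name candidates is only for illustration:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1]
}

# ParameterGrid yields one dict per combination (Cartesian product of the lists)
candidates = list(ParameterGrid(param_grid))

print(len(candidates))   # 4 values of alpha * 6 values of l1_ratio
print(candidates[0])     # the first combination that would be evaluated
```

The number of combinations grows multiplicatively with every hyperparameter added, which is the main cost of grid search.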
Next, define the estimator. In scikit-learn's terminology, an estimator is a machine learning model.
# Create ElasticNet model
elasticnet = ElasticNet()
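Create a GridSearchCV object, passing in the hyperparameter grid, the estimator, cv, scoring, and the other arguments.
cv=5 means 5-fold cross-validation.
Note that for "scoring", a higher value always means a better model, which is the opposite direction of a loss, so "neg_mean_squared_error" is used.
In addition, "refit=True" means that once cross-validation has found the best hyperparameter combination, the model is fit once more on the entire training dataset with that combination.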
# Perform grid search
grid_search = GridSearchCV(
    estimator=elasticnet,
    param_grid=param_grid,
    cv=5, # An int means K-fold cross-validation with that many folds.
    scoring='neg_mean_squared_error',
    refit=True
)
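For a regressor such as ElasticNet, an integer cv is shorthand for a KFold splitter without shuffling. The sketch below shows what that splitter does with 353 training rows; X_dummy and fold_sizes are illustrative names, and the zero-filled array only stands in for the real training data:

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 353                       # number of training rows in this example
X_dummy = np.zeros((n_train, 1))    # placeholder; only the row count matters

# For a regressor, cv=5 corresponds to KFold(n_splits=5) without shuffling
kfold = KFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_dummy)):
    # Each round trains on 4 folds and validates on the remaining one
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={len(train_idx)}, validation={len(val_idx)}")
```

A splitter object such as KFold(n_splits=5, shuffle=True, random_state=0) can be passed as cv instead of the integer when shuffled folds are wanted.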
Once created, the GridSearchCV object behaves like an estimator: calling ".fit" runs the search.
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=ElasticNet(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 1.0],
                         'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1]},
             scoring='neg_mean_squared_error')
Inspect the hyperparameters that grid search selected:
# the best hyperparameters are stored in an attribute
grid_search.best_params_
{'alpha': 0.001, 'l1_ratio': 0.7}
The result of every validation run is also available:
results = pd.DataFrame(grid_search.cv_results_)
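There are 4 * 6 = 24 hyperparameter combinations and cv = 5, so the search trains 24 * 5 = 120 models in total (plus one final refit on the full training set, since refit=True).
Each combination is trained 5 times; each time, 4 of the 5 folds are used for training and the remaining fold is used for validation.
The mean (mean_test_score) and standard deviation (std_test_score) are then computed from the 5 validation scores of each combination.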
results[['params', 'mean_test_score', 'std_test_score']]
| | params | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | {'alpha': 0.001, 'l1_ratio': 0.1} | -3187.882837 | 315.570256 | 
| 1 | {'alpha': 0.001, 'l1_ratio': 0.3} | -3159.191940 | 303.050408 | 
| 2 | {'alpha': 0.001, 'l1_ratio': 0.5} | -3136.284418 | 293.575444 | 
| 3 | {'alpha': 0.001, 'l1_ratio': 0.7} | -3123.123456 | 288.830194 | 
| 4 | {'alpha': 0.001, 'l1_ratio': 0.9} | -3126.834389 | 294.232405 | 
| 5 | {'alpha': 0.001, 'l1_ratio': 1} | -3141.471228 | 349.349112 | 
| 6 | {'alpha': 0.01, 'l1_ratio': 0.1} | -4277.110727 | 751.094152 | 
| 7 | {'alpha': 0.01, 'l1_ratio': 0.3} | -4086.935171 | 687.834101 | 
| 8 | {'alpha': 0.01, 'l1_ratio': 0.5} | -3852.786188 | 602.741576 | 
| 9 | {'alpha': 0.01, 'l1_ratio': 0.7} | -3559.344994 | 483.356143 | 
| 10 | {'alpha': 0.01, 'l1_ratio': 0.9} | -3209.845750 | 324.280123 | 
| 11 | {'alpha': 0.01, 'l1_ratio': 1} | -3137.393500 | 305.636041 | 
| 12 | {'alpha': 0.1, 'l1_ratio': 0.1} | -5748.957646 | 1111.435643 | 
| 13 | {'alpha': 0.1, 'l1_ratio': 0.3} | -5661.121161 | 1095.145007 | 
| 14 | {'alpha': 0.1, 'l1_ratio': 0.5} | -5516.401740 | 1067.196600 | 
| 15 | {'alpha': 0.1, 'l1_ratio': 0.7} | -5233.568420 | 1008.164063 | 
| 16 | {'alpha': 0.1, 'l1_ratio': 0.9} | -4433.090076 | 802.242986 | 
| 17 | {'alpha': 0.1, 'l1_ratio': 1} | -3128.888539 | 259.324330 | 
| 18 | {'alpha': 1.0, 'l1_ratio': 0.1} | -6092.146117 | 1172.011640 | 
| 19 | {'alpha': 1.0, 'l1_ratio': 0.3} | -6087.807660 | 1171.691388 | 
| 20 | {'alpha': 1.0, 'l1_ratio': 0.5} | -6080.064280 | 1171.113288 | 
| 21 | {'alpha': 1.0, 'l1_ratio': 0.7} | -6061.185941 | 1169.776629 | 
| 22 | {'alpha': 1.0, 'l1_ratio': 0.9} | -5968.311222 | 1162.890744 | 
| 23 | {'alpha': 1.0, 'l1_ratio': 1} | -3997.783751 | 673.425013 | 
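Because the scores are negated MSE values, the table is easier to read after flipping the sign and sorting by rank. The following sketch rebuilds the search so that it runs on its own; the column name mean_test_mse and the variable ranked are additions for illustration, while rank_test_score is a column that cv_results_ already provides:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1]
}
grid_search = GridSearchCV(
    ElasticNet(), param_grid, cv=5, scoring='neg_mean_squared_error'
)
grid_search.fit(X_train, y_train)

results = pd.DataFrame(grid_search.cv_results_)
# Flip the sign so the column reads as an ordinary MSE (lower is better)
results['mean_test_mse'] = -results['mean_test_score']
# rank_test_score == 1 marks the best combination
ranked = results.sort_values('rank_test_score')[
    ['params', 'mean_test_mse', 'std_test_score', 'rank_test_score']
]
print(ranked.head())
```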
Because refit=True, ".best_estimator_" gives direct access to the estimator that was refit on the entire training dataset, together with its fitted attributes.
# Printing the coefficients
print("Coefficients:", grid_search.best_estimator_.coef_)
print("Intercept:", grid_search.best_estimator_.intercept_)
print("n of iteration:", grid_search.best_estimator_.n_iter_)
Coefficients: [  42.58661138 -203.18317615  502.36677313  315.50724924 -104.02651837
  -86.91840472 -191.27151509  150.11457801  389.19719854   80.89928556]
Intercept: 151.468655969913
n of iteration: 30
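What refit=True does can be reproduced by hand: fit a fresh ElasticNet with the selected hyperparameters on the whole training set. The sketch below checks that both routes give the same coefficients. It uses a reduced grid to keep the run short, and the names manual, same_coef and same_pred are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (an assumption for brevity, not the grid used above)
small_grid = {'alpha': [0.001, 0.1], 'l1_ratio': [0.5, 1]}
grid_search = GridSearchCV(
    ElasticNet(), small_grid, cv=5,
    scoring='neg_mean_squared_error', refit=True
)
grid_search.fit(X_train, y_train)

# Manual equivalent of refit=True: refit on all training rows
manual = ElasticNet(**grid_search.best_params_).fit(X_train, y_train)
same_coef = np.allclose(manual.coef_, grid_search.best_estimator_.coef_)

# Calling predict on the search object delegates to best_estimator_
same_pred = np.allclose(
    grid_search.predict(X_test),
    grid_search.best_estimator_.predict(X_test)
)
print(same_coef, same_pred)
```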
Finally, evaluate performance on the testing data:
# Make predictions
y_pred_best_elasticnet = grid_search.best_estimator_.predict(X_test)
# Calculate Mean Squared Error (MSE) on test set
mse_best_elasticnet = mean_squared_error(y_test, y_pred_best_elasticnet)
print("Best ElasticNet Regression MSE:", mse_best_elasticnet)
Best ElasticNet Regression MSE: 2855.158739341711
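For the available cross-validation strategies, see the scikit-learn user guide on cross-validation.
For GridSearchCV, see its scikit-learn API reference.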
Random search#
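In random search, hyperparameter candidates are not listed as fixed values; a distribution is supplied for each hyperparameter instead.
To demonstrate how such a search space is defined, this example switches to scikit-learn's gradient boosting.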
# Create Gradient Boosting model
gbm = GradientBoostingRegressor(random_state=42)
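Define the hyperparameter search space:
- For a real-valued hyperparameter, sample from scipy.stats.uniform. 
- For an integer-valued hyperparameter, sample from scipy.stats.randint. 
- For a hyperparameter that takes one of several fixed strings, pass a tuple of the options directly. 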
# determine the hyperparameter space
param_distributions = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(1, 5),
    'min_samples_split': uniform(0, 1),
    'criterion': ('friedman_mse', 'squared_error'),
}
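The bounds of these scipy distributions are easy to misread: randint(low, high) excludes high, and uniform(loc, scale) covers loc to loc + scale rather than loc to scale. A short sketch that draws samples to check both; the uniform(0.1, 0.4) range is a made-up example, not the one used in the search space above:

```python
from scipy.stats import randint, uniform

# randint(low, high) draws integers from low up to high - 1 (high is excluded)
n_estimators_draws = randint(10, 100).rvs(size=1000, random_state=0)
print(n_estimators_draws.min(), n_estimators_draws.max())

# uniform(loc, scale) draws floats from [loc, loc + scale], not [loc, scale]
split_draws = uniform(0.1, 0.4).rvs(size=1000, random_state=0)
print(split_draws.min(), split_draws.max())
```

With the search space above, n_estimators therefore ranges over 10 to 99 and max_depth over 1 to 4.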
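Create a RandomizedSearchCV object; the call is almost identical to GridSearchCV.
Only param_grid needs to be replaced with param_distributions, and n_iter sets how many combinations are sampled.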
# set up the search
random_search = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=param_distributions, 
    cv=5,
    n_iter=24,
    scoring='neg_mean_squared_error',
    refit=True,
    random_state=42
)
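The candidates of a random search can be previewed without training anything by using scikit-learn's ParameterSampler, which draws combinations from the same kind of dictionary. A minimal sketch, drawing 5 combinations instead of 24; the name sampled is illustrative and the string options are given as a list here:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(1, 5),
    'min_samples_split': uniform(0, 1),
    'criterion': ['friedman_mse', 'squared_error'],
}

# Draw 5 candidate combinations; random_state makes the draws repeatable
sampled = list(
    ParameterSampler(param_distributions, n_iter=5, random_state=42)
)
for params in sampled:
    print(params)
```

With n_iter=24 and cv=5, the search fits 24 * 5 = 120 models plus one refit, the same budget as the grid search above.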
# find best hyperparameters
random_search.fit(X_train, y_train)
RandomizedSearchCV(cv=5, estimator=GradientBoostingRegressor(random_state=42),
                   n_iter=24,
                   param_distributions={'criterion': ('friedman_mse',
                                                      'squared_error'),
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x13eb03fd0>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x13eb36ed0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x13eb364d0>},
                   random_state=42, scoring='neg_mean_squared_error')
# the best hyperparameters are stored in an attribute
random_search.best_params_
{'criterion': 'friedman_mse',
 'max_depth': 4,
 'min_samples_split': 0.9507143064099162,
 'n_estimators': 81}
results = pd.DataFrame(random_search.cv_results_)
results[['params', 'mean_test_score', 'std_test_score']]
| | params | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | {'criterion': 'friedman_mse', 'max_depth': 4, ... | -3270.284187 | 440.600753 | 
| 1 | {'criterion': 'friedman_mse', 'max_depth': 1, ... | -3288.517942 | 454.861697 | 
| 2 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -3421.653352 | 474.373010 | 
| 3 | {'criterion': 'friedman_mse', 'max_depth': 2, ... | -3523.229717 | 653.712201 | 
| 4 | {'criterion': 'squared_error', 'max_depth': 2,... | -3410.198901 | 541.731307 | 
| 5 | {'criterion': 'squared_error', 'max_depth': 1,... | -3313.203288 | 453.233960 | 
| 6 | {'criterion': 'squared_error', 'max_depth': 1,... | -3334.601253 | 450.517748 | 
| 7 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -3491.914117 | 590.298455 | 
| 8 | {'criterion': 'squared_error', 'max_depth': 3,... | -3446.320922 | 542.976666 | 
| 9 | {'criterion': 'squared_error', 'max_depth': 3,... | -3469.801150 | 543.574447 | 
| 10 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -3840.275830 | 566.768720 | 
| 11 | {'criterion': 'friedman_mse', 'max_depth': 1, ... | -4089.426637 | 630.516379 | 
| 12 | {'criterion': 'friedman_mse', 'max_depth': 4, ... | -3616.452964 | 489.227322 | 
| 13 | {'criterion': 'squared_error', 'max_depth': 1,... | -3311.315789 | 451.025908 | 
| 14 | {'criterion': 'friedman_mse', 'max_depth': 4, ... | -3421.129652 | 561.377783 | 
| 15 | {'criterion': 'friedman_mse', 'max_depth': 2, ... | -3365.814313 | 519.152129 | 
| 16 | {'criterion': 'squared_error', 'max_depth': 4,... | -3598.708097 | 441.823216 | 
| 17 | {'criterion': 'squared_error', 'max_depth': 2,... | -3382.186181 | 528.303504 | 
| 18 | {'criterion': 'squared_error', 'max_depth': 2,... | -3391.705158 | 541.156038 | 
| 19 | {'criterion': 'squared_error', 'max_depth': 4,... | -3423.325471 | 527.389894 | 
| 20 | {'criterion': 'friedman_mse', 'max_depth': 4, ... | -3415.351054 | 517.633160 | 
| 21 | {'criterion': 'squared_error', 'max_depth': 1,... | -3288.336950 | 450.299688 | 
| 22 | {'criterion': 'friedman_mse', 'max_depth': 1, ... | -3298.269411 | 461.042074 | 
| 23 | {'criterion': 'squared_error', 'max_depth': 4,... | -3404.593638 | 493.857562 | 
# Make predictions
y_pred_best_gbm = random_search.best_estimator_.predict(X_test)
# Calculate Mean Squared Error (MSE) on test set
mse_best_gbm = mean_squared_error(y_test, y_pred_best_gbm)
print("Best Gradient Boosting Regressor MSE:", mse_best_gbm)
Best Gradient Boosting Regressor MSE: 2761.774452312043
For the hyperparameters of scikit-learn's gradient boosting, see the GradientBoostingRegressor API reference.
Bayesian Optimization#
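Bayesian optimization is implemented here with the hyperopt package. Related packages include optuna, Scikit-Optimize, and others.
As with random search, what has to be built is a hyperparameter search space.
Here the distributions must be defined with hyperopt's own functions.
For details, see the official documentation.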
# Define the hyperparameter space
param_space = {
    'n_estimators': scope.int(hp.randint('n_estimators', 10, 101)),
    'max_depth': scope.int(hp.randint('max_depth', 1, 6)),
    'min_samples_split': hp.uniform('min_samples_split', 0, 1),
    'criterion': hp.choice('criterion', ['friedman_mse', 'squared_error'])
}
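hyperopt requires an objective function.
Its input is one specific hyperparameter combination, and its output is a metric of model performance.
Pay attention to the signs here:
cross_val_score treats a higher score as better, so "neg_mean_squared_error" must be used.
hyperopt, on the other hand, minimizes the metric, so the result has to be multiplied by -1 at the end.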
# Define objective function for hyperopt
def objective(params):
    model = GradientBoostingRegressor(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    return -np.mean(scores)
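The sign handling can be checked by calling the objective once by hand, without hyperopt. In the sketch below the hyperparameter values in example_params are hand-picked for illustration, and random_state=42 is an addition so that repeated calls return the same loss:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def objective(params):
    # random_state is an addition here so that repeated calls agree
    model = GradientBoostingRegressor(random_state=42, **params)
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring='neg_mean_squared_error'
    )
    # scores are negated MSE values; negate the mean to get a loss to minimize
    return -np.mean(scores)

# Hand-picked values, only for illustration
example_params = {
    'n_estimators': 50,
    'max_depth': 2,
    'min_samples_split': 0.5,
    'criterion': 'friedman_mse',
}
loss = objective(example_params)
print(loss)   # a positive cross-validated MSE
```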
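The following code runs hyperopt's hyperparameter search.
First, create a Trials object, which records the hyperparameters and the loss of every trial.
Then call fmin, which runs the search with the given settings and returns the best hyperparameter combination it found.
The arguments are:
- space is the hyperparameter search space. 
- algo is the search algorithm; tpe.suggest (Tree-structured Parzen Estimator) uses the results of earlier trials to propose the next combination to try. 
- max_evals is the number of trials. 
- trials takes the Trials object. 
- rstate sets the random generator so that the proposed combinations are reproducible. 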
# Perform Bayesian optimization
trials = Trials()
bayesian_best_params = fmin(
    objective,
    space=param_space,
    algo=tpe.suggest,
    max_evals=24,
    trials=trials,
    rstate=np.random.default_rng(42))
100%|█████████████████████████████████████████████████████████| 24/24 [00:02<00:00, 11.70trial/s, best loss: 3306.415696270635]
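As noted above, fmin returns a hyperparameter combination.
Note that 'criterion' has the value 1 here: for hp.choice, fmin returns the index into the list defined in the search space, so 1 refers to the element at index 1.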
# Print best hyperparameters
print("Best hyperparameters:", bayesian_best_params)
Best hyperparameters: {'criterion': 1, 'max_depth': 1, 'min_samples_split': 0.4671485658354062, 'n_estimators': 73}
Use hyperopt's space_eval function to convert the result back to the original values.
print(space_eval(param_space, bayesian_best_params))
{'criterion': 'squared_error', 'max_depth': 1, 'min_samples_split': 0.4671485658354062, 'n_estimators': 73}
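For this particular search space, the conversion amounts to an index lookup for the hp.choice entry. The plain-Python sketch below mimics that lookup with the values printed above; it is only an illustration of the idea, not how space_eval is implemented, and criterion_options and decoded are illustrative names:

```python
# Option list in the same order as passed to hp.choice in the search space
criterion_options = ['friedman_mse', 'squared_error']

# Values as returned by fmin in the output above
bayesian_best_params = {
    'criterion': 1,
    'max_depth': 1,
    'min_samples_split': 0.4671485658354062,
    'n_estimators': 73,
}

# Replace the index with the option it points to; other entries are unchanged
decoded = dict(bayesian_best_params)
decoded['criterion'] = criterion_options[bayesian_best_params['criterion']]
print(decoded)
```

In practice space_eval is the safer choice, since it also handles nested or conditional search spaces.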
Using the best hyperparameter combination, run the refit manually.
bayesian_search = GradientBoostingRegressor(**space_eval(param_space, bayesian_best_params))
# refit on the full training set with the best hyperparameters
bayesian_search.fit(X_train, y_train)
GradientBoostingRegressor(criterion='squared_error', max_depth=1,
                          min_samples_split=0.4671485658354062,
                          n_estimators=73)
# Make predictions
y_pred_best_bayesian = bayesian_search.predict(X_test)
# Calculate Mean Squared Error (MSE) on test set
mse_best_bayesian = mean_squared_error(y_test, y_pred_best_bayesian)
print("Best Gradient Boosting Regressor MSE:", mse_best_bayesian)
Best Gradient Boosting Regressor MSE: 2761.0626899606505
Conclusion#
Bayesian optimization spends extra computation on choosing each next trial. If each model trains quickly, random search may be the more efficient option.
 
    
  
  
