Hyperparameter Tuning for Supervised Learning

  • Intro
  • Design
  • Hyperparameter Tuning Introduction
  • Loading the Dataset
  • Tuning Classification Model
  • Tuning Regression Model
  • Conclusion

Information

Primary software used Python
Software version 1.0
Course Hyperparameter Tuning for Supervised Learning
Primary subject AI & ML
Secondary subject General
Level Intermediate
Last updated December 19, 2024

Intro

In this tutorial, we are predicting how satisfied people are with their house. To do so, we will use random forest classification.

Design

Overview

To discuss and understand hyperparameter tuning, we will use the cleaned version of the following data set:

Hyperparameter Tuning Introduction

Importance of hyperparameter selection

Hyperparameters control how a model is trained, and choosing them well can make the difference between an underfit, an overfit, and a well-balanced model. For the Random Forest algorithm, we will work with the following hyperparameters:

  • max_depth 
  • max_features 
  • n_estimators 
  • random_state

Random State

We will also assign a random state so that we get the same results every time we run the code. Much machine learning code uses a random integer to initialize the algorithm; by passing the same integer to the random_state argument of the different functions, we get identical results on every run. Defining the value once in a variable lets you change it in a single place. Set the random_int value here: any integer gives the same results every time you run the rest of the notebook, while setting random_int to None gives different results on every run.

Setting the random_int variable
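
The notebook cell itself is shown as an image; a minimal sketch of what it contains, with 42 as an arbitrary example value:

  # Define the random state once so it can be changed in a single place.
  # Any integer gives reproducible results; None gives different results
  # on every run.
  random_int = 42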

Random Forest / Decision Trees -> Number of estimators

The Random Forest algorithm trains a certain number of Decision Trees and uses the most common output among the trees as the predicted value for a data point. First, we will train a model using the default settings. It is important to keep in mind that this model will likely perform best on the training data, because the default value of the maximum depth parameter is None. When the maximum depth is None, the trees continue to split until all leaves are pure, which leads to overfitting. So, after seeing the performance with the default parameters, we will tune certain hyperparameters, including the maximum depth and the maximum features. We will use 200 for the number of estimators and default settings for the other hyperparameters.

Loading the Dataset

Loading the dataset

In this case, we will load the clean data set from a CSV file into a Pandas DataFrame. If you need additional information about loading a dataset, review the Clean Data Set tutorial, which provides more details.

Remember to update the folder location highlighted in blue.

A view of the Jupyter Notebook showing the code that loads the data set.
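
A sketch of the loading step, assuming the cleaned file is called cleaned_dataset.csv (the folder and file names are placeholders; use your own location):

  import pandas as pd

  # Update this folder to your own location (highlighted in blue in the
  # screenshot above); the file name is a placeholder.
  folder = 'path/to/your/folder/'
  df = pd.read_csv(folder + 'cleaned_dataset.csv')

  df.head()  # inspect the first rows to confirm the data loaded correctly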

Now that we have loaded a clean data set, we can train a model with it. To get a high-performing machine learning model, we need to adjust the hyperparameters; the procedure is similar for the classification and regression models. The following sections walk through those steps.

Tuning Classification Model

Overview

When adjusting hyperparameters, we need some way to evaluate model performance. A classification model is evaluated by its accuracy: the fraction of predictions that match the true classification value.

Extracting the Dataset into Input and Output

In this training data set, there are several factors that we can use as either an input or an output. In this case, we will predict ‘twoning’, a survey response recording the respondent’s satisfaction with their current home. The response is one of 5 classes (1 to 5), with 1 being very satisfied and 5 being very dissatisfied. We want to determine which features are most important in predicting people’s satisfaction with their home. This means that we extract this column as our y value, the value to be predicted. The remaining columns become the X values, the features.

Splitting the data set into the X values (features) and y values (actual classification).
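
A sketch of this step, assuming the cleaned DataFrame is called df and the target column is named 'twoning':

  # y: the class we want to predict (satisfaction rating, 1 to 5)
  y = df['twoning']

  # X: every remaining column is used as a feature
  X = df.drop(columns=['twoning'])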

Train/Test Split

Random forest does not require normalizing data, but we must split it into training and test sets.

Splitting the data set into training and testing sets.
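
A sketch of the split; the 70/30 train/test ratio is an assumption, and random_int is the variable defined earlier:

  from sklearn.model_selection import train_test_split

  # Hold out part of the data for testing; random_int keeps the split
  # reproducible across runs.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=random_int)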

Tuning Hyperparameters of a Random Forest Model

First, we will train a baseline classification model to compare against. Then we will tune certain hyperparameters, including the maximum depth and the maximum features. We will use 200 for the number of estimators and default settings for the other hyperparameters. Remember that this model with the default settings will likely perform best on the training data, because the default value for the maximum depth is None, which means the model continues to split until all leaves are pure. This leads to overfitting.

We start by training a baseline algorithm with the following hyperparameter settings:

  • n_estimators=200
  • random_state=random_int
Training a model with default settings to see the results.
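
A sketch of the baseline cell, assuming scikit-learn's RandomForestClassifier:

  from sklearn.ensemble import RandomForestClassifier

  # Baseline: 200 trees, every other hyperparameter at its default value.
  baseline = RandomForestClassifier(n_estimators=200, random_state=random_int)
  baseline.fit(X_train, y_train)

  print('train accuracy:', baseline.score(X_train, y_train))
  print('test accuracy: ', baseline.score(X_test, y_test))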

Next, we will find the optimal max_depth value by testing different max_depth values and plotting the accuracy.

Image of the graph showing the best max_depth value to choose.
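
A sketch of the sweep behind this plot; the tested range of depths (1 to 20) is an assumption:

  import matplotlib.pyplot as plt
  from sklearn.ensemble import RandomForestClassifier

  depths = range(1, 21)
  train_acc, test_acc = [], []
  for depth in depths:
      model = RandomForestClassifier(n_estimators=200, max_depth=depth,
                                     random_state=random_int)
      model.fit(X_train, y_train)
      train_acc.append(model.score(X_train, y_train))
      test_acc.append(model.score(X_test, y_test))

  plt.plot(depths, train_acc, label='training accuracy')
  plt.plot(depths, test_acc, label='test accuracy')
  plt.xlabel('max_depth')
  plt.ylabel('accuracy')
  plt.legend()
  plt.show()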

From the plot above, we can see that setting the maximum depth to 5 or 6 is a good option. We want to select a value at the point where the training curve flattens slightly before its accuracy increases steeply again, and where the curve for the test data is fairly flat.

Next, we will find the optimal max_features value by testing different values and plotting the accuracy.

Image of the graph showing the best max_features value to choose.
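
The same kind of sweep, now over max_features; fixing max_depth at the value chosen above is an assumption about the notebook's order of steps:

  import numpy as np

  fractions = np.arange(0.05, 1.01, 0.05)  # fraction of features per split
  train_acc, test_acc = [], []
  for frac in fractions:
      model = RandomForestClassifier(n_estimators=200, max_depth=6,
                                     max_features=frac,
                                     random_state=random_int)
      model.fit(X_train, y_train)
      train_acc.append(model.score(X_train, y_train))
      test_acc.append(model.score(X_test, y_test))

  plt.plot(fractions, train_acc, label='training accuracy')
  plt.plot(fractions, test_acc, label='test accuracy')
  plt.xlabel('max_features')
  plt.ylabel('accuracy')
  plt.legend()
  plt.show()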

From the graph we can see that the maximum features value does not have a huge impact on the accuracy. The curves look fairly flat because the interval between marks on the y-axis is only 0.01, but there is still a slight difference. A good value for the max_features hyperparameter is around 0.45, because the accuracy on the test set flattens out around this value.

Based on this process, we determine that our model can reach an accuracy of approximately 64.67% with the following hyperparameters. 

  • max_depth=6
  • max_features=0.45
  • n_estimators=200
  • random_state=random_int

Now we can use this model to predict twoning, the rating residents gave for their satisfaction with their current home.

The model with the final hyperparameters.
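
A sketch of the final classification model:

  final_clf = RandomForestClassifier(n_estimators=200, max_depth=6,
                                     max_features=0.45,
                                     random_state=random_int)
  final_clf.fit(X_train, y_train)

  # Roughly 0.65 (64.67%) on the test set, per the results above.
  print('test accuracy:', final_clf.score(X_test, y_test))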

Tuning Regression Model

Overview

In this section we will complete the same steps using a regression model.

In a classification model, we could compare the actual class to the predicted class to get the model accuracy. A regression model is instead evaluated by the R-squared score, which measures how far the actual data points are from the regression function. An R-squared value of 1 indicates that the model explains all of the variability in the data set. A higher R-squared value suggests a better fit to the data, but it does not necessarily mean a better-performing model.

The mean absolute error (MAE) measures the error between the predicted values and the true values: it sums the absolute errors and divides the total by the sample size. It is another option for assessing regression models. A lower value (close to 0.0) indicates a more successful model.
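
Both metrics are available in scikit-learn; a minimal illustration with made-up numbers (not from the data set):

  from sklearn.metrics import r2_score, mean_absolute_error

  y_true = [3.0, -0.5, 2.0, 7.0]  # illustrative values only
  y_pred = [2.5,  0.0, 2.0, 8.0]

  print('R2: ', r2_score(y_true, y_pred))             # closer to 1 is better
  print('MAE:', mean_absolute_error(y_true, y_pred))  # closer to 0 is better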

Extracting the Dataset into Input and Output

In this training data set, there are several factors that we can use as either an input or an output. For a regression model, we are looking for a parameter with a continuous value. In this case we will look at the parameter “percwelvaart”, a prosperity indicator on a scale from 0 to 100. We can print the y variable to see the values.

Splitting the data set into the X values (features) and y values (actual classification) and printing the y values to visualize them.
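
A sketch of this step, again assuming the DataFrame df; whether other columns (such as the classification target) should also be dropped from X depends on the data set:

  # y: the prosperity indicator we want to predict (scale 0 to 100)
  y = df['percwelvaart']

  # X: the remaining columns are used as features
  X = df.drop(columns=['percwelvaart'])

  print(y)  # inspect the continuous target values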

Train/Test Split

Random forest does not require normalizing data, but we must split it into training and test sets.

Splitting the data set into training and testing sets.

Tuning Hyperparameters of a Random Forest Model

Again, we will tune the hyperparameters, but first train a model with the default settings. Remember that this model with the default settings will likely perform best on the training data, because the default value for the maximum depth is None, which means the model continues to split until all leaves are pure. This leads to overfitting.

We start by training a baseline algorithm with the following hyperparameter settings:

  • n_estimators=200
  • random_state=random_int
Training a model with default settings to see the results.
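
A sketch of the baseline regression cell, assuming scikit-learn's RandomForestRegressor and the R-squared and MAE metrics discussed above:

  from sklearn.ensemble import RandomForestRegressor
  from sklearn.metrics import r2_score, mean_absolute_error

  baseline_reg = RandomForestRegressor(n_estimators=200,
                                       random_state=random_int)
  baseline_reg.fit(X_train, y_train)

  pred = baseline_reg.predict(X_test)
  print('R2: ', r2_score(y_test, pred))
  print('MAE:', mean_absolute_error(y_test, pred))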

Next, we will find the optimal max_depth value by testing different max_depth values and plotting the R-squared score.

Image of the graph showing the best max_depth value to choose.
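
The sweep mirrors the classification one; a condensed sketch (the candidate range is again an assumption):

  import matplotlib.pyplot as plt
  from sklearn.ensemble import RandomForestRegressor

  depths = range(1, 21)
  scores = []
  for depth in depths:
      model = RandomForestRegressor(n_estimators=200, max_depth=depth,
                                    random_state=random_int)
      model.fit(X_train, y_train)
      scores.append(model.score(X_test, y_test))  # R-squared on the test set

  plt.plot(depths, scores)
  plt.xlabel('max_depth')
  plt.ylabel('R-squared (test)')
  plt.show()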

From the above noted plot, we can see that setting the maximum depth at 3 is a good option. We want to select a number that is at a point right before the curve flattens out. 

Next, we will find the optimal max_features value by testing different values and plotting the accuracy.

Image of the graph showing the best max_features value to choose.

From the graph we can see that the maximum features value should be between 0.6 and 0.8, right before the curve flattens.

Based on this process, we determine that our model reaches its best performance with the following hyperparameters.

  • max_depth=3
  • max_features=0.8
  • n_estimators=200
  • random_state=random_int

Now we can use this model to predict the percwelvaart (prosperity indicator).

The model with the final hyperparameters.
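
A sketch of the final regression model:

  from sklearn.ensemble import RandomForestRegressor

  final_reg = RandomForestRegressor(n_estimators=200, max_depth=3,
                                    max_features=0.8,
                                    random_state=random_int)
  final_reg.fit(X_train, y_train)

  print('R2 on test set:', final_reg.score(X_test, y_test))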

Conclusion

This tutorial demonstrated some techniques for selecting the right hyperparameter settings when training supervised learning models. We used Random Forest as an example, but you can apply the same strategies to other models too.

From the regression model, we can see that training a regression model to predict the prosperity indicator leads to a reliable model. This shows us that there is a strong correlation between the factors in the data set and the prosperity rating. By narrowing down specific features, we can also see which features have the highest impact on the prosperity indicator. This additional step is not completed in this notebook.