Ensemble Models

Kavya Hariharan
5 min read · Sep 22, 2024


The biggest mistake I kept making was assuming that ensemble models simply mean joining two different existing machine learning models. Although that is correct on some level, it describes only one type of ensembling, known as stacking. It never occurred to me that the model we reach for in almost every problem statement, Random Forest, is an ensemble model itself. So let’s look at a simplified explanation of ensemble models.

As the name suggests, ensemble models combine multiple models to make predictions. The main aim of an ensemble is to achieve both low bias and low variance. Bias is the error that comes from a model being too simple to capture the underlying patterns in the data. Variance describes how much the model changes when the training data changes. Low bias indicates that a model understands the data properly, reducing the room for error, while low variance indicates that the model does not chase every single nuance of the training data and thereby overfit it.
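To make the trade-off concrete, here is a minimal sketch of my own (an assumed scikit-learn setup on synthetic data, not from the article or the lecture) comparing a high-bias model with a high-variance one:

```python
# A minimal sketch, assuming scikit-learn and a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# High bias: a depth-1 tree is too simple and underfits both train and test data.
shallow = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
# High variance: an unpruned tree memorises the training data and overfits.
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

print("shallow tree train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
print("deep tree    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
```

The shallow tree scores poorly on both splits, while the deep tree scores almost perfectly on training data but noticeably worse on test data; ensembles aim to land between these two extremes.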

Now for a cleaner definition: ensemble modelling is the combination of multiple base learners to reduce the generalisation error of predictions. These base learners are independent models that are combined during ensembling. While I was learning these concepts from a YouTube lecture by Taylor Sparks, he articulated the necessity of an ensemble model beautifully.

Take the example of a forest: a forest is made up of many different trees. A single tree is not going to do much good, but many trees together give you a whole forest. Each tree has different characteristics and properties; they all bring varied perspectives.

1. BAGGING

Bagging is the technique of combining multiple homogeneous weak learners to amplify the performance of the model. Bagging is also known as bootstrap aggregation. It is a data-centric technique in which multiple small subsets are drawn from the actual dataset, thereby creating more diverse predictive models. Bagging tames the stochastic variation across these training subsets by combining their models through a deterministic averaging process.

Extending the above forest analogy to bagging, also known as bootstrap aggregating, let us look at how exactly this works. The parent dataset is broken into smaller “bags”, or subsets of data. Multiple base learners, all of them weak learners, are trained on these subsets, one per bag, and each learner can also sample a different set of features. This training happens in parallel, which saves time, and because each learner is independent the process is efficient. An additional advantage is that the subsets are sampled with replacement, which is crucial in preventing the model from overfitting: if the data were sampled without replacement there would be no overlap between bags, making the ensemble almost completely sensitive to the particular training data each learner saw.

Once all these base learners are trained and have made their predictions for the target variable, the combined prediction is obtained either by averaging the results or by majority voting. The former is mainly used for regression problems and the latter for classification problems. Bootstrapping refers to the resampling of the dataset with replacement, and aggregation refers to the averaging or majority voting of the model predictions. Hence the technique is called bagging.
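Here is a rough, self-contained sketch of bagging done by hand (my own illustrative setup, not from the article): bootstrap-sample the training set, fit one independent weak learner per bag, and aggregate the predictions by majority vote.

```python
# Bagging by hand: bootstrap sampling (with replacement) + majority-vote aggregation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Sampling WITH replacement: every bag overlaps the others but is slightly different.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Aggregation: majority vote over the 25 trees (averaging would be used for regression).
votes = np.array([tree.predict(X_test) for tree in trees])  # shape (25, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)             # labels here are 0/1
print("Bagged accuracy:", (y_pred == y_test).mean())
```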

Random forest is an example of the bagging technique, as it comprises multiple decision trees that act as the base learners and give a combined prediction in a classification problem. All the weak learners combine to form one strong model.
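In practice scikit-learn already provides both of these; a short sketch of the equivalent built-in calls (again an assumed, illustrative setup):

```python
# The same idea with scikit-learn's built-in ensembles (illustrative setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generic bagging: any base estimator, with bootstrap sampling on by default.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# Random forest: bagged decision trees that additionally sample features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for model in (bag, forest):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```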

2. BOOSTING

Boosting is another method of combining a set of homogeneous weak learners. This technique uses a sequential learning process in which each model, or base learner, attempts to learn from the mistakes of the previous learner and tries to correct its errors.

This sequence of weak learners primarily helps reduce bias, although boosting is slower to train than bagging. It can be understood as combining the base learners through an adaptive, deterministic strategy. The combined model is prone to high variance because each learner tries to correct the previous one’s errors, which is why boosting is usually implemented with shallow trees.

AdaBoost is an example of a machine learning model that iteratively improves itself in this way. The final combined model is a weighted sum of all the weak learners. The weights decide how much attention is given to each learner, letting each one focus on what is required and assigning it a specific influence on the final prediction.

Figure: Boosting
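As a concrete sketch (a self-contained setup of my own, not from the article), scikit-learn’s AdaBoostClassifier wires this together with decision stumps as the shallow weak learners:

```python
# AdaBoost with decision stumps (depth-1 trees) as the sequential weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # shallow, high-bias weak learner
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)

# After fitting, ada.estimator_weights_ stores the weight given to each weak
# learner in the final weighted combination.
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```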

3. STACKING

The final ensemble technique that I want to write about is stacking. This is an ensemble method that combines two or more heterogeneous weak learners. In this method, the base learners are trained independently and in parallel. Stacking takes the information (prediction results) from multiple predictive models, also known as level 0 models, and combines it through a meta-model, or level 1 model.

Once the base models are trained on the available dataset, their predictions are used as inputs to the combining algorithm that makes the final prediction. We are essentially building a new training set for the meta-model (level 1) to make its predictions. Because the meta-model learns from the different predictions made by the different base models, it is typically fed data that was not used to train the base learners.

As shown in the figure below, the three models form level 0. For instance, it could be an ensemble of Random Forest, CatBoost and XGBoost classifiers.

Figure: Stacking
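Here is a sketch of the same structure with scikit-learn’s StackingClassifier. I swap the CatBoost/XGBoost level-0 models from the figure for built-in estimators so the example stays self-contained, but the wiring is the same.

```python
# Stacking: three level-0 models feeding a level-1 (meta) logistic regression.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

level_0 = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("svc", SVC(random_state=0)),
]

# cv=5 means the meta-model is trained on out-of-fold predictions, i.e. on
# predictions for data the level-0 models did not see during their own training.
stack = StackingClassifier(estimators=level_0, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
print("Stacked test accuracy:", stack.score(X_test, y_test))
```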

That is all one needs to know to understand the basic concepts and workings of ensemble modelling. Hope it helped :)
