Ensemble Techniques: Combining Models for Superior Predictive Power

Machine learning is an iterative process. Whether it's adding new features to your data, filling missing values with different techniques, adding new data points, monitoring your predictions, or tuning hyperparameters, there is a constant search for ways to improve machine learning models.

Like many other machine learning tutorials, this article talks about how you can improve your machine learning model, but with a focus on ensembling. Ensembling can be described as combining different machine learning models in an attempt to achieve a new model or prediction pattern that combines the strengths of the base models and reduces their flaws. The "in an attempt" used in describing ensembling is very deliberate and will be explained later on.

In this article, I will discuss different methods of ensembling machine learning models, using a machine learning competition I participated in as a case study, and compare the results of each ensembling method.

Prerequisite

To understand what will be done in this article, you'll need to have:

  • Intermediate knowledge of Python.

  • Basic understanding of Numpy and Pandas.

  • Basic understanding of simple machine learning algorithms like Linear Regression and Decision Trees.

  • Jupyter Notebook installed on your system or a Google Colaboratory (Google Colab) account.

Dataset used

One of the first problems you'll face when trying to build a machine learning model is data. You'll mostly find yourself trying to answer the question "What data should I use to solve XYZ problem?".

For this article, you'll be using data from Kaggle's Regression with a Crab Age Dataset competition. The competition is aimed at predicting the age of a crab from its physical features, such as its weight, height and length. The evaluation metric for the competition is the Mean Absolute Error. Three CSV files were provided on the competition board: a train file (to build your model on), a test file (data to predict on) and a sample submission file (showing what your submission should look like). From the problem statement and evaluation metric, this is a regression problem.

Kaggle competitions typically have two scores: a public score (20% of the test data) that you can see during the competition, and a private score (80% of the test data) that is revealed after the competition ends. The final leaderboard is calculated using the private scores.

Libraries used

The libraries you'll need to have installed to run the code in this tutorial are: Numpy, Pandas, Scikit-learn, CatBoost, LightGBM and XGBoost. If some of these libraries look strange to you, don't worry at all; you'll see what they do later on.

Data cleaning

Now that you have data, the next step is to clean and prepare the dataset for model building!

  • Install the necessary libraries.

    If you use Google Colab (that's what I use), all the libraries except CatBoost are pre-installed. You can easily install CatBoost by running pip install catboost.

    If you use any other text editor or Python notebook, you can install all the libraries by running the line below in your terminal:

      pip install numpy pandas scikit-learn catboost lightgbm xgboost
    
  • Import the necessary libraries.

    For data cleaning, the basic libraries you'll need are Numpy and Pandas.

      import numpy as np
      import pandas as pd
    
  • Read the train and test files

      train=pd.read_csv("train.csv",index_col=0)
      test=pd.read_csv("test.csv",index_col=0)
      print(train.head())
      print(test.head())
    

    The first column of the train and test CSV files is the id, which, according to the competition, helps to identify the data points, so the index_col=0 parameter lets you set it (the 1st column, with index 0) as the index.

  • Check for missing values

    Missing values pose problems when building machine learning models. Most machine learning models can't handle data with missing values and will return an error when you try to build models on such data. Another problem they pose is loss of information: when values are missing, the quality of the data, and the information the predictive model can extract from it, is reduced.

      #check for missing values
      print(train.isna().sum())
      print(test.isna().sum())
    

    Thankfully, the data has no missing values, which makes things a lot easier: you don't have to worry about filling in missing values.

  • Separate the target column and drop it from the train dataframe

      target=train["Age"]
      train.drop("Age",axis=1,inplace=True)
    
  • Check for categorical columns

    Furthermore, most models can't handle non-numerical features, so it's better to spot them and convert them to numerical features to make things easier.

      #Check for categorical columns
      print(train.dtypes)
      print(test.dtypes)
    

    From the results, you can see that only the Sex column has a non-numerical datatype ("object").

    Check the unique values in the Sex column and convert them to numerical values.

      #Check the unique values in the Sex column
      print(train["Sex"].value_counts())
    
      #convert to numerical
      train["Sex"]=train["Sex"].map({"M":0,"I":1,"F":2})
      test["Sex"]=test["Sex"].map({"M":0,"I":1,"F":2})
    

    All the important data cleaning and handling required have been done, and the data has no missing values or non-numerical features.

    Note: You can still do a lot more to improve the data. For instance, you can create new features, scale the data, and more. However, since this tutorial is centered around ensembling machine learning models, we will pay less attention to that.
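
    As an optional illustration (not used in the rest of this tutorial), here is what scaling could look like with scikit-learn's StandardScaler. Note that the scaler is fitted on the train data only, to avoid leaking information from the test data:

      from sklearn.preprocessing import StandardScaler

      #fit the scaler on the train data only, then apply it to both
      scaler=StandardScaler()
      train_scaled=pd.DataFrame(scaler.fit_transform(train),columns=train.columns,index=train.index)
      test_scaled=pd.DataFrame(scaler.transform(test),columns=test.columns,index=test.index)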

Split data

To build your model, you'll need 3 different DataFrames: a train DataFrame (to train the model), a validation DataFrame (to check your model's performance) and a test DataFrame (data to predict on). Although the train and test data have already been given, there is no validation data. No worries though: you can easily get validation data by splitting the train data into two, X (which becomes the new train DataFrame) and val (the validation DataFrame). The split can be done using sklearn's train_test_split function.

The target column is also split into y (target for X dataframe ) and y_val (target for val dataframe).

#splitting the train data into [X and val]
from sklearn.model_selection import train_test_split
X,val,y,y_val=train_test_split(train,target,test_size=0.15,random_state=0)

Model Training

To train the models you'll be using 3 different regressors: CatBoostRegressor (from the catboost package), LGBMRegressor (from the lightgbm package) and XGBRegressor (from the xgboost package). These regressors are some of the best performing for tabular data and are based on an ensembling technique called boosting.

Boosting

Boosting algorithms are machine learning algorithms that start with weak learners (weak models) and iteratively improve the models by correcting mistakes made by the previous models. This iterative process helps the ensemble improve its predictive ability.

The final predictions of boosting algorithms are made by combining the predictions of all the learners using a weighted voting system: weak learners are assigned lower weights while stronger (more accurate) learners are assigned higher weights.

CatBoost, LightGBM and XGBoost are boosting algorithms based on decision trees, which means they are made up of multiple decision trees under the hood, although the iterative processes and the methods they use to improve the weak learners (individual decision trees) are quite different.
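
To make the iterative idea concrete, here is a minimal, illustrative sketch of boosting for regression using plain scikit-learn decision trees: each new tree is fitted to the residuals (the mistakes) of the ensemble built so far. This is a simplification, not the actual internals of CatBoost, LightGBM or XGBoost, and the function names (simple_boost, simple_boost_predict) are made up for this example.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_boost(X,y,n_rounds=50,learning_rate=0.1):
    #start from a constant prediction (the mean of the target)
    baseline=np.mean(y)
    current=np.full(len(y),baseline)
    trees=[]
    for _ in range(n_rounds):
        #the mistakes of the ensemble built so far
        residuals=y-current
        #fit a weak learner to correct those mistakes
        tree=DecisionTreeRegressor(max_depth=3,random_state=0)
        tree.fit(X,residuals)
        current=current+learning_rate*tree.predict(X)
        trees.append(tree)
    return baseline,trees

def simple_boost_predict(X,baseline,trees,learning_rate=0.1):
    #the final prediction is the baseline plus the weighted
    #contributions of all the weak learners
    pred=np.full(len(X),baseline)
    for tree in trees:
        pred=pred+learning_rate*tree.predict(X)
    return pred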

Building the models.

You'll build a CatBoostRegressor, LGBMRegressor and XGBRegressor and also create a submission file for each of the models.

CatBoost

  • Import the regressor

      #import the regressor
      from catboost import CatBoostRegressor
    
  • Create the model

      #create the model
      model_catboost=CatBoostRegressor(verbose=0,random_state=0)
    
  • Fit the model on the X,y data

      #fit the model on the X,y data
      model_catboost.fit(X,y)
    
  • Make predictions on the X,val and test dataframe.

      #predictions on the X,val and test dataframe
      pred_x_catboost=model_catboost.predict(X)
      pred_val_catboost=model_catboost.predict(val)
      pred_test_catboost=model_catboost.predict(test)
    
  • Check the mean_absolute_errors of the X and validation datasets.

      #import mean_absolute_error
      from sklearn.metrics import mean_absolute_error
    
      #check the mean absolute error on the validation and X dataframes
      print(mean_absolute_error(y_val,pred_val_catboost))
      print(mean_absolute_error(y,pred_x_catboost))
    
  • Create a submission file

      sub_catboost=pd.DataFrame({"id":test.index,"Age":pred_test_catboost}).set_index("id")
      sub_catboost.to_csv("submission_catboost.csv")
    

LightGBM

You just need to repeat what was done with the CatBoostRegressor, this time with the LGBMRegressor:

#import the regressor
from lightgbm import LGBMRegressor

model_lightgbm=LGBMRegressor(random_state=0)
model_lightgbm.fit(X,y)
pred_x_lightgbm=model_lightgbm.predict(X)
pred_val_lightgbm=model_lightgbm.predict(val)
pred_test_lightgbm=model_lightgbm.predict(test)
print(mean_absolute_error(y_val,pred_val_lightgbm))
print(mean_absolute_error(y,pred_x_lightgbm))
#creating a submission file
sub_lightgbm=pd.DataFrame({"id":test.index,"Age":pred_test_lightgbm}).set_index("id")
sub_lightgbm.to_csv("submission_lgb.csv")

XGBoost

#import the regressor
from xgboost import XGBRegressor

model_xgboost=XGBRegressor(random_state=0)
model_xgboost.fit(X,y)
pred_x_xgboost=model_xgboost.predict(X)
pred_val_xgboost=model_xgboost.predict(val)
pred_test_xgboost=model_xgboost.predict(test)
print(mean_absolute_error(y_val,pred_val_xgboost))
print(mean_absolute_error(y,pred_x_xgboost))
#creating a submission file
sub_xgboost=pd.DataFrame({"id":test.index,"Age":pred_test_xgboost}).set_index("id")
sub_xgboost.to_csv("submission_xgb.csv")

Voting

Voting is an ensembling technique that takes in different models and creates predictions by holding a vote among the predictions made by each model. Voting here means finding the mode of the predictions (for classification problems) or their average (for regression problems). The models can also be given weights; models with higher weights contribute more to the voting process and are considered more important.
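
As a quick illustration of the classification case, here is a small sketch of majority (mode) voting on some made-up class predictions:

import numpy as np
from collections import Counter

#hypothetical class predictions from three classifiers for five samples
preds=np.array([[0,1,1,0,2],
                [0,1,0,0,2],
                [1,1,1,0,1]])

#majority vote: the most frequent class per sample (column-wise mode)
majority=[int(Counter(column).most_common(1)[0][0]) for column in preds.T]
print(majority) #[0, 1, 1, 0, 2]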


For this problem, you can easily implement voting by averaging the predictions of the three models (the CatBoost, LightGBM and XGBoost regressors).

mean_predictions=(pred_test_catboost+pred_test_lightgbm+pred_test_xgboost)/3

#create a submission file
sub_mean=pd.DataFrame({"id":test.index,"Age":mean_predictions}).set_index("id")
sub_mean.to_csv("sub_mean.csv")

You can also implement voting by using scikit-learn's VotingRegressor (regression) or VotingClassifier (classification).

#import voting regressor
from sklearn.ensemble import VotingRegressor

#sub_models
estimators=[("catboost",model_catboost),
           ("lightgbm",model_lightgbm),
           ("xgboost",model_xgboost)]

#voting regressor
model_voting=VotingRegressor(estimators=estimators)
model_voting.fit(X,y)
pred_x_voting=model_voting.predict(X)
pred_val_voting=model_voting.predict(val)
pred_test_voting=model_voting.predict(test)
print(mean_absolute_error(y_val,pred_val_voting))
print(mean_absolute_error(y,pred_x_voting))

The VotingRegressor also has a weights parameter, an array of shape (n_regressors,), that weighs the individual models' predictions before averaging. If no weights are given, each model has the same weight and importance.
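
For example, a weighted variant could look like this (the weights below are arbitrary, chosen only for illustration; in practice you would tune them on the validation data):

#give the catboost model twice the influence of the other two
model_voting_weighted=VotingRegressor(estimators=estimators,weights=[2,1,1])
model_voting_weighted.fit(X,y)
print(mean_absolute_error(y_val,model_voting_weighted.predict(val)))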

Create a submission file:

sub_voting=pd.DataFrame({"id":test.index,"Age":pred_test_voting}).set_index("id")
sub_voting.to_csv("submission_voting.csv")

Stacking

Stacking is an ensembling technique that takes in different models (base models) and creates a new model (meta-model) that makes predictions using the output from the base models. Stacking tries to take voting a step further by using a machine learning model (instead of simple averaging done in voting) to learn how best to combine the base models' predictions to improve overall predictive performance.

Stacking tries to leverage the strengths of the base models and compensate for their weaknesses by combining the predictions from each model in a sophisticated manner, aiming to identify more complex patterns in the data.

Simple implementation of stacking

  • Build dataframes of the models' predictions

      X_predictions_dataframe=pd.DataFrame({"catboost":pred_x_catboost,
                                   "lightgbm":pred_x_lightgbm,
                                   "xgboost":pred_x_xgboost})
      test_predictions_dataframe=pd.DataFrame({"catboost":pred_test_catboost,
                                   "lightgbm":pred_test_lightgbm,
                                   "xgboost":pred_test_xgboost})
      val_predictions_dataframe=pd.DataFrame({"catboost":pred_val_catboost,
                                   "lightgbm":pred_val_lightgbm,
                                   "xgboost":pred_val_xgboost})
    
  • Build the meta-model

      #import linear regressor
      from sklearn.linear_model import LinearRegression
    
      final_model=LinearRegression()
      final_model.fit(X_predictions_dataframe,y)
      pred_x_final=final_model.predict(X_predictions_dataframe)
      pred_val_final=final_model.predict(val_predictions_dataframe)
      pred_test_final=final_model.predict(test_predictions_dataframe)
      print(mean_absolute_error(y_val,pred_val_final))
      print(mean_absolute_error(y,pred_x_final))
    
  • Create a submission file

      sub_final=pd.DataFrame({"id":test.index,"Age":pred_test_final}).set_index("id")
      sub_final.to_csv("submission_final.csv")
    

Sklearn's StackingRegressor

Sklearn has a StackingRegressor class that accepts a list of base estimators and a final estimator (meta-model to use on the base estimators).

from sklearn.ensemble import StackingRegressor

model_stack=StackingRegressor(estimators=estimators,final_estimator=final_model)
model_stack.fit(X,y)
pred_x_stack=model_stack.predict(X)
pred_val_stack=model_stack.predict(val)
pred_test_stack=model_stack.predict(test)
print(mean_absolute_error(y_val,pred_val_stack))
print(mean_absolute_error(y,pred_x_stack))

#creating a submission file
sub_stack=pd.DataFrame({"id":test.index,"Age":pred_test_stack}).set_index("id")
sub_stack.to_csv("submission_stack.csv")

Stacking is prone to overfitting (the meta-model can simply memorize the base models' predictions on the data they were trained on), so the StackingRegressor reduces this risk by using cross-validation: the final estimator is fitted on out-of-fold predictions from the base models.
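
The cross-validation is controlled by the cv parameter (the default is 5-fold). For example:

#use 10-fold cross-validation to generate the out-of-fold
#predictions that the final estimator is trained on
model_stack_cv=StackingRegressor(estimators=estimators,
                                 final_estimator=final_model,
                                 cv=10)
model_stack_cv.fit(X,y)
print(mean_absolute_error(y_val,model_stack_cv.predict(val)))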

Bagging

Bagging (short for Bootstrap Aggregation) is an ensemble method that improves predictive performance by training models on different subsamples of the data and combining their predictions to obtain a more rounded result.
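
The "bootstrap" part refers to how the subsamples are classically drawn: rows are sampled with replacement, so a row can appear more than once in a subsample. A minimal sketch of drawing one bootstrap sample:

#draw one bootstrap sample: row indices sampled with replacement
rng=np.random.default_rng(0)
bootstrap_index=rng.choice(len(train),size=len(train),replace=True)
train_bootstrap=train.iloc[bootstrap_index]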

When building models, you typically split the train data into two (X and validation); by training the model on X alone, you lose all the patterns that could have been learned from the validation data. With bagging, you can split your data into multiple folds, validate the model on one fold and build it on the others, and repeat this until every fold has been used for validation. So, if your data is split into 10 folds, you can build models on 10 subsamples of the data and combine their predictions.

Manual implementation

from sklearn.model_selection import KFold

#cross-validator to split the data into folds
folds=KFold(n_splits=8,shuffle=True,random_state=0)

#a dataframe to store the predictions made by each fold
predictions_df=pd.DataFrame()

#lists to save the mean absolute errors from validating on each fold
mae_val=[]
mae_X=[]

#a simple catboost regressor
model=CatBoostRegressor(verbose=0,random_state=0)

#train the model, make predictions and check the validation error on each fold
for i,(train_index,test_index) in enumerate(folds.split(train,target)):
    train_fold=train.iloc[train_index]
    val_fold=train.iloc[test_index]
    y_fold=target.iloc[train_index]
    y_val_fold=target.iloc[test_index]
    model.fit(train_fold,y_fold)
    print(i)
    prediction=model.predict(test)
    predictions_df[i]=prediction
    mae_val.append(mean_absolute_error(y_val_fold,model.predict(val_fold)))
    mae_X.append(mean_absolute_error(y_fold,model.predict(train_fold)))
print(mae_val)
print(mae_X)

Up next is to make predictions by combining the results from all the folds; this can be done by taking the mean.

predictions=predictions_df.mean(axis=1)
sub_kfold_mean=pd.DataFrame({"id":test.index,"Age":predictions}).set_index("id")
sub_kfold_mean.to_csv("submission_kfold_mean.csv")

Sklearn's BaggingRegressor

from sklearn.ensemble import BaggingRegressor

model_bagging=BaggingRegressor(estimator=CatBoostRegressor(random_state=0,verbose=0),
                               n_estimators=8,max_samples=0.8)
model_bagging.fit(train,target)

pred_test_bagging=model_bagging.predict(test)

sub_bagging=pd.DataFrame({"id":test.index,"Age":pred_test_bagging}).set_index("id")
sub_bagging.to_csv("submission_bagging.csv")

A good question you may have at this point is: what is the need for a manual implementation when the BaggingRegressor class exists? The answer is flexibility. You have a lot more freedom when you implement bagging manually; for example, you can take the median (or even the minimum or maximum) of the predictions instead of the mean if you want to reduce the effect of outliers.

predictions=predictions_df.median(axis=1)
sub_kfold_median=pd.DataFrame({"id":test.index,"Age":predictions}).set_index("id")
sub_kfold_median.to_csv("submission_kfold_median.csv")

Effects of ensembling on model performance

Submitting the submission files on the competition page gives each approach a public and a private score.

From the results, the ensembled models generally outperform the individual models (except for the stacking implementation by hand, which overfits). The mean of the bagging implementation by hand performs best in terms of private score.

What is the best ensembling method?

You never really know until you try: the method that gives the best performance on problem A might not give the best performance on problem B, so it's important to experiment first. You can use the performance on the validation data to get a closer view of how an ensembling method will perform on new data.

In some cases, not all ensembling approaches improve on the performance of the individual models; hence the "in an attempt" used in describing ensembling earlier. Ensembling won't always lead to an increase in performance, although it does much of the time.

The full source code can be found here.

Is ensembling worth it?

Ensembling generally comes with benefits like improved performance, reduced model bias and robustness to noisy data. In some cases, though, a single well-tuned model can achieve satisfactory performance without the added complexity of an ensemble.

Ensembling also has downsides, like the additional resources needed for training and prediction, the increase in computational cost, and the reduction in model explainability. So, to answer the question: it depends on your project. If you prioritize improvements in performance over model explainability, computational cost and resources, then ensembling is perfectly fine.

Conclusion

Four main methods of ensembling machine learning models were discussed in this article, but you are not limited to these methods, only by your creativity. Any creative way you find to combine the results of models into new predictions counts as ensembling.

Thank you!