We are quite proud of our ability to develop performant, stable and trustworthy predictive models here at Principa. For nearly 20 years, we have been developing predictive models that have helped many of our clients make better decisions, more often than not outperforming what our best competitors can achieve. The models that we have historically developed fall into the additive family of models – that is, a handful of predictive characteristics are selected and classed (binned) in a way that best separates the ‘goods’ from the ‘bads’ (i.e. the traditional binary classification application). When the model is applied to new, unseen data, the weightings for the classes that apply to each case are added together to get a final score. For example, consider a 3-feature model that uses only Home Ownership, Years at Employer and Age. Let’s say you are a homeowner and for this you get 10 points, you have been with your employer for 5+ years (15 points), and you are 23 years of age (8 points); your final score is then 33, and the business strategy uses this score to decide where you should go in the decision tree.
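To make the additive scoring idea concrete, here is a minimal sketch in Python of how such a scorecard could be applied to a new applicant. The class boundaries and point values are illustrative only and are not taken from any production model.

```python
# Minimal, illustrative additive scorecard (bins and point values are made up for this example)
def score_applicant(applicant):
    points = 0

    # Home Ownership: flat points for owning a home
    if applicant["home_owner"]:
        points += 10

    # Years at Employer: more points for longer tenure
    years = applicant["years_at_employer"]
    if years >= 5:
        points += 15
    elif years >= 2:
        points += 9
    else:
        points += 3

    # Age: each class (bin) carries its own weighting
    age = applicant["age"]
    if age < 25:
        points += 8
    elif age < 40:
        points += 12
    else:
        points += 16

    return points


applicant = {"home_owner": True, "years_at_employer": 6, "age": 23}
print(score_applicant(applicant))  # 10 + 15 + 8 = 33
```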
We follow our own recipe, refined over the years, when developing these models: making sure that the output from the models will fulfil the business’s needs, carefully specifying what is required during the data-sourcing phase, and thoroughly understanding the underlying data before even thinking about developing the models, amongst many other things. In short, we work towards a “no surprises” delivery of the models into production for our clients. On the modelling side, let’s just say we have also learnt a few tricks along the way. Yes, our modelling approach can sometimes be compared to standard logistic regression, but our models are similar to logistic regression models only in their final structure. We use advanced techniques to give us close-to-optimal bins and weightings. We won’t go too deep into our technical approach here, but suffice it to say that our models (a) successfully separate the goods from the bads, (b) provide accurate point predictions that can be used with confidence in a business strategy, and (c) do not degrade rapidly over time, due to the inherent nature of additive models and how they are constructed.
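As a rough illustration of the general shape of such a model (binned characteristics whose fitted weights get added up, much like a logistic regression scorecard), here is a sketch using scikit-learn on synthetic data. The binning, the data and the library choice are assumptions made purely for this example and do not describe our proprietary approach to optimising bins and weightings.

```python
# Sketch: per-bin weightings fitted with logistic regression on one-hot encoded bins.
# Data, bin edges and library choice are illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "home_owner": rng.integers(0, 2, n),
    "years_at_employer": rng.integers(0, 20, n),
    "age": rng.integers(18, 70, n),
})
# Synthetic good/bad outcome (1 = good) just so the example runs end to end
prob_good = np.clip(0.4 + 0.3 * df["home_owner"] + 0.01 * df["years_at_employer"], 0, 1)
y = (rng.random(n) < prob_good).astype(int)

# Class (bin) each characteristic, then one-hot encode the bins
binned = pd.DataFrame({
    "home_owner": df["home_owner"].astype(str),
    "years_at_employer": pd.cut(df["years_at_employer"], [-1, 1, 4, 100]).astype(str),
    "age": pd.cut(df["age"], [0, 24, 39, 120]).astype(str),
})
encoder = OneHotEncoder()
X = encoder.fit_transform(binned)

# The fitted coefficients play the role of the per-bin weightings that get added up
model = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(encoder.get_feature_names_out(), model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```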
Enter the wave of machine learning.
It took us a while to realise that we have, in a sense, been developing machine learning models since the beginning of our time. We don’t brag about this, as we know it is not entirely true. We have been building statistical models that are retrained only when we see a degradation in performance (e.g. they are deployed in stable environments and only need to be redeveloped every year, or even every few years, really!). Some might consider these models to fall outside of machine learning. There are numerous debates about what machine learning is and what it is not – you can find a worthwhile debate here: https://stats.stackexchange.com/questions/158631/why-is-logistic-regression-called-a-machine-learning-algorithm.
Constructing machine learning models is generally regarded as a computer science challenge – i.e. the underlying model is more complex in nature, and the challenges faced relate more to computational efficiency (i.e. quicker processing of the complex trees or neural networks) than to nuances of the statistical approach. For example, the Gradient Boosted Machine (GBM), a well-established algorithm that frequently wins Kaggle.com competitions, is computationally expensive to construct, especially on large training sets. Depending on the size of the data you have to work with and your computing power, it can take hours to train a model, as there are often thousands of trees that need to be optimally constructed and tested across a range of tuning parameters. This is often the trade-off between statistical and machine learning approaches. Statistical models are quick to re-class, but they require an analyst to construct the model carefully and fine-tune the bins to ensure stability, avoid overfitting, and so on. On the other hand, the construction of machine learning models depends on parameters and hyperparameters that can quite dramatically affect the performance of the final model. Where the relationship between these settings and model performance is non-linear (as is often the case), it is not easy to find the optimal parameters, and the best-known method is to run multiple experiments using a grid search or other, more advanced approaches that are also computationally expensive (e.g. Bayesian techniques or genetic algorithms).
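To show why these experiments get expensive, here is a small sketch of a hyperparameter grid search using scikit-learn’s GradientBoostingClassifier and GridSearchCV on synthetic data. The library, the grid values and the dataset are assumptions made for this example, not a prescription.

```python
# Sketch of a grid search over GBM hyperparameters (grid values and data are illustrative)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 500, 1000],   # number of trees
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
}

# Every combination is trained and cross-validated, which is why this gets expensive fast:
# 3 x 3 x 3 = 27 combinations x 5 folds = 135 model fits.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 4))
```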
There have been great strides in optimising how quickly these models are constructed, often using C++ to do the heavy computational lifting. Going from the GBM library to XGBoost to LightGBM, for example, can reduce processing times by roughly a factor of five at each step: where a GBM library might take, say, 100 minutes to fit a new model, XGBoost takes around 20 minutes and LightGBM around 4. What this gives you is the ability to run more experiments in the time you have available, the same number of experiments in a shorter time, or somewhere in between. We now have a good track record of constructing machine learning models, with a few deployed into production as challengers to our incumbent statistical models. The performance is very similar (sometimes slightly in favour of the boosted algorithm), which is an excellent outcome for us, as the bar was set very high by our statistical models.
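For completeness, here is a minimal sketch of fitting a boosted model with LightGBM. The synthetic data and the parameter values are illustrative assumptions; on data this small, training takes seconds rather than minutes.

```python
# Minimal LightGBM example (synthetic data and parameter values are illustrative only)
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=50, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,       # LightGBM grows trees leaf-wise, controlled by num_leaves
    random_state=1,
)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop when validation AUC stops improving
)

pred = model.predict_proba(X_test)[:, 1]
print("Test AUC:", round(roc_auc_score(y_test, pred), 4))
```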
The real benefit of the machine learning approach is that it lends itself to efficient retraining of the models over time. This is particularly true if you have set up a machine learning pipeline that provides a data feedback loop, updating your training dataset as the outcomes are observed from ongoing campaigns. Efficient retraining means that the model’s performance does not degrade over time and that the model adjusts for possible fundamental shifts in the operating environment. There are rapid and ongoing advances in the tools available (often open-source) for efficiently constructing performant models without needing to employ and run a large team of data scientists. We still construct both statistical and machine learning models, with challenges around deployment often being the deciding factor. We are in a fortunate position to see how these two different approaches compare to each other, and it is going to be interesting to see how things unfold as the machine learning landscape changes and develops further.
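As a rough sketch of the kind of feedback loop described above, a retraining step might look like the following. The column names, threshold and retraining trigger are hypothetical and will differ from pipeline to pipeline; this is not a description of any particular production setup.

```python
# Hypothetical sketch of a retraining step driven by a data feedback loop.
# Column names, threshold and trigger logic are illustrative assumptions only.
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

FEATURES = ["home_owner", "years_at_employer", "age"]  # illustrative feature list
TARGET = "outcome"                                     # 1 = good, 0 = bad
AUC_DROP_THRESHOLD = 0.02                              # retrain if AUC drops by more than this


def update_and_maybe_retrain(history, new_outcomes, model, baseline_auc):
    """Append newly observed campaign outcomes and retrain if performance has degraded."""
    history = pd.concat([history, new_outcomes], ignore_index=True)

    # Check the current model against the most recent, previously unseen outcomes
    recent_auc = roc_auc_score(
        new_outcomes[TARGET],
        model.predict_proba(new_outcomes[FEATURES])[:, 1],
    )

    if baseline_auc - recent_auc > AUC_DROP_THRESHOLD:
        # Performance has slipped: refit on the full, updated training set
        model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
        model.fit(history[FEATURES], history[TARGET])
        baseline_auc = recent_auc  # in practice you would re-evaluate the new model on a holdout

    return history, model, baseline_auc
```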