Machine learning model selection is the second step of the machine learning process, following variable selection and data cleansing. Selecting the right machine learning model is a critical step, as a model which does not appropriately fit the data will yield inaccurate results. Model selection largely depends on the goal of the model – is the purpose to explore the relationship between the variables or to maximize predictive power? In this blog, we cover a few key concepts of machine learning model selection, including parametic vs. non-parametic models, key metrics for managing the variance-bias tradeoff, and an introduction to a few standard machine learning models.
Parametric vs. Non-Parametric TradeoffsOne of the first choices to be made in the model selection process pertains to our assumption about the shape of the functional relationship between our explanatory variables (our given, or input, variables) and our response variable (the output that we want to predict). When we choose to assume the shape of our model, we are constructing a parametric model, and our problem reduces to estimating a set of measurable factors, known as parameters.1 One of the most common assumptions is that the data is linear. While we can relax the linear assumption when necessary, we sometimes do not want to assume the shape of the function at all. Non-parametric models help to avoid the case where we incorrectly assume a function that does not match the data. However, a much larger number of observations must be obtained to make non-parametric methods effective, which can be costly or even infeasible.2
In addition to the fact that non-parametric methods are often not practical, there are other tradeoffs to take into consideration. One important tradeoff is between interpretability and flexibility. Since non-parametric models follow the data closely, they often result in abnormally shaped plots, which can be difficult to interpret. If the goal is to make sense of and model the relationship between the explanatory variable and the response, we may be willing to trade some predictive power for a parametric curve that is more understandable. If, however, we are comfortable constructing a “black-box” in hopes of maximizing the predictive power of the model, then non-parametric models may be suitable.
Another important tradeoff is that of variance versus bias . Variance, in the context of statistical learning, refers to the amount by which our prediction would change if we had used a different training dataset for our estimation. Bias refers to the error resulting from approximating a complex relationship by using a simplified representation of it. In general, more flexible (non-parametric) methods tend to have higher variance and lower bias, with the opposite being true of less flexible (parametric) models. Ideally though, we want a model that has low variance and low bias. To find it, we most frequently rely on three important tools: R-squared, residual standard error, and diagnostic plots.
R-Squared, Residual Standard Error, and PlotsR-squared—formally, the “coefficient of determination”—measures the amount of variance in the response variable that is explained by the explanatory variables. Constrained between 0 and 1, a very low R-squared can indicate problems with model fit, while a very high R-squared can sometimes indicate overfitting. Residual standard error (RSE) estimates variance in the data. RSE depends on the residual sum of squares—the variation in the data left unexplained after the regression has been run—the number of observations, and the number of explanatory variables.
Graphical plots complement R-squared and RSE. Plots can be as simple as plotting the response variable against a single explanatory variable or against a fitted linear model. This can be useful for detecting non-linearity, but other plots have broader application.
One such plot is the residual plot, which plots the residuals—the difference between the true response variables and the fitted values—and the fitted values themselves. Patterns in residual plots can suggest a lack of model fit, perhaps due to non-constant variance or non-linearity in the data. Outliers and leverage points3 can also be detected through standardized residual, Normal QQ plots, and leverage point/Cook’s distance plots.
Observing these diagnostic plots enables us to make decisions as to what functional form our variables should take. For instance, by taking a logarithmic function (a curved function) of our response variable, we can help to account for non-constant variance in our model, or a non-linear relationship with the explanatory variables. We can also relax the additive assumption in a linear model by adding multiplicative combinations of variables—a technique that helps to model a synergistic relationship between variables.
Machine Learning Models: Shrinkage Methods, Splines, and Decision TreesOur goal is to determine the model with the highest probability of having realistically generated the data, and we have summarized above the most important metrics that can help us identify such a model. However, it is also important to be aware of several standard models—to know ahead of time which are likely to be most useful.
Shrinkage methods are an alternative to the standard linear model and most notably include ridge and lasso regressions. While these models are similar to ordinary least squares, they include a shrinkage “penalty” which shrinks the coefficients, as an increasing function of their magnitude, toward zero. Through adding this constraint, the model can offer a sizeable reduction in variance in exchange for a slight increase in bias. A tuning parameter—a coefficient on this penalty—can help us fine-tune the amount of variance we want to eliminate, as well as bias we are willing to accept.4
If we are looking for a model with more flexibility and predictive power, splines may be an avenue to explore. Splines introduce several “knots” into the model, creating a smooth, continuous line with many different slopes. Unsurprisingly, since splines are much more flexible than linear regression or shrinkage methods, they have a lower bias due to following the data more closely. They also do a better job than polynomial regressions, as they provide more consistent estimates.5
A third option is decision trees, which provide more flexibility, but are also highly interpretable due to the way they segment the problem into a hierarchical structure. The idea is to segment the set of possible values for the random variables into a distinct number of regions and make the same prediction for each observation in a particular region. This is generally done using an algorithm to select the most meaningful way to segment the observations, then the next most, and so on. Once this iterative algorithm is complete, we are left with what is usually a complex, hierarchical tree-like structure that can be readily mapped into a highly intuitive visualization. Decision trees can be very useful for their interpretability, ability to model non-linear data, and arguably more realistic approach to modeling human decision-making.
Application to Finance and Mortgage DataWe can use machine learning to answer a wide variety of questions related to finance and mortgage data, but it is crucial to understand the model selection process. Strong domain knowledge can help considerably in knowing what assumptions would be plausible, but a knowledge of diagnostic metrics, as well as the different types of models, their strengths, and weaknesses, can help unlock insights and uncover the logic behind processes—especially when answering questions that have yet to be answered. Whether your goal is to identify which customers are most likely to default on a loan, determine the elasticity of demand for a certain type of loan, or cut out some of the noise in the data, a solid grounding in approaches to model selection can help significantly.
 Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, Introduction to Statistical Learning (New York: Springer, 2013), 21-22.
 James, Witten, Hastie, and Tibshirani, 23.
 Outliers are Y values that are unusual given the explanatory variables. Leverage points are X values that are surprising given the response variables.
 James, Witten, Hastie, and Tibshirani, 218.
 James, Witten, Hastie, and Tibshirani, 276.