Ryan Paul Lafler, M.Sc.

Balancing Model Interpretability with Model Complexity: A Suggested Framework

All models are problem specific: their design and implementation should reflect the kinds of questions an organization wants to answer. In general, data scientists should tell a story about their model-making process, beginning with interpretable models built on well-understood assumptions about the data and scaling up in complexity only when those interpretable models fail to adequately explain the response variable(s).

Is the organization's objective to test claims about the effect of predictor variables on the response variable? If so, the family of Generalized Linear Models (GLMs) makes specific assumptions about the distribution of the response variable, allowing statistical parametric tests (e.g., Z-tests, t-tests, ANOVA, deviance tests, likelihood-ratio tests, goodness-of-fit tests) to determine how strongly each predictor influences the response variable as explained by the model. Adding a predictor to a GLM is roughly equivalent to controlling for it in an experiment: its variability is accounted for by the model and held constant when examining the effects of other predictors.
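
As a minimal sketch of this idea, the Python snippet below fits a GLM with statsmodels and uses a deviance (likelihood-ratio) test to ask whether one predictor matters after controlling for the others. The dataset, file name, and column names (y, age, weight, treatment) are hypothetical, and the Poisson family is only an example; the family should match the actual response distribution.

```python
# Sketch: fit a GLM, then test one predictor's contribution with a
# likelihood-ratio (change-in-deviance) test. Data and columns are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("patients.csv")  # hypothetical dataset

# Full model: Poisson GLM (choose the family to match the response distribution)
full = smf.glm("y ~ age + weight + treatment", data=df,
               family=sm.families.Poisson()).fit()

# Reduced model without 'treatment' -- adding the predictor to the full model
# is akin to controlling for it; the deviance test asks whether it matters.
reduced = smf.glm("y ~ age + weight", data=df,
                  family=sm.families.Poisson()).fit()

# Likelihood-ratio statistic on the change in deviance; degrees of freedom
# equal the number of parameters dropped (assumed to be 1 here).
lr_stat = reduced.deviance - full.deviance
p_value = stats.chi2.sf(lr_stat, df=1)

print(full.summary())          # Wald z-tests for each coefficient
print(f"LR statistic = {lr_stat:.3f}, p = {p_value:.4f}")
```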


Predictor terms enter a GLM additively, which gives them their simple interpretation. Predictors can be modeled as main effects, multiplied together as interaction terms that capture how the effect of one predictor changes with the level of another (e.g., the effect of a patient’s weight on the response may depend on their blood pressure), or included as higher-order terms that better fit the response variable (e.g., entering age as a quadratic so the fitted effect rises, peaks at a certain value, and declines from that peak). What if the team wants to predict more than one response variable? Depending on the strength of the linear associations between pairs of response variables (assessed visually with scatterplots and quantitatively with correlograms), multivariate multiple regression can factor in these dependent-variable correlations, provided the model’s assumptions are satisfied (e.g., the dependent variables are multivariate normally distributed, the residuals are random, etc.).
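
A short sketch of these formula terms, again with hypothetical column names: the statsmodels formula interface expands an interaction and a quadratic term automatically, and scikit-learn's LinearRegression is used here only to illustrate fitting several responses at once (it estimates per-response coefficients but does not provide the multivariate parametric tests described above).

```python
# Sketch: interaction and higher-order terms via the statsmodels formula API.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

df = pd.read_csv("patients.csv")  # hypothetical dataset

# weight * blood_pressure expands to both main effects plus their interaction;
# I(age**2) adds a quadratic effect of age so the fitted curve can rise,
# peak, and fall.
model = smf.ols("outcome ~ weight * blood_pressure + I(age**2) + age",
                data=df).fit()
print(model.summary())

# Fitting two (hypothetical) correlated responses in one call.
X = df[["weight", "blood_pressure", "age"]]
Y = df[["outcome_1", "outcome_2"]]
multi = LinearRegression().fit(X, Y)
print(multi.coef_.shape)   # (2 responses, 3 predictors)
```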


If predictive accuracy is the team’s objective, machine learning algorithms that make fewer assumptions about the data than their parametric statistical counterparts may prove advantageous. In general, the model-making process should start simple, scaling up in complexity only if the model fits the data inadequately—referred to as underfitting. Perhaps the regression assumptions are violated, the underlying response distribution is unknown and cannot be inferred, the data is unstructured and cannot be easily or efficiently processed for statistical models to learn from, or the regression model simply isn’t capturing the relationship between the response and predictor variables. Where should data scientists turn next? Supervised learning algorithms, which train on a labeled dataset to learn important features and map them to non-linear functions for prediction. Examples include random forests (ensembles of decision trees), support vector machines (SVMs), and gradient boosting, which combines weak learners into a strong one.
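
The sketch below compares those three supervised learners with cross-validated accuracy using scikit-learn. The breast-cancer dataset is only a stand-in for whatever labeled training set the team has prepared, and the hyperparameters shown are illustrative defaults, not tuned values.

```python
# Sketch: cross-validated comparison of three supervised learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Stand-in labeled dataset; replace with the team's own features and labels.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```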


Neural networks, a major component of deep learning, also provide an advantageous way of accepting both structured and unstructured data, mapping it to non-linear functions across several layers of connected neurons, and producing single or multiple outputs (capable of returning regression predictions and classifications simultaneously from one network) that minimize a target loss function. Using gradient descent, the network updates its neurons’ weights to minimize that loss on the training data. Neural networks are prone to overfitting when trained on the same data for too many cycles (epochs) or when too many layers or neurons (trainable parameters) are specified. Techniques like weight regularization, dropout, and data augmentation reduce overfitting and help the network generalize to unseen data for accurate predictions. Interpreting neural networks and the connections they learn between layers is more difficult than for other machine learning algorithms, so data scientists rely on validation metrics during training to check whether over- or under-fitting has occurred. Still, given their versatility in accepting and outputting unstructured data such as images, text, and video in addition to traditional structured data (e.g., flat files), neural networks are an exceptional method for modeling a wide variety of big data through customizable network architectures.
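
As a final sketch, the Keras network below combines the regularization ideas mentioned above: L2 weight regularization, dropout, and early stopping on a validation split, with validation metrics tracked each epoch. The data shapes and random inputs are placeholders for a real training set, and the layer sizes are illustrative rather than recommended.

```python
# Sketch: a small Keras network with L2 regularization, dropout, and
# early stopping to guard against overfitting. Data here is a placeholder.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Stand-in structured training set: 1000 rows, 20 features, binary label.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # weight penalty
    layers.Dropout(0.3),                    # randomly silence 30% of neurons
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Validation metrics are tracked each epoch; training halts once validation
# loss stops improving, keeping the best weights seen so far.
stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
model.fit(X, y, epochs=100, batch_size=32,
          validation_split=0.2, callbacks=[stop], verbose=0)
```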

