Evaluating ML Models: Cross Validation, Confusion Matrix, Bias–Variance
Author: Advait Lonkar (@advait_l)
Machine learning is about making predictions and classifications. A central theme in building models that generalize is the bias–variance tradeoff.
- Estimate parameters for ML methods → train on training data (about 75% of data)
- Evaluate how well the methods work → test on held-out data (about 25% of data)
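A minimal sketch of that split, assuming scikit-learn and a toy dataset standing in for real features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset standing in for real features X and labels y (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the data for evaluation; train on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # (150, 5) (50, 5)
```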
See also: Cross Validation, Confusion Matrix, Bias, Variance
Cross Validation
- Start with a split (e.g., first 75% to train, last 25% to test)
- Then rotate which 25% is used as the test set so each block is tested once
- Keep track of the ML parameters and performance across folds
Common schemes:
- Four-fold cross validation (each fold is ~25%)
- Leave-one-out cross validation (LOOCV)
- Ten-fold cross validation (widely used in practice)
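A sketch of four-fold and leave-one-out cross validation with scikit-learn; the logistic regression model and the toy dataset are just placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)  # placeholder candidate method

# Four-fold CV: each ~25% block serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=4, shuffle=True, random_state=0))
print("4-fold accuracies:", scores, "mean:", scores.mean())

# Leave-one-out CV: one fold per sample, each holding out a single observation.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())
```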
Confusion Matrix
After splitting the data for cross validation:
- Train the candidate ML methods on the training split(s)
- Evaluate them on the test split(s) and summarize the results with a confusion matrix
Rows = model predictions, Columns = ground truth
| | True Yes | True No |
| --- | --- | --- |
| Predicted Yes | True Positives | False Positives |
| Predicted No | False Negatives | True Negatives |
- The diagonal values indicate correct classifications
- Confusion matrices from multiple ML methods can be compared
- More sophisticated metrics can then guide model selection
- For multi-class problems, the matrix expands with one row/column per class
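As an illustrative sketch (reusing the toy setup above): compute the confusion matrix on the test split. Note that scikit-learn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so its output is transposed relative to the table above; for the binary case, `ravel()` unpacks the four counts directly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn convention: rows = ground truth, columns = predictions,
# with labels ordered 0 then 1, so ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```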
Sensitivity and Specificity
Computed from the columns of the binary confusion matrix:
- Sensitivity (Recall/TPR) = true positives / (true positives + false negatives)
- Interpretation: of all actual positives, what percent were correctly identified?
- Specificity (TNR) = true negatives / (true negatives + false positives)
- Interpretation: of all actual negatives, what percent were correctly identified?
Depending on the application, one may prioritize sensitivity or specificity. For multi-class settings, compute class-specific sensitivity and specificity for each class.
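A small sketch of those two formulas, using hypothetical confusion-matrix counts:

```python
def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity (recall/TPR) and specificity (TNR) from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # of all actual positives, fraction correctly identified
    specificity = tn / (tn + fp)  # of all actual negatives, fraction correctly identified
    return sensitivity, specificity

# Hypothetical counts: 40 TP, 5 FP, 10 FN, 45 TN.
print(sensitivity_specificity(tp=40, fp=5, fn=10, tn=45))  # (0.8, 0.9)
```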
Bias and Variance
- Bias: systematic error from an algorithm’s inability to capture the true relationship
- Example: a straight line fit to a curved relationship has high bias
- Variance: how much the fit changes when trained on a different training set
  - Example: a highly flexible curve that passes through every training point fits the training data very well but can change dramatically from one training set to another, so it generalizes poorly
- Conceptually, the generalization error on held-out data decomposes into bias and variance contributions. In practice, we assess it with test error: compute the prediction residuals on the test set and aggregate them (e.g., as a sum of squared errors for regression)
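A minimal sketch of the idea, assuming a one-dimensional regression problem with a curved true relationship: fit a straight line (high bias) and a very flexible polynomial (high variance), then compare their training and test sums of squared errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 60)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)  # curved relationship + noise
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def sse(model):
    """Sum of squared prediction residuals on the train and test sets."""
    model.fit(X_train, y_train)
    return (((model.predict(X_train) - y_train) ** 2).sum(),
            ((model.predict(X_test) - y_test) ** 2).sum())

line = LinearRegression()  # straight line: high bias, low variance
wiggle = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())  # flexible: low bias, higher variance
print("straight line (train SSE, test SSE):", sse(line))
print("degree-12 poly (train SSE, test SSE):", sse(wiggle))
```

The flexible model typically posts a lower training error, but its test error reveals whether that flexibility actually generalizes.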