TrainingTestDegradation
Tests whether model performance degradation between the training and test datasets exceeds a predefined threshold.
Purpose: The `TrainingTestDegradation` test verifies that the drop in performance between the training and test datasets does not exceed a predefined threshold. It measures the model's ability to generalize from its training data to unseen test data, assessing key classification metrics such as accuracy, precision, recall, and F1 score to gauge the model's robustness and reliability.
Test Mechanism: The test applies several predefined metrics, including accuracy, precision, recall, and F1 score, to the model's predictions on both the training and test datasets. For each metric, it calculates degradation as the difference between the training score and the test score, divided by the training score. The test passes only if the degradation for every metric stays below the maximum threshold, which defaults to 10%. The results are summarized in a table showing each metric's train score, test score, degradation percentage, and pass/fail status.
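To make the mechanism concrete, the sketch below reimplements the degradation calculation with scikit-learn metrics. It illustrates the logic described above rather than the library's actual code: the function name and the binary-classification assumption are ours, and `max_threshold=0.10` mirrors the 10% default mentioned in the text.

```python
# Minimal sketch of the degradation check, assuming binary labels and
# nonzero training scores. Not the library's implementation.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def training_test_degradation(y_train, y_train_pred, y_test, y_test_pred,
                              max_threshold=0.10):
    """Return per-metric train/test scores, degradation, and pass/fail status."""
    metrics = {
        "accuracy": accuracy_score,
        "precision": precision_score,
        "recall": recall_score,
        "f1": f1_score,
    }
    results = []
    for name, fn in metrics.items():
        train_score = fn(y_train, y_train_pred)
        test_score = fn(y_test, y_test_pred)
        # Degradation: relative drop from the training score to the test score.
        degradation = (train_score - test_score) / train_score
        results.append({
            "Metric": name,
            "Train Score": train_score,
            "Test Score": test_score,
            "Degradation (%)": degradation * 100,
            "Pass/Fail": "Pass" if degradation < max_threshold else "Fail",
        })
    return results
```

For example, a training accuracy of 0.90 and a test accuracy of 0.80 gives a degradation of (0.90 − 0.80) / 0.90 ≈ 11.1%, which would fail against the default 10% threshold.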
Signs of High Risk:

- A degradation percentage that exceeds the maximum allowed threshold (10% by default) for any of the evaluated metrics.
- A large gap between a metric's scores on the training and test datasets.
- A "Fail" entry in the Pass/Fail column for any of the evaluated metrics.
Strengths:

- Provides a quantitative measure of the model's ability to generalize to unseen data, which is key to predicting its real-world performance.
- Evaluates multiple metrics, capturing different facets of model performance for a more holistic assessment.
- A configurable threshold allows the acceptance criteria to be adjusted for different scenarios.
Limitations:

- Compares raw performance on the training and test data but does not account for the nature of the data; regions underrepresented in the training set, for instance, may still perform poorly on unseen data.
- Requires good coverage and balance in the training and test datasets to produce reliable results, which may not always be available.
- Is currently designed only for classification tasks.