ClusterPerformanceMetrics
Evaluates the performance of clustering machine learning models using multiple established metrics.
Purpose:
The ClusterPerformanceMetrics test assesses the performance and validity of clustering machine learning models. It reports six scores: homogeneity, completeness, V-measure, the Adjusted Rand Index (ARI), the Adjusted Mutual Information (AMI), and the Fowlkes-Mallows score. Together, these scores show how closely the clusters formed by the model match the known class labels of the dataset.
Test Mechanism:
The ClusterPerformanceMetrics test runs a clustering ML model over a given dataset and then calculates six metrics using the scikit-learn metric functions: Homogeneity Score, Completeness Score, V Measure, Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), and Fowlkes-Mallows Score. It returns the results as a summary that presents each metric value for both the training and testing datasets.
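The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the test's actual implementation: the helper names `cluster_metrics` and `summarize`, the use of `KMeans`, and the synthetic `make_blobs` data are all assumptions made for the example. Only the six scikit-learn metric functions are taken from the description.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    completeness_score,
    fowlkes_mallows_score,
    homogeneity_score,
    v_measure_score,
)
from sklearn.model_selection import train_test_split

# The six metrics named in the test description, keyed by display name.
METRICS = {
    "Homogeneity Score": homogeneity_score,
    "Completeness Score": completeness_score,
    "V Measure": v_measure_score,
    "Adjusted Rand Index": adjusted_rand_score,
    "Adjusted Mutual Information": adjusted_mutual_info_score,
    "Fowlkes-Mallows Score": fowlkes_mallows_score,
}


def cluster_metrics(y_true, y_pred):
    """Hypothetical helper: score predicted cluster labels against true labels."""
    return {name: float(fn(y_true, y_pred)) for name, fn in METRICS.items()}


def summarize(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: one metric table per dataset, as the summary describes."""
    return {
        "train": cluster_metrics(y_train, model.predict(X_train)),
        "test": cluster_metrics(y_test, model.predict(X_test)),
    }


if __name__ == "__main__":
    # Synthetic data with known labels (an assumption for this sketch).
    X, y = make_blobs(n_samples=300, centers=3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

    for dataset, scores in summarize(model, X_train, y_train, X_test, y_test).items():
        for name, value in scores.items():
            print(f"{dataset:5s} {name:28s} {value:.3f}")
```

All six functions take the true labels first and the predicted cluster labels second, and none of them depends on how the cluster IDs are numbered, so a model that recovers the classes under different IDs still scores 1.0.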
Signs of High Risk:
- Low Homogeneity Score: This indicates that the clusters formed contain a variety of classes, resulting in less pure clusters.
- Low Completeness Score: This suggests that class instances are scattered across multiple clusters rather than being gathered in a single cluster.
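- Low V Measure: V Measure is the harmonic mean of homogeneity and completeness, so a low value indicates poor overall clustering performance on at least one of the two.
- ARI close to 0 or Negative: A value near 0 implies that the cluster assignments agree with the true labels no better than random assignment would; a negative value implies agreement worse than chance.
- AMI close to 0: This means that the cluster labels share no more information with the true labels than would be expected by chance.
- Low Fowlkes-Mallows score: This indicates poor clustering performance in terms of pairwise precision and recall, of which the score is the geometric mean.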
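The risk signs listed above can be reproduced on toy labelings. The sketch below is illustrative only; the label arrays are made up for the example and are not output from any real model.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
)

# Two true classes with three samples each (toy data, assumed for illustration).
y_true = [0, 0, 0, 1, 1, 1]

# Case 1: everything is merged into a single cluster, so the cluster mixes classes.
merged = [0, 0, 0, 0, 0, 0]

# Case 2: every sample gets its own cluster, so each class is scattered.
scattered = [0, 1, 2, 3, 4, 5]

# Case 3: clusters that systematically cut across the true classes.
y_true_small = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]

merged_h = homogeneity_score(y_true, merged)         # low: the cluster is impure
merged_c = completeness_score(y_true, merged)        # high: no class is split
scattered_h = homogeneity_score(y_true, scattered)   # high: every cluster is pure
scattered_c = completeness_score(y_true, scattered)  # low: classes are split up
merged_ari = adjusted_rand_score(y_true, merged)     # no better than chance
crossed_ari = adjusted_rand_score(y_true_small, crossed)  # worse than chance

print(f"merged:    homogeneity={merged_h:.3f} completeness={merged_c:.3f} ARI={merged_ari:.3f}")
print(f"scattered: homogeneity={scattered_h:.3f} completeness={scattered_c:.3f}")
print(f"crossed:   ARI={crossed_ari:.3f}")
```

Homogeneity and completeness fail in opposite directions here, which is why the test reports both alongside V Measure rather than relying on a single score.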
Strengths:
- Provides a comprehensive view of clustering model performance by examining multiple clustering metrics.
- Uses established and widely accepted metrics from scikit-learn, providing reliability in the results.
- Able to provide performance metrics for both training and testing datasets.
- Clearly defined and human-readable descriptions of each score make it easy to understand what each score represents.
Limitations:
- Applies only to clustering models; it is not suitable for other types of machine learning models.
- Does not test for overfitting or underfitting in the clustering model.
- All the scores rely on ground truth labels, the absence or inaccuracy of which can lead to misleading results.
- Does not consider aspects like computational efficiency of the model or its capability to handle high dimensional data.