Posted on 2025-10-27, 15:50. Authored by Hossein Abroshan.
Modern machine learning models are vulnerable to data poisoning attacks that compromise the integrity of their training data, with label flipping being a particularly insidious variant. In a label flipping attack, an adversary maliciously alters a fraction of the training labels to mislead the model, which can significantly degrade performance or cause targeted misclassifications while often evading simple detection. In this work, we address this threat by introducing a modular, attack-agnostic detection framework (“AI to Protect AI”) that monitors model behaviour for poisoning indicators without requiring internal access or changes to the target model. A Behaviour Monitoring Module (BMM) continuously observes the model’s outputs, extracting telltale features such as prediction probabilities, entropy, and margins for each input. These features are analysed by an ensemble of detector models, including supervised classifiers and unsupervised anomaly detectors, that collaboratively flag suspicious training samples indicative of label tampering. The proposed framework is dataset-agnostic and model-agnostic, as demonstrated across diverse image classification tasks using the MNIST (handwritten digits), CIFAR-10 (natural images), and ChestXray14 (medical X-rays) datasets. Experimental results indicate that the system reliably detects poisoned data with high accuracy (e.g., an area under the ROC curve exceeding 0.95 on MNIST, above 0.90 on CIFAR-10, and up to 0.85 on ChestXray14), while maintaining low false alarm rates. This work highlights a novel “AI to protect AI” approach, leveraging multiple lightweight detectors in concert to safeguard learning processes across different domains and thereby enhance the security and trustworthiness of AI systems.
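To make the behaviour-monitoring idea concrete, the sketch below illustrates the kind of per-sample features the abstract names (prediction probability, entropy, top-2 margin) and a simple two-detector ensemble. It is a minimal illustration under assumed choices, not the paper's implementation: the function names (`behaviour_features`, `flag_suspicious`), the use of scikit-learn's `IsolationForest` and `LogisticRegression` as stand-ins for the detector ensemble, and the "flag if either detector fires" rule are all assumptions made for this example.

```python
# Minimal sketch (not the authors' implementation): extract behavioural
# features from a model's predicted class probabilities and flag
# suspicious training samples with a small detector ensemble.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression


def behaviour_features(probs: np.ndarray) -> np.ndarray:
    """Per-sample features: max probability, prediction entropy, top-2 margin."""
    p = np.clip(probs, 1e-12, 1.0)
    confidence = p.max(axis=1)                      # prediction probability
    entropy = -(p * np.log(p)).sum(axis=1)          # output entropy
    top2 = np.sort(p, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                # gap between top-2 classes
    return np.column_stack([confidence, entropy, margin])


def flag_suspicious(probs, poison_labels=None, contamination=0.05):
    """Ensemble of an unsupervised anomaly detector and, when labelled
    examples of poisoning are available, a supervised classifier."""
    X = behaviour_features(probs)

    # Unsupervised view: samples whose behaviour deviates from the bulk.
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    unsup_flag = iso.predict(X) == -1

    if poison_labels is None:
        return unsup_flag

    # Supervised view: learn a clean-vs-poisoned boundary in feature space.
    clf = LogisticRegression(max_iter=1000).fit(X, poison_labels)
    sup_score = clf.predict_proba(X)[:, 1]

    # Simple ensemble rule (assumed here): flag if either detector alarms.
    return unsup_flag | (sup_score > 0.5)


if __name__ == "__main__":
    # Demo on synthetic softmax outputs for a 10-class problem.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1000, 10))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print("flagged:", flag_suspicious(probs).sum(), "of", len(probs), "samples")
```

In practice the monitored features would be computed from the target model's actual outputs, and the flagged-sample scores could be thresholded against a validation set to keep the false alarm rate low, as the abstract reports.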