Article
Version 1
Preserved in Portico This version is not peer-reviewed
An Embarrassingly Simple Method to Compromise Language Models
Version 1
: Received: 28 June 2024 / Approved: 28 June 2024 / Online: 29 June 2024 (06:22:54 CEST)
How to cite: Wang, J. An Embarrassingly Simple Method to Compromise Language Models. Preprints 2024, 2024062045. https://doi.org/10.20944/preprints202406.2045.v1 Wang, J. An Embarrassingly Simple Method to Compromise Language Models. Preprints 2024, 2024062045. https://doi.org/10.20944/preprints202406.2045.v1
Abstract
Language models like BERT dominate current NLP research due to their robust performance, but they are vulnerable to backdoor attacks. Such attacks cause the model to consistently generate incorrect predictions when specific triggers are present in the input, while maintaining normal behavior on clean inputs. In this paper, we propose a straightforward data poisoning method targeting the BERT architecture. Our approach does not involve complex modifications to the model or its training process; instead, it relies solely on altering a small portion of the training data. By introducing simple perturbations into just 10% of the training dataset, we demonstrate the feasibility of injecting a backdoor into the model. Our experimental results show a high attack success rate, indicating that the model trained on the poisoned data can reliably associate the trigger with the attacker's desired outputs while its performance on clean data remains unaffected. This highlights the stealth and effectiveness of our method, emphasizing the need for improved defensive strategies to protect against such threats. Our study underscores the critical importance of ongoing research and development to safeguard AI systems from malicious exploitation, ensuring the security and reliability of NLP applications.
Keywords
transformer; attack
Subject
Computer Science and Mathematics, Computer Science
Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comments (0)
We encourage comments and feedback from a broad range of readers. See criteria for comments and our Diversity statement.
Leave a public commentSend a private comment to the author(s)
* All users must log in before leaving a comment