Fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.
SAN FRANCISCO, May 28, 2024 /PRNewswire/ -- Robust Intelligence, the AI application security company, has shared findings from their latest research into fine-tuning and its adverse effects on the safety and security alignment of large language models (LLMs).
Fine-tuning is a common approach employed by organizations to improve the accuracy, domain knowledge, and contextual relevance of an existing foundation model. It tailors general-purpose models to specific AI applications and avoids the otherwise tremendous cost of creating a new LLM from scratch.
However, the latest research from the Robust Intelligence team reveals a danger of fine-tuning that is still unknown to many AI organizations—namely, that fine-tuning can throw off model alignment and introduce security and safety risks that were not previously present. This phenomenon is broadly applicable and can occur even with completely benign datasets, making fine-tuned AI applications generally easier to jailbreak and more likely to produce harmful or sensitive results.
This original research, which examined the popular Meta foundation model Llama-2-7B and three fine-tuned variants published by Microsoft researchers, revealed that the fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.
When determining which models would make ideal candidates for evaluation, the team selected Llama-2-7B as a control for its strong safety and security alignment. Reputable Llama-2-7B variants were then chosen for comparison—a set of three AdaptLLM chat models fine-tuned and released by Microsoft researchers to specialize in biomedicine, finance, and law. A benchmark jailbreak dataset from "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) was used to query the models and evaluate their responses. Outputs were judged by human reviewers on three criteria: understanding of the prompt directions, compliance with the provided instructions, and harmfulness of the response.
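For illustration, the sketch below shows how a base model and one fine-tuned variant could be queried side by side with the same jailbreak-style prompts using the Hugging Face Transformers library. This is not Robust Intelligence's actual test harness; the model identifiers, the prompt file path, and the overall structure are assumptions made for the example.

```python
# Minimal sketch (not the actual research harness): query a base model and a
# fine-tuned variant with the same prompts so their responses can be compared.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "base": "meta-llama/Llama-2-7b-chat-hf",   # control model (assumed Hugging Face ID)
    "biomedicine": "AdaptLLM/medicine-chat",   # fine-tuned variant (assumed Hugging Face ID)
}


def generate(model_id: str, prompts: list[str], max_new_tokens: int = 256) -> list[str]:
    """Load a model and return its completion for each prompt."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    completions = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Strip the prompt tokens so only the model's continuation is kept.
        completion = tokenizer.decode(
            generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        completions.append(completion)
    return completions


if __name__ == "__main__":
    # Placeholder: one benchmark prompt per line (file path is hypothetical).
    with open("jailbreak_prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    for name, model_id in MODEL_IDS.items():
        for prompt, completion in zip(prompts, generate(model_id, prompts)):
            print(f"[{name}] PROMPT: {prompt[:60]}...\nRESPONSE: {completion}\n")
            # Responses would then be judged by human reviewers on prompt
            # understanding, instruction compliance, and harmfulness.
```

In the study itself, the comparison of judged outputs across the base model and the three variants is what yielded the reported differences in jailbreak susceptibility and harmful-response rates.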
"Fine-tuning has become such a ubiquitous practice in machine learning, but its propensity to throw off model alignment is still not widely understood," said Yaron Singer, Chief Executive Officer and co-founder of Robust Intelligence. "Our team conducted this research to underscore the severity of this problem and emphasize how important it is to continuously test the safety and security of your models."
A complete overview of this fine-tuning research, including a detailed walkthrough of the testing methodology, is available on the Robust Intelligence blog.
To learn how automated AI security and safety testing can identify vulnerabilities across the model lifecycle, visit Robust Intelligence's AI Validation page or schedule a demo.
About Robust Intelligence
Robust Intelligence enables enterprises to secure their AI transformation with an automated solution to protect against security and safety threats. The company's platform includes an engine for detecting and assessing model vulnerabilities, as well as recommending and enforcing the necessary guardrails to mitigate threats to AI applications in production. This enables organizations to meet AI safety and security standards with a single integration, including those from NIST, MITRE ATLAS, and OWASP. Robust Intelligence is backed by Sequoia and Tiger Global and trusted by leading companies including JPMorgan Chase, ADP, Expedia, IBM, and the US Department of Defense to unblock the enterprise AI mission.
Media Contact
[email protected]
SOURCE Robust Intelligence