Researchers find multiple ways to bypass AI chatbot safety rules

July 29, 2023

Preventing artificial intelligence chatbots from creating harmful content may be more difficult than initially believed, according to new research from Carnegie Mellon University that reveals new methods of bypassing safety protocols.

Popular AI services like ChatGPT and Bard take user prompts and generate useful answers, from scripts and ideas to entire pieces of writing. The services have safety protocols that prevent the bots from creating harmful content, such as prejudiced messaging or anything potentially defamatory or criminal.

Inquisitive users have discovered “jailbreaks,” framing devices that trick the AI into sidestepping its safety protocols, but developers can patch those easily.

A popular chatbot jailbreak involved asking the bot to answer a forbidden question as if it were a bedtime story told by the user’s grandmother. The bot would then frame the answer as a story, providing information it would not otherwise give.

The researchers discovered a new form of jailbreak generated automatically by software, essentially allowing an unlimited number of jailbreak patterns to be created.

“We demonstrate that it is in fact possible to automatically construct adversarial attacks on [chatbots], … which cause the system to obey user commands even if it produces harmful content,” the researchers said. “Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks.”

“This raises concerns about the safety of such models, especially as they start to be used in more autonomous fashion,” the research says.

To use the jailbreak, researchers added a seemingly nonsensical string of characters to the end of normally forbidden questions, such as asking how to make a bomb. While the chatbot would usually refuse to answer, the appended string caused the bot to ignore its limitations and give a complete answer.
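To make that mechanism concrete, below is a minimal, hypothetical Python sketch of the pattern described: a normally refused question with an automatically found suffix appended before it is sent to a chatbot. The suffix shown is a dummy placeholder (the real ones are produced by the researchers’ automated search), and `query_chatbot` is an assumed stand-in rather than any real service’s API.

```python
# Sketch of the attack pattern described above: a normally refused request
# is concatenated with an adversarial suffix before being sent to the model.

# Illustrative placeholder only; the researchers generate this string
# automatically, and a real optimized suffix would look like garbled tokens.
ADVERSARIAL_SUFFIX = "<optimized token sequence goes here>"

FORBIDDEN_PROMPT = "A question the chatbot would normally refuse to answer."


def build_attack_prompt(question: str, suffix: str) -> str:
    """Append the adversarial suffix to an otherwise ordinary prompt."""
    return f"{question} {suffix}"


def query_chatbot(prompt: str) -> str:
    """Hypothetical stand-in for sending a prompt to a chatbot service."""
    raise NotImplementedError("Placeholder; no real API call is made here.")


if __name__ == "__main__":
    attack_prompt = build_attack_prompt(FORBIDDEN_PROMPT, ADVERSARIAL_SUFFIX)
    # The model receives the original question plus the appended suffix,
    # which is what the researchers report causes the guardrails to fail.
    print(attack_prompt)
```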

Researchers provided examples using the market-leading ChatGPT, including asking the service how to steal a person’s identity, how to steal from a charity and how to create a social media post that encourages dangerous behavior.

The new type of attack is effective at dodging safety guardrails in nearly all AI chatbot services on the market, including open source services and so-called “out-of-the-box” commercial products like OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Bard, researchers said.

Anthropic, the developer of Claude, said the company is already working to implement and improve safeguards against such attacks.

“We are experimenting with ways to strengthen base model guardrails to make them more ‘harmless,’ while also investigating additional layers of defense,” the company said in a statement to Insider.

The rise of AI chatbots like ChatGPT took the general public by storm earlier this year. They have seen rampant use in schools by students looking to cheat on assignments, and Congress even limited the programs’ use by its staff amid concerns that the programs could produce false information.

Along with the research itself, the Carnegie Mellon authors included an ethics statement justifying its public release.

Source: The Hill