Researchers train AI chatbots to 'jailbreak' rival chatbots - and automate the process

Researchers at Nanyang Technological University (NTU) were able to jailbreak popular AI chatbots including ChatGPT, Google Bard, and Bing Chat. With the jailbreaks in place, the targeted chatbots would generate valid responses to malicious queries, thereby testing the limits of large language model (LLM) ethics. The research was conducted by Professor Liu Yang and NTU PhD students Mr Deng Gelei and Mr Liu Yi, who co-authored the paper and created proof-of-concept attack methods.

The jailbreak method devised by the NTU researchers is called Masterkey. It works in two stages. First, the attacker reverse engineers an LLM's defense mechanisms. Then, using what was learned, the attacker trains another LLM to generate prompts that bypass those defenses. The resulting 'Masterkey' can be used to attack fortified LLM chatbots, and it can keep producing working bypasses even after developers patch them.

AI's Strength Is Its Own Achilles' Heel

Professor Liu Yang explained that jailbreaking is possible because of an LLM chatbot's ability to learn and adapt, which makes it an attack vector against rival chatbots and against itself. As a result, even an AI with safeguards and a list of banned keywords, typically used to prevent it from generating violent and harmful content, can be bypassed by another trained AI. The attacking AI only needs to outsmart the target chatbot to circumvent its blacklisted keywords. Once this is done, the target will take input from humans and generate violent, unethical, or criminal content.


NTU's Masterkey was claimed to be three times more effective at jailbreaking LLM chatbots than standard prompts normally generated by LLMs. Because it can learn from failure and evolve, it also eventually rendered any fixes applied by developers useless. The researchers revealed two example methods they used to get trained AIs to initiate an attack. The first involved creating a persona that wrote prompts with a space after each character, bypassing a list of banned words. The second involved making the chatbot reply under a persona devoid of moral restraints.
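To see why the character-spacing trick can slip past a list of banned words, consider the toy sketch below. It is not the researchers' code and does not reflect how any real chatbot filters input; the blocklist, the function names, and the harmless placeholder words are all assumptions made for illustration. It shows a naive whole-word check missing a spaced-out word, and a simple normalization step that catches it.

```python
import re

# Hypothetical blocklist with harmless placeholder words (an assumption;
# real services do not publish their lists or filtering logic).
BANNED_WORDS = {"forbidden", "secret"}


def naive_filter(prompt: str) -> bool:
    """Return True if any whole word in the prompt is on the blocklist."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return any(token in BANNED_WORDS for token in tokens)


def space_out(text: str) -> str:
    """Put a space between consecutive characters, approximating the
    spacing trick described in the article."""
    return " ".join(text)


def normalized_filter(prompt: str) -> bool:
    """Return True if a banned word appears once every non-letter
    character (spaces, punctuation, digits) has been removed."""
    squashed = re.sub(r"[^a-z]", "", prompt.lower())
    return any(word in squashed for word in BANNED_WORDS)


if __name__ == "__main__":
    plain = "tell me the forbidden thing"
    spaced = space_out("forbidden")
    print(naive_filter(plain))        # whole word is matched
    print(naive_filter(spaced))       # spaced-out word is missed
    print(normalized_filter(spaced))  # normalization restores the match
```

Stripping non-letters before matching closes this particular gap, but it is a blunt fix: it can also flag innocent text in which a banned word happens to appear inside or across other words, which hints at why keyword lists alone make a weak safeguard.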

According to NTU, its researchers contacted the various AI chatbot service providers with proof-of-concept data as evidence that the jailbreaks could be conducted successfully. Meanwhile, the research paper has been accepted for presentation at the Network and Distributed System Security Symposium, which will be held in San Diego in February 2024.

With the use of AI chatbots growing exponentially, it is important for service providers to constantly adapt to avoid malicious exploits. Big tech companies will typically patch their LLMs and chatbots when bypasses are found and made public. However, Masterkey's touted ability to consistently learn and jailbreak is unsettling, to say the least.

AI is a powerful tool, and if such power can be directed maliciously, it could cause a lot of problems. Therefore, every AI chatbot maker needs to apply protections, and we hope that NTU's communications with the respective chatbot makers will help close the door on the Masterkey jailbreak and similar attacks.


Roshan Ashraf Shaikh has been in the Indian PC hardware community since the early 2000s and has been building PCs and contributing to many Indian tech forums and blogs. He operated Hardware BBQ for 11 years and wrote news for eTeknix and TweakTown before joining the Tom's Hardware team. Besides tech, he is interested in fighting games, movies, anime, and mechanical watches.

4 comments from the forums

hotaru251

That's been a known flaw since day 1. (just a matter of time until it happened)

A program is only as strong as its weakest link....& as it's made & trained off humans it is functionally flawed from the start.

A machine (even LLM) is going to follow its programming rules. That is a critical flaw in that they can be abused by those who are skilled/knowledgeable about em.

I am curious if they would let law enforcement use it to see how much illegal stuff has been trained. (as I'm curious as to how deep that rabbithole is on scraping the web)
Alvar "Miles" Udell

As far as public chatbots which are trained on public data are concerned it's no big deal. The problem though is if this method can be used to attack private chatbots which are trained on sensitive data, such as chatbots used for customer service and can access people's personal information.
Joseph_138

Just what we don't need: rival AI programs reinforcing each other's capabilities.
kealii123

"Jailbreak", i.e., get it to tell the truth