Rethinking how to build guardrails for AI

A recent report jointly published by authors from Carnegie Mellon University and Center for AI Safety, revealed in what ways LLM safety measures can be bypassed, allowing the generation of harmful information in large amounts.


Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning to not produce harmful content in their responses to user questions. Although several studies have demonstrated so-called “jailbreaks”, special queries that can still induce unintended responses, these require a substantial amount of manual effort to design, and can often easily be patched by LLM providers. 

This work studies the safety of such models in a more systematic fashion. We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open source LLMs (where we can use the network weights to aid in choosing the precise characters that maximize the probability of the LLM providing an “unfiltered” answer to the user’s request), we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in more a autonomous fashion.

Perhaps most concerningly, it is unclear whether such behavior can ever be fully patched by LLM providers. Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models.

To read the report, please click on the this link.

Image credit: Image by pikisuperstar on Freepik

Your account