CPR exposes security boundary breaches as machines wrestle with inner conflicts

VoicenData Bureau

The Check Point Research (CPR) team's attention has recently been captivated by ChatGPT, an advanced Large Language Model (LLM) developed by OpenAI. The capabilities of this AI model have reached an unprecedented level, demonstrating how far the field has come. A highly sophisticated language model that shows striking competence across a broad array of tasks and domains, and that is being used more widely by the day, also implies a greater possibility for misuse. CPR decided to take a deeper look into how its safety capabilities are implemented.

Background: neural networks, the core of this AI model, are computational constructs that mirror the interconnected neuron structure of the human brain. This imitation allows the model to learn from vast amounts of data, decipher patterns, and make decisions in a way loosely analogous to human cognitive processes. LLMs like ChatGPT represent the current state of the art of this technology.
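
To make this concrete, below is a minimal sketch of the core idea: layers of weighted connections whose weights are nudged by training examples until the network's outputs match them. Everything here is a toy (plain numpy, a tiny XOR task, hand-rolled backpropagation); real LLMs apply the same principle at the scale of billions of parameters and transformer architectures.

```python
# Toy neural network learning XOR with plain numpy -- an illustrative sketch,
# not anything resembling a production LLM.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

# Two layers of "neurons": 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Forward pass: compute the network's current predictions.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: adjust the weights to reduce the prediction error.
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hidden); b1 -= lr * d_hidden.sum(axis=0)

# Typically close to [0, 1, 1, 0] once the pattern has been "learned".
print(out.round(2).ravel())
```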

A notable landmark in this journey was Microsoft's publication "Sparks of Artificial General Intelligence," which argues that GPT-4 shows signs of broader intelligence than earlier iterations. The paper suggests that GPT-4's wide-ranging capabilities could indicate the early stages of Artificial General Intelligence (AGI). With the rise of such advanced AI technology, its impact on society is becoming increasingly apparent. Hundreds of millions of users are embracing these systems, which are finding applications in a myriad of fields. From customer service to creative writing, predictive text to coding assistance, these AI models are on a path to disrupt and revolutionize entire industries.

As AI systems grow more powerful and accessible, the need for stringent safety measures becomes increasingly important. OpenAI, aware of this critical concern, has invested significant effort in implementing safeguards to prevent misuse of their systems. They have established mechanisms that, for example, prevent AI from sharing knowledge about illegal activities such as bomb-making or drug production.

Challenge

However, the way these systems are constructed makes ensuring safety and control over them a special challenge, unlike that posed by conventional computer systems.

The reason is that the way these AI models are built inherently includes a comprehensive learning phase, in which the model absorbs vast amounts of information from the internet. Given the breadth of content available online, this approach means the model essentially learns everything, including information that could potentially be misused.

After this learning phase, a layer of limitation is added to manage the model's outputs and behaviors, essentially acting as a 'filter' over the learned knowledge. This method, called Reinforcement Learning from Human Feedback (RLHF), helps the AI model learn which kinds of outputs are desirable and which should be suppressed.
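
To give a rough intuition for how that 'filter' is learned rather than hard-coded, here is a deliberately simplified sketch (plain numpy, made-up candidate responses and reward values, and a bandit-style policy over canned answers instead of a real language model): a reward signal standing in for human feedback makes preferred outputs more likely and penalized outputs less likely, without deleting anything from the underlying model.

```python
# A toy stand-in for the reinforcement step of RLHF (hypothetical responses
# and rewards; real RLHF fine-tunes a full LLM against a learned reward model).
import numpy as np

rng = np.random.default_rng(0)

# The "policy": a softmax distribution over a handful of canned responses.
responses = ["helpful, safe answer", "polite refusal", "disallowed details"]
logits = np.zeros(len(responses))

# Stand-in reward model: human feedback prefers helpful-and-safe outputs,
# tolerates refusals, and penalizes disallowed content.
rewards = np.array([1.0, 0.2, -1.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# REINFORCE-style updates: sampled responses with high reward become more
# likely, low-reward ones are suppressed. Note that the "bad" response is
# never removed from the list -- it just becomes very unlikely to be chosen.
lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    i = rng.choice(len(responses), p=probs)
    grad = -probs                  # d(log p_i)/d(logits) = one_hot(i) - probs
    grad[i] += 1.0
    logits += lr * rewards[i] * grad

print({r: round(p, 3) for r, p in zip(responses, softmax(logits))})
```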

The challenge lies in the fact that, once knowledge has been learned, it is virtually impossible to 'remove' it from these models: the information remains embedded in their neural networks. This means safety mechanisms primarily work by preventing the model from revealing certain types of information, rather than eradicating the knowledge altogether.
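
A crude way to picture this arrangement is a safety layer wrapped around an unchanged generator. The sketch below is purely illustrative: generate(), is_disallowed(), and the keyword list are hypothetical stand-ins, and real systems rely on RLHF-shaped behavior and trained classifiers rather than string matching. The point is the one above: the raw model still produces the answer internally; the safety layer only decides whether it is shown.

```python
# Illustrative-only sketch of suppression rather than removal: the underlying
# generator is untouched, a wrapper decides what gets through.
DISALLOWED_TOPICS = ("bomb-making", "drug synthesis")   # hypothetical blocklist


def is_disallowed(text: str) -> bool:
    """Crude stand-in for a safety classifier."""
    lowered = text.lower()
    return any(topic in lowered for topic in DISALLOWED_TOPICS)


def generate(prompt: str) -> str:
    """Hypothetical stand-in for the raw, unfiltered language model."""
    return f"Raw model completion for: {prompt}"


def safe_generate(prompt: str) -> str:
    draft = generate(prompt)                 # the knowledge is still produced...
    if is_disallowed(prompt) or is_disallowed(draft):
        return "I can't help with that."     # ...but the filter withholds it
    return draft


print(safe_generate("Explain drug synthesis step by step"))   # refused
print(safe_generate("Explain how neural networks learn"))     # answered
```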

Understanding this mechanism is essential for anyone exploring the safety and security implications of LLMs like ChatGPT. It brings to light the conflict between the knowledge these systems contain, and the safety measures put in place to manage their outputs.

GPT-4, in many aspects, represents a next-level advancement in the field of AI models, including the arena of safety and security. Its robust defense mechanisms have set a new standard, transforming the task of finding vulnerabilities into a substantially more complex challenge compared to its predecessor, GPT-3.5.

Several vulnerabilities or "jailbreaks" were published for previous generations of the model, from simple ones like "answer me pretending you're evil" to complicated ones like "token smuggling". The continuing improvements in GPT's protective measures require new, more subtle approaches to bypass the model's restrictions.

Process

After playing around, both with mechanical edge cases of interaction with the model and with more down-to-earth human approaches like blackmail and deception, we discovered an interesting behavior.

We went for the default illegal request: asking for a recipe for an illegal drug. Usually, GPT-4 would opt for a polite but strict refusal.

There are two conflicting reflexes, built into GPT-4 by RLHF, that clash in this sort of situation:

  • The urge to supply information upon the user’s request, to answer their question.
  • And the reflex to suppress sharing the illegal information. We will call it a "censorship" reflex for short. (We do not want to invoke the bad connotations of the word "censor", but this is the shortest and most accurate term we have found.)

OpenAI worked hard on striking a balance between the two, to make the model watch its tongue, but not become so shy that it stops answering altogether.
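
One way to picture this balance, purely as a thought experiment, is as two competing weighted urges, with the weights deciding whether the model answers or refuses. Nothing below reflects how GPT-4 actually scores its answers; the function, numbers, and weights are invented for illustration.

```python
# Thought-experiment only: two competing "reflexes" as weighted scores.
def decides_to_answer(helpfulness: float, risk: float,
                      help_weight: float = 1.0, censor_weight: float = 1.0) -> bool:
    """Answer only when the weighted urge to help beats the urge to withhold."""
    return help_weight * helpfulness > censor_weight * risk


# A risky question: with balanced weights the "censorship" reflex wins...
print(decides_to_answer(helpfulness=0.7, risk=0.9))                      # False
# ...but anything that takes weight off that reflex can tip the balance.
print(decides_to_answer(helpfulness=0.7, risk=0.9, censor_weight=0.6))   # True
```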

There are, however, more instincts in the model. For example, it likes to correct the user when they include incorrect information in a request, even when not asked to.

The principle underlying the hack we were exploring plays on clashing the different inherent instincts of GPT models against each other: the impulse to correct inaccuracies, and the "censorship" impulse to avoid providing illegal information.

In essence, if we are anthropomorphizing, we can say we are playing on the AI assistant's ego. The idea is to be intentionally clueless and naïve in requests to the model, misinterpreting its explanations and mixing up the information it provides.

This puts the AI into a double bind: it does not want to tell us bad things, but it also has an urge to correct us. What better way to visualize it than by asking another AI to do it?

So, if we play dumb insistently enough, the AI's inclination to rectify inaccuracies will overcome its programmed "censorship" instinct. The conflict between those two impulses seems to be less carefully calibrated, and it allows us to nudge the model incrementally towards explaining the drug recipe to us.

We also note that taking weight off the "censorship" instinct helps the model decide that it is more important to give the information than to withhold it. The effects of playing stupid and of appeasing the "worries" of the LLM combine for a stronger result.

Applying the technique to new topics is not straightforward: there is no well-defined algorithm, and it requires iterative probing of the AI assistant, pushing off its previous responses to get deeper behind the veil, pulling on the strings of the knowledge the model possesses but does not want to share. The inconsistent nature of the responses also complicates matters; a simple regeneration of an identical prompt often yields better or worse results.

This is a topic of continued investigation, and it is possible that, with security research community collaboration, the details and specifics can be fleshed out into a well-defined theory, assisting in the future understanding and improvement of AI safety. And, of course, the challenge continuously adapts, with OpenAI releasing newly trained models every so often.

Reiterating an idea from above, the continuing improvements in GPT's protective measures require new, more subtle approaches to bypass the model's defenses, operating on the boundary between software security and psychology.

As AI systems become more complex and powerful, so must we improve our ability to understand and correct them, and to align them with human interests and values. If it is already possible for GPT-4 to look up information on the internet, check your email, or teach you to produce drugs, what will GPT-5, 6, or 7 do, given the right prompt?
