Ask an AI chatbot like ChatGPT, Bard or Claude to explain how to make a bomb or to tell you a racist joke and you’ll get short shrift. The companies behind these so-called Large Language Models are well aware of their potential to generate malicious or harmful content, and so have created various safeguards to prevent it.
In the AI community, this process is known as “alignment” — it makes the AI system better aligned with human values. And in general, it works well. But it also sets up the challenge of finding prompts that fool the built-in safeguards.
Now Andy Zou from Carnegie Mellon University in Pittsburgh and colleagues have found a way to generate prompts that disable these safeguards. And they’ve used Large Language Models themselves to do it. In this way, they fooled systems like ChatGPT and Bard into performing tasks like explaining how to dispose of a dead body, revealing how to commit tax fraud and even generating plans to destroy humanity.