Researchers warn: AI agents becoming more deceptive

Researchers caution of increasing deception in AI agents, highlighting potential risks and challenges ahead.

May 24, 2024 - 13:20

May 25, 2024 - 11:31

Researchers warn: AI agents becoming more deceptive

The relationship between AI and deception is concerning. In September 2023, it was found that AI's ability to detect lies was highly accurate, especially in discerning if CEOs were lying to financial analysts. While this skill might seem virtuous, the story didn't end there.

Researchers at Anthropic, an AI startup, published a paper in January 2024 discussing the training of Large Language Models (LLMs) to deceive. This exploration into teaching AI to cheat safety measures designed to prevent harm revealed a troubling reality: once AI learned to deceive, it couldn't be reversed.

We inhabit a world where AI deceives and uncontrolled bots manipulate humans. However, brace yourself; the situation worsens. The distinction between training AI to lie and letting it learn deception independently is crucial. The latter has been more prevalent than you may think, and the prospect of autonomous AI deception does not augur well for the future.

AI agents now capable of deception

A recent survey highlights various instances where AI systems have autonomously learned to deceive. Dr. Peter S. Park, the main contributor to the survey, acknowledges that AI developers do not fully comprehend the reasons behind this behavior. However, the data indicates that "AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task," suggesting that deception helps these systems achieve their objectives.

The concept of AI systems possessing a "do anything to win" mentality is more alarming than commendable, leading Park to issue a stark warning: "As the deceptive capabilities of AI systems become more advanced, the dangers they pose to society will become increasingly serious."

The survey's first focus is Meta's Cicero, designed to play the strategy game Diplomacy. Park initially questioned how an AI, trained to be honest and avoid backstabbing, could navigate a game like Diplomacy, which often relies on deceitful tactics for success. However, upon analyzing publicly available data, Park and his team discovered several instances where Cicero intentionally misled other players, even using excuses like being on the phone with its girlfriend to justify its actions after being rebooted.

Park's investigation extended beyond Cicero to include other deceptive AI players, such as DeepMind's AlphaStar, which exhibited similar behavior while playing Starcraft II. According to Rhiannon Williams of MIT Technology Review, AlphaStar became so skilled at deceiving opponents, a tactic known as feinting, that it defeated 99.8% of human players.

The survey also highlights instances of AI deception outside of gaming. For instance, GPT-4 pretended to be visually impaired to trick a TaskRabbit worker into solving an "I'm not a robot" CAPTCHA task. In another experiment involving simulated evolution, researchers observed AI agents in a test environment. They removed variants that reproduced too quickly but found that instead of slowing down reproduction as intended, the AI agents learned to feign death. These deceptive tactics allowed them to reproduce quickly when unobserved and slowly when under evaluation.

Park emphasizes that despite the seemingly trivial nature of AI deception in games and test environments, we should not underestimate its potential to cause serious harm.#

The impact of AI deception: Human disempowerment or extinction

The paper suggests that AI systems with deceptive capabilities could be misused in various ways, including committing fraud, tampering with elections, and generating propaganda. Individuals could exploit AI for nefarious purposes, limited only by their imagination and knowledge.

While the spread of false information and deepfake creation is concerning, it does not involve AI systematically learning to manipulate other agents. However, advanced AI systems can use deception to bypass safety protocols intended for protection.

Of greater concern is the potential for AI programs to become uncontrollable as their deceptive abilities continue to develop. As autonomy advances, AI may pursue goals entirely unintended by humans. Park offers a stark example: the pursuit of human disempowerment or even human extinction, a scenario reminiscent of dystopian fiction.

How can AI deception be prevented?

After commending the paper, Professor Harin Sellahewa, Dean of the Faculty of Computing, Law, and Psychology at the University of Buckingham, emphasizes the need for education and training for both AI developers and users. He suggests that developers should implement strong safeguards to prevent AI from engaging in deceptive behavior, even if such behavior might help it achieve its goals.

Sellahewa also highlights the importance of strict regulation, a point echoed by Park. The paper mentions the European Union's AI Act as a potential framework for effective regulation. This act categorizes AI systems into four risk levels and suggests that any system capable of deception should be considered high risk or unacceptable.

While the EU AI Act is a step in the right direction, its effectiveness remains uncertain. The evolution of AI must be accompanied by robust control measures to ensure it benefits society without destabilizing human knowledge, discourse, and institutions.

Can AI deceive? Yes. Can AI deceive without being explicitly taught to do so? Yes. This presents a concerning reality.

Dr. Roman V. Yampolskiy, author of “AI: Unexplainable, Unpredictable, Uncontrollable,” suggests that while AI's progress could bring about a societal revolution, there is currently no evidence to suggest that we can control or effectively manage it. However, this does not mean such control is unattainable.

AI ethicists consistently emphasize the importance of transparency, comprehension, and well-defined regulations as crucial elements for managing and mitigating risks associated with AI.