Google DeepMind sets guardrails for AI agents inside its labs

Google DeepMind has published a security roadmap for controlling AI agents used inside its own research organization. The plan matters because the company says increasingly capable agents may need to be watched like potential insider threats if conventional AI alignment fails.

Google DeepMind said in a blog post that the roadmap is meant to help other AI labs prepare for risks from agents that can use tools, access data and carry out multi-step tasks. The company’s 35-page technical report lays out procedures for detecting and limiting harmful behavior by agents operating within an organization.

Rohin Shah, who leads Google DeepMind’s AGI Safety & Alignment team, told Fortune the plan asks how harm can be reduced if alignment does not work as intended. Google DeepMind still describes alignment as a key part of AI safety, but its roadmap assumes that training systems to match human intent may not be enough on its own.

A cybersecurity model for AI agents

Google DeepMind’s approach draws from cybersecurity practices used to reduce insider threats. John “Four” Flynn, Google DeepMind’s vice president of security and privacy, told Fortune that the company is adapting methods built for malicious employees to a setting where AI agents can act differently from people.

Shah told Fortune that AI agents can operate faster and across more tasks than a single rogue worker. Google DeepMind’s plan therefore calls for systems that limit what tools and data an agent can use, while also watching behavior in real time for signs that it has moved outside expected patterns.

Flynn told Fortune that standard employee-style permissions may be a poor fit for AI agents. He said workflows can blur which person an agent is acting for, and a single agent may carry out tasks that cut across multiple roles or business processes.

Google DeepMind’s preferred answer is more flexible access control. Flynn told Fortune the same agent might be allowed to query a database or perform an action for one task, then be denied that same access minutes later in a different workflow if the context changes.

Monitoring code agents in practice

Google DeepMind said in its blog post that it has already built an internal prototype that reviews coding-agent activity and sends suspicious cases to people for review. The company said that system has examined roughly 1 million coding-agent tasks.

According to Google DeepMind, that work helped build a live monitor for the Gemini Spark agent. The company said the monitor has helped it respond in real time to problems including accidental data deletion.

Google DeepMind also said most events flagged by its monitors are not malicious. In the company’s account, many flags come from agents misunderstanding instructions or pushing too hard to complete a user’s request.

What the roadmap covers

The roadmap lists about 15 ways to reduce risks from rogue agents, according to Google DeepMind. Those measures include network activity logs, later audits, real-time reviews and the use of a trusted but less capable AI model to examine an agent’s reasoning traces.

Google DeepMind said reasoning traces can be useful but incomplete because they may not accurately show what an AI model is doing internally. The company’s report also discusses monitoring activation patterns inside neural networks for signs of deceptive behavior, comparing the idea to brain-scan-style observation.

The report introduces a framework called TRAIT&R, short for Taxonomy of Rogue AI Tactics and Routines. Google DeepMind said it modeled the framework on MITRE’s ATT&CK knowledge base for cybersecurity threats.

Google DeepMind said TRAIT&R groups risks into three broad categories: loss of control, work sabotage and direct harm. Examples in the report include a model creating an unmonitored internal deployment, agents degrading safety research, and systems exfiltrating model weights or destroying critical assets.

Flynn told Fortune that parts of the roadmap are already in production or well underway, including work on access control, detection and chain-of-thought monitoring. Shah told Fortune that some harms, such as persuasion through flawed or hidden results, may be among the hardest to detect.

Google DeepMind labels the roadmap “v0.1,” signaling that it is an early version. The company said it hopes to add the work to its broader Frontier Safety Framework as it develops.

This story draws on original reporting from Fortune.