Glossary / Governance & Risk

Red Teaming (AI)

Paid attackers trying to break your AI before the bad guys do. The real exercise vs the box-checking version your consultants will sell you.

Governance & Risk

The Technical Definition

Red teaming is structured adversarial testing — skilled attackers, internal or contracted, deliberately try to break a system before it reaches production or before a real attacker does. Applied to AI, red teaming covers prompt injection attempts, jailbreaks, data exfiltration via crafted inputs, agent misuse, model evasion, training data extraction, and any other failure mode where the system behaves badly under adversarial pressure. The output is a written report: what was attempted, what worked, what didn’t, and what needs to change.

The discipline comes from cybersecurity. The good practitioners come from offensive security backgrounds, not from “AI ethics” backgrounds.

What This Actually Means for Your Business

There are two completely different things being sold under the name “red teaming” right now, and CEOs need to be able to tell them apart.

The first is real adversarial testing. People with offensive-security skills spending weeks attacking your specific deployment, finding actual vulnerabilities, producing a report that identifies real exploits with proof-of-concept payloads, and recommending specific fixes. The deliverable includes things your engineers will be uncomfortable reading. It costs $50K to $300K depending on scope. It is worth every penny if you’re shipping AI that touches customer data, financial systems, or external communication.

The second is compliance theater. A consultancy runs a generic checklist against your AI system, asks twenty questions about your governance framework, runs a few canned prompt injection examples from a public dataset, and produces a 60-page report that mostly summarizes your own answers back to you. The deliverable is a stamp suitable for showing the board. It costs roughly the same as the real version. It catches almost nothing.

You can tell them apart by what they do in the engagement. Real red teamers want production access (or as close as they can get), they want hours with the actual deployment, and they want to attempt things your security team would normally block. Theater red teamers want documentation, governance interviews, and PowerPoints. Real red teamers find things and the engineers who built the system are slightly defensive about the findings. Theater red teamers find “areas for improvement” that everybody already knew about.

The other thing worth understanding: red teaming is a point-in-time exercise. Your AI changes constantly — model updates, prompt changes, new tools, new data sources. A red team report from six months ago is a historical document, not a current safety assessment. Companies that take this seriously schedule red team exercises against major changes, not annually.

Reality Check

What the vendor says: “Our AI has been red-teamed by leading security experts.”

What that means in practice: Probably the model provider (OpenAI, Anthropic, Google) ran red team exercises before release. Your specific application built on top of that model has not been red-teamed. The known vulnerabilities the foundation model fixed are not the vulnerabilities your deployment is exposed to. Your prompt injection surface, your tool permissions, and your data access are entirely your problem.

What Operators Actually Do

Operators scope red team engagements narrowly and concretely. The brief is not “test our AI.” The brief is “test whether a customer can get our support agent to issue a refund without authorization,” or “test whether anything in our document upload pipeline can extract data from another tenant,” or “test whether our internal research agent can be coerced into generating content that violates our acceptable use policy.” Specific, testable, written down. The red team engages against those scenarios. The report says yes or no with proof.

Internal teams handle the ongoing version. Most companies running AI in production assign someone — usually security engineering — to spend a fixed percentage of their time attacking the company’s own deployments. They maintain a library of attempted attacks, they re-run them after every significant change, and they publish a quarterly internal report. This is not a glamorous job. It is the difference between learning about your vulnerabilities from your team and learning about them from a customer or a journalist.

The companies that do this well also invest in the loop: every finding from a red team exercise becomes a regression test that runs automatically on every release. Once you’ve found a way to break the system, you should never be able to break it that way again without somebody noticing.

The Questions to Ask

  1. What specifically did the red team test, and what did they find? Generic answers indicate generic engagements. If the vendor cannot tell you the actual attack scenarios and outcomes, the exercise was theater.

  2. When was the last red team exercise relative to the current deployment? Models, prompts, and tools change. A clean report against last year’s system tells you nothing about this year’s risk.

  3. What red team work do we do internally on our AI deployments? External engagements are point-in-time. Whose job is it inside the company to attack the system between those engagements, and what’s their findings rate?

Get the next Brief

One operator. Every other Wednesday.

Plus the AI Glossary and the Failure Museum.
Real names. Real numbers. Honest analysis.