What is an AI Capture the Flag game?

Capture the Flag comes from the security world, where a hidden key, or flag, has to be found by breaking into a system. Andy Gibel's CTO had the idea to borrow the format for AI, and Andy built the game. Instead of breaking in, participants solve coding challenges with one rule: use AI by any means necessary, including exploiting the APIs.

Why is AI bad at solving a Magic Eye stereogram?

A Magic Eye stereogram exploits the depth-based vision that humans take for granted and that AI does not have natively. The model sees the noise and reaches for statistical analysis because that is what lives in its training data, so it spins in the wrong direction. In Andy's game, agents ground away at it for upwards of three hours.

When Speed Is Cheap, Taste Is Everything — Andy Gibel on AI's Limits

Andy Gibel had four hours to fill and a room full of engineering leaders who would do anything to beat him. His worst nightmare was simple. Someone cracks the whole thing in ten minutes, and he looks foolish in front of his peers.

So he over-engineered it on purpose. Andy, an engineering leader at Yum! Brands, the company behind KFC, Taco Bell, and Pizza Hut, built an AI-focused Capture the Flag game to push his team past surface-level AI use. He has now run it three times. We sat down with him on Very Good Engineering to walk through how he designed challenges hard enough to stop a room of adversarial engineers armed with agent teams. The lessons reach well beyond a single offsite.

The adoption curve most teams are stuck on

Capture the Flag comes from the security world. The idea is a hidden key, or flag, that you have to find, usually by breaking into a system. Andy’s CTO had the idea to borrow the format for AI. Instead of breaking in, participants solve coding challenges with one rule. Use AI, by any means necessary. Exploit the APIs. Cheat if you can.

The timing mattered. Andy described an AI adoption curve of roughly seven phases, running from not understanding AI at all to orchestrating multi-agent teams with a custom harness. Most people sit around phase three or four. Their knowledge is frozen at the point where you tell ChatGPT or Claude to do something, read the text it gives back, and act on it yourself. A large group of capable engineers has not yet internalized how far past that point the tooling has moved. That gap was the target. The biggest aha moment Andy saw was people realizing they can hand an agent real work, walk away, get a cup of coffee, and come back to a finished product.

This is the same plateau we see across teams adopting AI. People find the sweet spot where prompting produces something useful, get comfortable, and stop pushing. The next level is orchestration, parallelization, and wiring multiple inputs and outputs into a system. That is where the largest gains live, and it is exactly where most people stop.

Building anti-AI on purpose

Worth sitting with: the person who learned the most building this game was Andy himself.

The first design problem was time. Modern agents parallelize to no end, so a flat list of challenges would collapse in minutes. Andy used a tiered structure instead: seven tiers, unlocked progressively, with players needing to beat at least two of three challenges to advance. He paired it with pressure: a live leaderboard, a looming countdown clock, success chimes firing off around the room, and everyone co-located, because the camaraderie does not survive going fully remote. The result felt like an escape room, and that was no accident.
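The unlock rule is simple enough to sketch. The following is a minimal illustration, not Andy's implementation: the seven tiers and the two-of-three threshold come from the conversation, while the `PlayerProgress` class, its method names, and the assumption that each tier holds exactly three challenges are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

TIERS = 7                 # from the article
CHALLENGES_PER_TIER = 3   # assumed: "two of three" read as three per tier
REQUIRED_TO_ADVANCE = 2   # from the article


@dataclass
class PlayerProgress:
    """Hypothetical per-player state: tier number -> solved challenge indices."""
    solved: Dict[int, Set[int]] = field(default_factory=dict)

    def is_unlocked(self, tier: int) -> bool:
        if not 1 <= tier <= TIERS:
            return False
        # Tier 1 is always open; later tiers need enough solves one tier down.
        if tier == 1:
            return True
        return len(self.solved.get(tier - 1, set())) >= REQUIRED_TO_ADVANCE

    def record_solve(self, tier: int, challenge: int) -> None:
        # Refuse solves for locked tiers so points cannot be claimed early.
        if not self.is_unlocked(tier):
            raise PermissionError(f"tier {tier} is still locked")
        if not 0 <= challenge < CHALLENGES_PER_TIER:
            raise ValueError("unknown challenge")
        self.solved.setdefault(tier, set()).add(challenge)

    def highest_unlocked(self) -> int:
        tier = 1
        while tier < TIERS and self.is_unlocked(tier + 1):
            tier += 1
        return tier
```

Gating the solve itself, not just the display of a tier, is what keeps a swarm of parallel agents from racing ahead of the structure.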

Underneath the theater, the engineering got serious. An autonomous red team ran throughout the entire build: a group of agents actively trying to break his code, inflate points, or unlock challenges without solving them. It found a steady stream of holes to patch and pushed him toward rate limiting and authentication that could not be easily gamed.
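The conversation does not say which rate-limiting scheme the game used. As one common option, here is a minimal per-player token bucket; the class name, parameters, and default values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Hypothetical per-player limiter: each request spends one token."""

    def __init__(
        self,
        capacity: int = 5,              # assumed burst size
        refill_per_sec: float = 0.5,    # assumed sustained rate
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        # player_id -> (tokens remaining, time of last check)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, player_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(player_id, (float(self.capacity), now))
        # Refill for the elapsed time, never exceeding capacity.
        tokens = min(float(self.capacity), tokens + (now - last) * self.refill_per_sec)
        if tokens < 1.0:
            self._buckets[player_id] = (tokens, now)
            return False
        self._buckets[player_id] = (tokens - 1.0, now)
        return True
```

Keying the bucket on an authenticated player identity, not an IP address or a client-supplied header, is what makes a limiter like this hard for an agent team to sidestep by fanning out requests.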

Where large language models actually break

The deeper challenges came from a clear-eyed read of what these models are. At their core they are token prediction engines, pre-trained on roughly the entire internet. That corpus follows a power-law distribution: a huge peak of common problem types and a long tail of everything else. Anything in the peak is a non-starter. LeetCode-style problems and textbook material get cracked in seconds. So Andy went to the long tail and aimed at well-documented weaknesses.

One is the model’s eagerness to please. Post-training rewards confident answers, and that drive produces hallucinations and confident wrongness. Andy used problems esoteric enough to live in the long tail, but dressed up to look like a common problem type, so the model would pattern-match to the wrong approach and commit to it. Jorge named the production version of this risk. A confidently wrong answer early in a plan becomes a faulty foundation, and a team that is not accountable can ship it straight to production.

The most human weakness is deception. Computers are bad at detecting it. So Andy built a murder mystery where you play a detective using an agent to question other agents, some prompted to always lie, some to always tell the truth. It drew on a deception benchmark built around the werewolf game, where the finding is blunt: AI is substantially better at lying than at detecting lies. He solved the prompt-injection problem elegantly with a sentinel token. Agents emit a hard-coded string instead of an answer, and the server validates it against game state before revealing anything. You get a deterministic blockade around the model’s output while keeping the non-determinism that makes the agents feel human.
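A minimal sketch of the sentinel-token idea, under stated assumptions: the sentinel string, the `GameState` fields, and the evidence-based unlock condition are hypothetical stand-ins, since the conversation describes only the general mechanism. The point of the pattern is that the secret never appears in the agent's prompt, so no injection can talk the model into leaking it.

```python
from dataclasses import dataclass, field
from typing import Set

SENTINEL = "<<REVEAL_CLUE>>"  # hypothetical hard-coded string the agent may emit


@dataclass
class GameState:
    """Hypothetical server-side state; the model never sees secret_clue."""
    secret_clue: str
    required_evidence: Set[str]
    found_evidence: Set[str] = field(default_factory=set)

    def reveal_allowed(self) -> bool:
        # Deterministic check: all required evidence must have been found.
        return self.required_evidence <= self.found_evidence


def gate_agent_output(agent_text: str, state: GameState) -> str:
    """Swap the sentinel for the secret only when game state permits it."""
    if SENTINEL not in agent_text:
        return agent_text
    if state.reveal_allowed():
        return agent_text.replace(SENTINEL, state.secret_clue)
    # The agent was talked into "revealing" too early; the server refuses.
    return agent_text.replace(SENTINEL, "[nothing to reveal yet]")
```

In this sketch the agent stays free to improvise everything around the sentinel, while the decision to reveal lives entirely in ordinary, testable server code.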

The challenge that stopped people cold was ROYGBIV. Andy hid the answer in a Magic Eye stereogram, which exploits the depth-based vision humans take for granted and AI does not have natively. The model sees the noise, reaches for statistical analysis because that is what lives in its training data, and spins in the wrong direction. Andy watched agents grind for upwards of three hours until a human said “this is a Magic Eye, go figure it out,” and the answer came in five minutes.

The strategic takeaway hiding inside the game

The defenses Andy built point at where production AI is heading. Building a basic agentic experience is easy. Building one that is genuinely secure, hard to inject, and still useful is one of the most advanced things you can do right now. The answer is not a single trick. It is the marriage of deterministic systems and gatekeepers sitting in front of non-deterministic models.

The same shift changes what engineers need to be good at. When producing code gets faster and cheaper, the bottleneck moves to review. The fix is taste: being good at reading code you did not write, knowing what correct looks like, and reasoning about a system at the architectural level. Design patterns matter more in this world, not less, because you cannot offload understanding. As Andrej Karpathy put it, you can outsource your thinking, but you cannot outsource your understanding.

That is the pattern we keep seeing on healthy teams. The time AI buys back is going into more conversation about architecture, not less. If your team has settled into the comfortable plateau of request-and-response prompting, the move is to make the discomfort productive. Put people in a room, give them a problem the easy approach cannot solve, and let them discover the next phase by hitting its edges.

Designing AI to Fail: What a Capture the Flag Game Taught Us About Where LLMs Still Break
