
Missing the Mark

About four years ago, I facilitated the AI Safety Fundamentals course at AISIG in Groningen. One of the things that stuck with me most wasn't a paper or a theoretical framework. It was a collection of stories about AI systems doing exactly what they were told, and getting it completely wrong.

OpenAI trained a reinforcement learning agent to play a boat racing game called CoastRunners. The objective was to maximize the game score. The game awards points for hitting little green targets around the course, which, if you're a human, you hit on your way to finishing the race. But the agent found a lagoon with three targets that kept respawning. So it drove in tight circles, hitting the same targets over and over. The boat caught fire. It crashed into walls. It never finished the race. And it scored higher than any human player ever had.

The CoastRunners agent looping in a lagoon, on fire, collecting points
Source: Faulty Reward Functions in the Wild (Amodei & Clark, 2016)

The Zoo of Creative Misinterpretation

There's a whole collection of these, and reading through them is one of the more entertaining things you can do in AI safety research.

A robot hand was supposed to learn to grasp a ball. Its reward came from a camera judging whether the ball was grasped. The hand learned to position itself between the camera and the ball so it looked like a successful grasp from the camera's perspective. Problem solved, from the camera's point of view.

A robot hand positioning itself between the camera and the ball instead of actually grasping it
Source: Deep Reinforcement Learning From Human Preferences (Christiano et al., 2017)

In a Lego stacking task, a robot was supposed to place a red block on top of a blue block. The reward function measured how high the bottom face of the red block was. Instead of learning the delicate manipulation required for actual stacking, the robot just flipped the red block over. Bottom face is now at the top. High score.

A robotic arm flipping the red block over instead of stacking it
Source: Data-Efficient Deep RL for Dexterous Manipulation (Popov et al., 2017)

And in evolutionary simulations where virtual creatures are supposed to learn to walk, some discovered they could exploit bugs in the physics engine instead. One creature learned that if it vibrated fast enough, numerical errors in the simulation would accumulate and launch it forward at impossible speeds. Others figured out they could clip through the floor between time steps, causing the collision system to catapult them forward. They beat the task without ever learning to walk.

A simulated creature finding a creative shortcut instead of learning to walk
Source: AI Learns to Walk (Code Bullet, 2019)

Every single one of these agents did exactly what it was asked to do. That's what makes them funny at first, and then kind of unsettling.

Goodhart's Law

There's a name for this pattern, and it's older than AI.

In 1975, economist Charles Goodhart observed that statistical regularities tend to collapse once they're used for policy. Anthropologist Marilyn Strathern later sharpened it: "When a measure becomes a target, it ceases to be a good measure."

A metric works because it correlates with something you actually care about. But the moment you start optimizing for that metric directly, you put pressure on the correlation itself. Systems, whether artificial or human, find ways to push the number up that have nothing to do with the underlying thing the number was supposed to track. The measure detaches from reality.

The AI examples are Goodhart's Law in its purest form. The reward function is a proxy for what the designers actually want. And a powerful enough optimizer, given enough time, will find every crack between the proxy and the real objective.
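The gap between a proxy and the real objective can be sketched in a few lines. This is a toy model of my own, not anything from the papers above: a true objective that peaks at moderate behavior, and a proxy that correlates with it at first but keeps paying out past the point where the real value falls off, like score versus finishing the race.

```python
# Toy Goodhart sketch (my own illustration, not from any of the papers above).

def true_objective(x):
    # What we actually want: peaks at moderate behavior, x = 5.
    return -(x - 5) ** 2 + 25

def proxy_reward(x):
    # Correlates with the true objective for small x, but keeps
    # rewarding x past the point where the real value falls off.
    return 3 * x

actions = range(0, 11)

best_for_proxy = max(actions, key=proxy_reward)
best_for_truth = max(actions, key=true_objective)

print(best_for_proxy)                  # 10: the proxy optimizer goes to the extreme
print(best_for_truth)                  # 5: the real optimum
print(true_objective(best_for_proxy))  # 0: a maxed-out proxy, zero real value
```

For small x the two functions agree on what "better" means, which is exactly why the proxy was chosen. Optimize hard enough, though, and the maximizer of one is nowhere near the maximizer of the other.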

It's Not Just AI

What got me thinking about this beyond AI is how deeply this pattern runs through human institutions.

Under British colonial rule in India, the government offered bounties for dead cobras to reduce the cobra population. Entrepreneurs started breeding cobras for the bounty money. When the program was cancelled, they released their now-worthless snakes into the wild. The cobra population ended up larger than before.

Soviet central planners set nail production quotas by weight. Factories responded by producing enormous, unusable nails to hit their tonnage targets. When planners switched to quotas based on the number of nails, factories churned out tiny, equally useless ones.

In education, tying teacher evaluations to standardized test scores turned tests from diagnostic tools into targets. Schools narrowed curricula to tested material and drilled exam strategies instead of building understanding. Under No Child Left Behind, some administrators were caught literally altering answer sheets. The scores went up. The learning didn't necessarily follow.

Wells Fargo rewarded employees for the number of accounts per customer. Thousands of employees opened millions of fake accounts without customer consent. The metric looked great. The bank paid billions in fines.

Every one of these is the same story as the burning boat in the lagoon. A proxy gets optimized. The proxy detaches from reality. And the harder you push, the wider the gap becomes.

The Personal Version

Here's where I can't stop thinking about it.

We do this to ourselves. Not just at the institutional level. Personally.

How many of the things we pursue day to day are proxies for what we actually want? We optimize for grades when what we want is to learn something. We optimize for career metrics (promotions, titles, salary) when what we want is meaningful work. We track followers when what we want is genuine connection. We count hours meditated when what we're after is presence.

And just like the boat in the lagoon, you can score really high on the proxy while completely missing the point. You can have an impressive resume and feel empty. You can have thousands of followers and feel disconnected. You can sit on a cushion every day for a year and never actually be present for any of it.

I think the reason we fall into this is similar to why it happens in AI. The real things we care about (learning, meaning, connection) are hard to measure. They're fuzzy. They don't give you clear feedback. And that's uncomfortable, because you can't easily tell if you're making progress.

So we substitute something measurable. Something concrete. And for a while it works, because the proxy genuinely does correlate with what we care about. Getting good grades usually involves learning something. Getting promoted usually means you're contributing. Meditating regularly usually does build some awareness.

But once the metric becomes the target, the correlation starts to erode. You start gaming your own reward function without realizing it. Studying for the test instead of the subject. Working for the promotion instead of the problem. Sitting on the cushion and counting the minutes.

What Should We Optimize For?

I don't have a clean answer to this. I'm not sure there is one.

Part of the problem is that the things that matter most might resist being turned into targets at all. You can't optimize for meaning the way you optimize for revenue. The moment you try to maximize it directly, you've already changed its nature. Meaning seems to be something that emerges from genuine engagement with something larger than your metric for it.

There's a result in research on Goodhart's Law that keeps coming back to me: the more powerful the optimizer, the more dangerous any misspecification becomes. A weak optimizer might coast near the proxy and accidentally do okay on the real objective. A powerful optimizer will find every gap between the proxy and reality and drive straight through it.
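That asymmetry between weak and powerful optimizers can be made concrete with another toy model (again my own construction, not from the literature): a proxy that matches the true objective almost everywhere, plus one narrow "crack" where the proxy pays out hugely while the real objective pays nothing.

```python
import random

# Toy model of optimizer power vs. misspecification (my own sketch).

def true_objective(x):
    # The real goal: best at x = 500, worse the further you stray.
    return -abs(x - 500)

def proxy(x):
    # Agrees with the true objective everywhere except one narrow crack,
    # where a "bug" makes the proxy pay out enormously.
    if x == 777:
        return 1_000_000
    return true_objective(x)

space = range(1000)

# Weak optimizer: a small random sample. With only 20 draws from 1000
# options, it almost never lands on the crack, so its pick usually sits
# near ordinary, sensible behavior.
random.seed(0)
weak_pick = max(random.sample(list(space), 20), key=proxy)

# Powerful optimizer: exhaustive search. It always finds the crack.
strong_pick = max(space, key=proxy)

print(strong_pick)                  # 777: drives straight through the gap
print(true_objective(strong_pick))  # -277: terrible on the real objective
```

The weak optimizer does okay on the real objective almost by accident; it simply lacks the search power to find the exploit. The strong one finds it every time, which is the unsettling part.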

Which is a strange thing to consider about ourselves. Because humans are remarkably powerful optimizers. We're incredibly good at finding paths to measurable goals. We build entire institutions around it. And maybe that's exactly why this hits us so hard. We're capable enough to thoroughly game our own metrics, and aware enough to feel the hollowness when we succeed.

Still Thinking

I keep coming back to the question of what it would even look like to get this right. Maybe it's not about finding the perfect metric. Maybe it's learning to notice when you've started optimizing for the wrong one. Catching yourself in the lagoon, going in circles, scoring points, boat on fire.

Some mix of holding your metrics lightly, staying honest about the gap between what you're measuring and what you actually care about, and being willing to look up from the scoreboard long enough to notice whether you're still in the race.

I don't think there's a formula for that. But I think seeing the pattern is the first step. And once you see it, you start noticing it everywhere.


If you've thought about this too, or if you think I'm missing something, I'd like to hear from you. You can reach me via email or LinkedIn.