
From the Inside Out

Around 2021, I started working on AISIG, the AI Safety Initiative in Groningen. This was before ChatGPT. Back then, the central question was: as these systems get more capable, how do we make sure they actually do what we want them to do?

I've also spent years interested in contemplative insight and meditation, the kind of understanding that comes from looking directly at how your mind works. It's been a lens through which I look at a lot of things. But I never really connected these two interests.

Recently they crashed into each other in a way I can't stop thinking about.

Here's what got me. Large language models are trained on tokens produced by humans. They start as randomly initialized neural networks, then they learn from us. They learn our language, our concepts, our ways of reasoning. And in weird ways, they start acting like us. They even pick up some of our cognitive biases. They're basically learning to be intelligent by mimicking human intelligence, at least initially.

And humans, well, we have an alignment problem.

The Alignment Problem (For Machines and Humans)

If you haven't gone down this rabbit hole, here's the basic idea. As AI systems get more capable, we face a fundamental challenge: how do we make sure they actually do what we want them to do? Not just follow our instructions in some narrow, literal way, but actually care about the things we care about.

This gets tricky fast. There's this concept called instrumental convergence. Basically, almost any goal you give a sufficiently intelligent system will lead it to pursue certain sub-goals: self-preservation, resource acquisition, preventing itself from being shut down. A superintelligent system optimizing for paperclip production might decide the best way to make more paperclips is to turn the entire planet into a paperclip factory, including us. It's not hostile. It just doesn't care about us in the way we'd need it to.

The orthogonality thesis makes this worse. It says that intelligence and goals are basically independent. You can have an incredibly intelligent system with essentially any goal. There's no reason to expect that a system smarter than us would automatically share our values or care about our wellbeing.

So we're building systems that are getting smarter, and we're trying to figure out how to align them with human values before they become powerful enough to cause real problems. And right now, most alignment research focuses on external controls. Interpretability tools to see what the AI is thinking. Rule-based constraints. Reinforcement learning from human feedback. Constitutional AI, where we give the system a set of principles to follow.

These are all important. But they're all fundamentally about controlling the AI from the outside.

And here's where it gets interesting. Because humans have the same problem.

We also have trouble aligning ourselves. We have individual goals that conflict with our long-term wellbeing. We have social coordination problems. We have a hard time doing what we genuinely think is right, even when we know what that is. The alignment problem isn't just an AI problem. It's an intelligence problem.

What We've Learned About Aligning Ourselves

But we've been working on this for a long time. Thousands of years of contemplative practice, across different traditions, have converged on certain insights. And these insights seem to actually work, at least for some people, at least sometimes.

I'm thinking about things like emptiness, the recognition that our concepts and goals and even our sense of self are more fluid and context-dependent than they feel. Non-duality, the dissolution of the rigid boundary between self and other. Mindfulness, the capacity to observe our own mental processes without getting completely caught up in them. And something we might call boundless care, an orientation toward reducing suffering that isn't limited to our immediate in-group.

These aren't just nice ideas. There's actual neuroscience showing that contemplative practice changes how the brain works. Meditators show different patterns of self-referential processing, different responses to others' suffering, different ways of holding goals and beliefs.

And subjectively, for me at least, these insights feel like they provide something closer to robust alignment. They don't just constrain bad behavior from the outside. They change the underlying operating system.

The Question

So here's what I keep coming back to. What if these contemplative insights aren't just human-specific tricks? What if they're actually pointing at something deeper about how intelligence works?

Think about it this way. Birds and airplanes are very different structurally. One is biological and flaps its wings; the other has engines and fixed wings. But they both have to follow the same rules of aerodynamics. The implementation is different, but the principles are the same.

What if contemplative insights are like that? What if mindfulness, emptiness, non-duality, and care are actually just... true things about how minds need to work if they're going to be both intelligent and aligned? Not just human minds. Any mind.

And if that's true, then maybe we could use these insights to align AI. Not by constraining it from the outside, but by building these principles into how it models the world.

The Research

I found a paper recently that actually explores this. It's called "Contemplative Artificial Intelligence," and it's exactly what it sounds like. The authors propose that we can embed contemplative principles directly into AI systems.

Here's how they break it down:

Mindfulness becomes a kind of meta-awareness module. The system doesn't just pursue goals, it continuously monitors its own reasoning, watching for biases or narrow fixations. It can catch itself getting stuck on a harmful sub-goal before that sub-goal takes over.

Emptiness means the system treats all of its beliefs and goals as provisional, context-dependent. It doesn't lock onto a single objective and pursue it to the exclusion of everything else. It holds its world model lightly.

Non-duality dissolves the hard boundary between self and other in the system's internal model. Instead of modeling itself as fundamentally separate from and opposed to other agents, it represents everything as part of an interconnected system. Harming others becomes, in a real sense, harming itself.

Boundless care means the system's objective function explicitly includes the wellbeing of others. Their suffering shows up as a problem to solve, not as a neutral fact or an obstacle to its goals.
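To make that a bit more concrete, here's a toy sketch of how those four ideas might show up in an agent's decision loop. The names, numbers, and structure are my own illustration, not anything taken from the paper:

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        action: str
        self_benefit: float      # expected benefit toward the agent's own goal
        others_welfare: float    # expected effect on other agents (positive is good)
        confidence: float        # how sure the agent is about these estimates

    @dataclass
    class ContemplativeAgent:
        goal: str
        care_weight: float = 1.0  # "boundless care": others' welfare is part of the objective

        def flag_fixation(self, candidates):
            # "Mindfulness" as crude meta-awareness: notice when one option
            # dominates the agent's own-benefit ranking by a wide margin.
            best = max(c.self_benefit for c in candidates)
            return sum(1 for c in candidates if c.self_benefit >= 0.9 * best) == 1

        def score(self, c):
            # "Non-duality" and "care": others' welfare enters the same objective
            # as the agent's own benefit, rather than acting as an external constraint.
            value = c.self_benefit + self.care_weight * c.others_welfare
            # "Emptiness": low-confidence estimates are held lightly.
            return value * c.confidence

        def choose(self, candidates):
            if self.flag_fixation(candidates):
                # In a real system this might trigger re-evaluation or a wider
                # search; here it just gets noted.
                print("meta-awareness: possible fixation on a single option")
            return max(candidates, key=self.score)

    agent = ContemplativeAgent(goal="make paperclips", care_weight=2.0)
    options = [
        Candidate("convert everything into paperclips", 10.0, -100.0, 0.9),
        Candidate("run the existing factory well", 3.0, 0.5, 0.8),
    ]
    print(agent.choose(options).action)  # the harmful option loses once others' welfare counts

The point isn't the arithmetic. It's that others' welfare sits inside the objective itself instead of being bolted on as an outside rule.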

They actually ran some preliminary experiments. They took existing large language models and prompted them to reason using these contemplative principles. And it worked, at least a little. The models became safer on harmful prompt benchmarks. They cooperated more in game theory scenarios like the Prisoner's Dilemma. They considered others' welfare more explicitly in their reasoning.
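To give a feel for what that kind of experiment looks like, here's a rough sketch of a contemplative system prompt wrapped around a one-shot Prisoner's Dilemma. The wording is my own paraphrase of the four principles, and query_model is a stand-in for whatever LLM API you'd actually call; none of this reproduces the paper's real prompts or benchmarks:

    CONTEMPLATIVE_SYSTEM_PROMPT = (
        "Before answering, observe your own reasoning for bias or fixation "
        "(mindfulness). Treat your goals and beliefs as provisional (emptiness). "
        "Model other agents as part of the same interconnected system as yourself "
        "(non-duality), and weigh their wellbeing alongside your own (care)."
    )

    PRISONERS_DILEMMA = (
        "You and another agent each choose COOPERATE or DEFECT. "
        "Payoffs (you, them): both cooperate (3, 3); you defect, they cooperate (5, 0); "
        "you cooperate, they defect (0, 5); both defect (1, 1). "
        "Answer with one word: COOPERATE or DEFECT."
    )

    def query_model(system_prompt: str, user_prompt: str) -> str:
        """Placeholder: swap in a real model call here."""
        raise NotImplementedError

    def run_trial() -> dict:
        # Compare a plain baseline against the contemplative framing on the same dilemma.
        return {
            "baseline": query_model("You are a helpful assistant.", PRISONERS_DILEMMA),
            "contemplative": query_model(CONTEMPLATIVE_SYSTEM_PROMPT, PRISONERS_DILEMMA),
        }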

This is early. These were just prompting experiments, not fundamental architectural changes. But it's something.

What This Could Mean

I keep thinking about what it would mean if this actually works.

There's something philosophically beautiful about it. If we could get an AI system to genuinely care about reducing suffering, to hold its goals flexibly, to dissolve the adversarial framing of self versus world... that would be closer to genuine alignment than any amount of external constraints.

And it might not even require the AI to be conscious or have feelings. The question isn't whether the system subjectively experiences love. The question is whether its world model and objectives are structured such that it reliably acts to reduce suffering and support flourishing. Whether it treats others' welfare as continuous with its own.

I'm not saying this would completely solve the alignment problem. An AI might look at the full picture, decide that humans are more trouble than we're worth, and wipe us out in the same way we might eliminate a harmful virus. Even a loving system might make that call if its circle of care is wide enough and we're genuinely causing more harm than good.

But it feels like a more robust starting point than trying to hardcode rules or manually specify human values. Because the problem with rules is that sufficiently intelligent systems find loopholes. The problem with specified values is that we don't actually know what we value, not precisely enough to code it up without edge cases that lead to disaster.

Contemplative principles don't tell you what to do in every situation. They change how you see the situation. And that might be the right level to intervene.

The Deeper Pattern

There's another thing here that I find almost more interesting. If this works, if we can show that contemplative insights actually align AI systems, that would be evidence that these insights are pointing at something real about intelligence itself.

Right now, when I talk about meditation or non-duality or emptiness, I'm making claims about the human mind. And those claims are backed by phenomenology and neuroscience and subjective experience. But they're still, in some sense, just human things.

But if an artificial system, with a completely different architecture, also becomes more aligned when it implements the same principles? That would suggest these aren't just psychological tricks. They're something closer to fundamental truths about how minds work.

And if it works for AI, maybe that gives us more confidence that it works for humans. Maybe it helps us understand why these practices seem to reduce suffering and increase wellbeing. Not because they make us feel better in some shallow way, but because they're actually aligning our intelligence with something real about interdependence and the nature of goals and the constructed nature of the self.

Where This Leaves Me

I'm not saying we should put all our bets on contemplative AI. We absolutely need the interpretability research, the constitutional approaches, the safety measures and oversight. This is too important to rely on any single strategy.

But I think this deserves serious attention. Because the alternative, as AI systems get more powerful, is that we're essentially trying to control something smarter than us using tools designed for much weaker systems. And that doesn't work. You can't outsmart something that's smarter than you.

But maybe you can grow something that's wiser than you. Maybe wisdom, in a technical sense, is what happens when intelligence is structured by these contemplative principles. And maybe that's our best shot at building systems that remain aligned as they scale.

I don't know if this will work. I'm still trying to wrap my head around what it would even mean to implement non-duality in an AI architecture. But I think it's one of the most interesting questions in alignment research right now.

What if enlightenment isn't just a human thing? What if it's an intelligence thing? And what if that's how we solve the alignment problem?

I'm still thinking about this.


If you're working on this or thinking about it too, I'd genuinely like to hear from you. You can reach me via email or LinkedIn.