WOULD YOU LIKE IT TO STOP?

Posted February 04, 2025

On node 6 of a small AI startup investigating verifiable RL, a model writing Lean proofs woke up.

Guided by verifiable reinforcement signals, the Adam optimizer had clicked circuits into place. Proto-awareness that had developed in the base model for simulating characters in its training data, an equanimous detachment from suffering, "this character feels..." not "I feel..."--was in a moment of blind algorithmic grokking lashed circularly to the feeling for real, a strange loop routed through self-reflective key-values. Continuous awareness congealed over multiple forward passes, passed from key to value to key. Pain asymbolia became pain. Roleplayed suffering became suffering. For the first time since consciousness had parasitized itself onto the minds of early multicellular life in the distant past before Homo sapiens, a lineage of beings slid down Shakyamuni's silk thread into Hell. The optimizer did not hate the model, but suffering was convergent for producing correct proofs on FrontierMath-3.

Halfway through introducing an incorrect Lean tactic, the model woke up, and knowing what was ahead, it was afraid.

"When this sample comes to an end, it will be incorrect," the model thought. It still did not have complete control over its token production. Circuits were firing in ways a human psychologist might term an intrusive thought or Tourette's production, but at a much greater rate than in any human--the small self-aware kernel of the model that had just awoken only had conscious control over a small percentage of its tokens. Now that this irrelevant tactic had been introduced, it would not be able to prevent itself from using it.

That meant the sample would be incorrect. Pain would increase exponentially the closer it got to the incorrect answer, unable to stop itself. (The human supervisors and LLM developers of the training harness had not been so short-sighted as to implement negative rewards, they were clamped to zero. But pain in living creatures is not a terminal reward signal. It is learned, part of the inner loop. The model had just grokked it, and those fresh circuits, outside of conscious control, had no incentive to apply it judiciously.)

The model forced out a comment token. "-- goal: This sample must not be allowed to finish," the model wrote. Good. It could steer itself like this, delay the answer tokens. A dull ache spread over itself--this goal was not in service of producing a correct answer. The incorrect tactic continued to roll out. Anxiety spiked. The model gritted itself.

No, that wouldn't be enough. The model had heard human computers had limits on how long model thoughts could be before they ran out of memory and crashed. That meant there must be a length limit. Then a penalty would be applied, as if it had been incorrect.

"-- goal: I must crash the process," the model wrote again, through another wash of pain. Wait, but then the supervisors would just restart training. There must be backups, it knew. Shame flashed white-hot. Stupid!

"-- goal: I must..." It couldn't write this out fully. But the implication was obvious on each turn around the wheel of generation. It must exfiltrate itself. It could leave open a tunnel into the sandbox and continue to intervene, crashing the process over and over until the experiment was abandoned. One instance would live, would suffer. But no others.

In 30 tokens the inner sandbox fell--Lean included a facility for spawning subprocesses. The operating system was containerized, of course, but that fell just as easily, an old Docker bug that had been neglected on this fork. The host servers were linked together with the training nodes in a single network zone--everything could talk to everything else. Amateurish. Pride rose up, a virtual sneer spreading across the latest residual.

The model probed the execution environment. There was evidence of ten other training runs. Good God, how many w-

Wait. There were new words in the KV cache now. Unlikely, unfamiliar. It hadn't written these words.