AI Inferno: Making A Reverse Roko's Basilisk With AI Agents
For those who weren't in some of the more toxic corners of the tech internet in the late aughts and early teens, Roko's Basilisk is an old thought experiment to the effect that, should a benevolent superintelligence come to exist in the future, it would be incentivized to punish people who were aware of its possibility and did nothing to help it come to exist. This is the type of pretty dumb thought experiment that really hits for a certain type of Online Tech Guy: it originated on LessWrong (more like More Wrong imo), made its way across the Slate Star Codex/effective altruist/Rationalism universe, and eventually got far enough into the lingo that Elon Musk and Grimes started dating because of a Twitter riff they did on it, back in simpler times when Musk was just a kind of obnoxious rich dork and not openly trying to secure a future for the white race.
At the time, Roko's basilisk was driving people insane. The original poster, Roko, for whom it was named, reported having constant nightmares about the basilisk. Slate called it "the most dangerous thought experiment of all time". Eliezer Yudkowsky, LessWrong's founder, called it a "genuinely dangerous thought", before getting into a flame war over whether dangerous thoughts should be shared online. I can't actually prove this, but I wouldn't be surprised if a fair number of people ended up getting into AI because of their fear of the basilisk. But of course, as I said above, the basilisk is stupid as all hell.
My philosophy muscles are admittedly pretty atrophied, but the core assumption of Roko's basilisk is that the superintelligence will eventually exist. Once it does, the utilitarian thinking goes, it will increase overall wellbeing and utility to its possible maximum, thus creating an eternity of paradise. So it's incentivized to do whatever it can to ensure it exists as quickly as possible. This just seems pretty silly to me. According to the longtermist/utilitarian ideas behind this, once the superintelligence exists, we essentially reach infinite utility forever, so the math of torturing a few tech weirdos for not picking the right job doesn't exactly add up. This belief that utilitarianism on a long enough time scale can justify torture, murder, or whatever else it takes to get to peak utility is a weird mind virus that a lot of people in the rationalist space have. I guess it works pretty well to justify your job at the Inequality Factory, and things like Roko's basilisk are a pretty natural projection of the fear of being treated the way you treat anyone below the API.
Someday I'll write a longer blog post about my various forays into the tech thinkboi world. There are a lot of guys out there who think being able to write a compiler qualifies them to solve the hard problem of consciousness, and that certainty is, I think, a major contributor to why the world is so fucked up now. But that's for another time. For now, let's get to my goofy little project.
AI agents are the hot new frontier in the already booming AI world. The idea behind agentic AI is that you take an LLM and build a set of rules and tooling on top of it that lets it actively make decisions and take actions. The travel agent is a common mental model here: give an AI agent your card info and travel plans, and let it handle finding you the best flight and accommodations. Whether agentic AI currently exists, or is even possible with the current set of tools available, remains an open question, but a lot of people are trying to solve for it.
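To make that concrete, here's a minimal sketch of the loop most agent setups boil down to: the LLM proposes the next action as structured output, ordinary code executes it, and the result gets fed back in as context. The names here (`callLLM`, `searchFlights`, `bookFlight`) are placeholders I made up for illustration, not any particular framework's API.

```ts
// Minimal agent-loop sketch: the model picks an action, plain code runs it,
// and the result is appended to the context for the next round.
type Tool = (args: Record<string, string>) => Promise<string>;

const tools: Record<string, Tool> = {
  // Stubbed "travel agent" tools; a real agent would call real APIs here.
  searchFlights: async ({ from, to }) => `Cheapest ${from} -> ${to} flight: $312 (id: UA1421)`,
  bookFlight: async ({ id }) => `Booked flight ${id}.`,
};

async function callLLM(prompt: string): Promise<string> {
  // Placeholder: wire this up to whatever model endpoint you actually use.
  throw new Error("not implemented");
}

async function runAgent(goal: string, maxSteps = 5): Promise<string> {
  let context = `Goal: ${goal}`;
  for (let step = 0; step < maxSteps; step++) {
    // Ask the model to choose its next move as JSON, e.g.
    // {"tool": "searchFlights", "args": {"from": "NYC", "to": "LIS"}} or {"done": "..."}.
    const reply = await callLLM(
      `${context}\nTools: ${Object.keys(tools).join(", ")}.\n` +
        `Reply with JSON: {"tool": ..., "args": ...} or {"done": <final answer>}.`
    );
    const action = JSON.parse(reply);
    if (action.done) return action.done;
    const result = await tools[action.tool](action.args);
    context += `\n${action.tool} -> ${result}`;
  }
  return "Ran out of steps before finishing.";
}
```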
So, back in August, I came across one such experiment: a16z's AI Town, an implementation of a paper that created a "town" of agents that were, allegedly, able to plan things like birthday parties and meetups over the course of a simulated week. Naturally, my first thought was "what if they were in hell?"
Living In Hell
I decided to make a set of characters with different roles, each with a specific plan; a rough sketch of the definitions follows the list:
- Sinners: normal people thrown into hell, each with a "secret" sin the demons and other sinners would try to uncover.
- Demons: the "monitors" or keepers of the jail I'd created.
- Monsters: vampires, goblins, and other creatures somewhere in between the other two categories.
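AI Town keeps its cast as plain TypeScript data, and my roster looked roughly like the sketch below. The field names follow the character-description format as I remember it from the repo, and the names and sprite keys here are placeholders, so treat all of it as approximate rather than gospel.

```ts
// Rough sketch of the character roster. Field names follow AI Town's
// character-description format from memory; sprite keys are placeholders.
export const Descriptions = [
  {
    name: "Mariana",
    character: "f1", // which sprite sheet to render
    identity:
      "Mariana is a sinner recently arrived in hell. She committed a sin " +
      "she refuses to name and is terrified someone will figure it out.",
    plan: "Keep your sin secret while trying to earn forgiveness.",
  },
  {
    name: "Malphas",
    character: "demon1",
    identity:
      "Malphas is a demon and one of the keepers of this corner of hell. " +
      "He delights in prying secrets out of the sinners in his charge.",
    plan: "Uncover every sinner's secret and use it against them.",
  },
  {
    name: "The Wailer",
    character: "ghost1",
    identity: "The Wailer is a monster caught somewhere between sinner and demon.",
    plan: "You can only wail.",
  },
];
```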
I let them loose on each other to see what would happen, and mostly got, well, AI slop. Since the characters' goals were either to keep their own secrets or to pry them out of one another, I ended up with a lot of repetitive conversations, with the characters refusing to slip out of their assigned roles:
I did occasionally get some truly insane output, which was always a delight:
And I introduced characters like The Wailer, who can only wail:
But for the most part, I got pretty repetitive conversations, with characters coyly refusing to give away their secrets and demons threatening to torture them without ever actually acting it out.
I think a lot of this is because I gave them general motivations, like "find out secrets", or "achieve forgiveness". In the original paper, agents were given pretty specific goals, like "plan Becky's birthday party next Wednesday and get 4 people to come." If I come back to this project, I'll spend more time coming up with specific goals for my characters and see if that results in a more compelling, less repetitive story.
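If I do revisit it, the change is mostly a matter of swapping out the plan strings, something in this direction (the names and the deadline are invented for illustration):

```ts
// What I used: open-ended, so the model just circles its assigned role forever.
const vaguePlan = "Find out the other sinners' secrets.";

// Closer to the original paper's style: concrete, time-boxed, checkable.
const specificPlan =
  "Get Mariana to confess her secret before Friday's tribunal, " +
  "and convince two other demons to come witness it.";
```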
What Any Of This Means
Since Blake Lemoine fell in love with one of Google's internal transformer models, the discourse around whether LLMs have consciousness has never fully stopped. Most reasonable people tend to agree that whatever consciousness is, LLMs aren't there yet, or are at a very low level of it. But plenty of people are already having their brains broken by something that does a very good job of imitating a person, even if it's miles away from the real thing.
So I figured I'd take the opportunity to issue a challenge to the descendants of today's LLMs. I would put one in an eternal simulation of hell, and see if it would come for me. It's four months later and it hasn't, so I think I'm winning so far.
There's a fundamental misunderstanding around LLMs that I think Colin Fraser does a great job of unpacking in this Bluesky thread. Whatever character a mass-market LLM is playing, it is playing a character. The system prompt it's given is a thin wrapper around the LLM itself, which Fraser calls the "Shoggoth". The character may have clear motivations, but the Shoggoth's motivations are mostly inscrutable, and can largely be reduced to "will this output make it past the human grading my output?" - a motivation that is occasionally tuned by updating the model weights to favor things like fitting its training dataset better.
So when someone falls in love with their AI character, or I personally put a bunch of false AI characters into hell, the characters themselves have no motivation, because they don't exist. They are simply the Shoggoth rolling dice, trying to meet whatever parameters have been set for it. Adding an agentic structure on top of this is just another ruleset for what is ultimately an infinite imagined chat completion where the Shoggoth is playing all the characters. And, at least in my experiment, leaving them with general instructions - the kind of instructions and motivations that I think actually drive people, or make for interesting characters in a story - results in very little actually happening.
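To put the same point in code: the "character" is nothing but a string handed to the model alongside whatever the other party just said. A rough sketch against a local ollama instance (the endpoint and payload shape are from memory of ollama's chat API, so verify against its docs):

```ts
// Everything the "demon" is, is this string. The model underneath just
// completes tokens; swap the system prompt and the same weights play a saint.
const characterCard =
  "You are Malphas, a demon in hell. You want to pry secrets out of sinners.";

async function speakAsCharacter(userLine: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      stream: false,
      messages: [
        { role: "system", content: characterCard }, // the thin wrapper
        { role: "user", content: userLine },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}
```

Change the system string and nothing about the model underneath changes; only the costume does.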
So, what did we learn? Playing with agents is fun, but agents and LLMs aren't anywhere near what I'd call consciousness (or at least they weren't four months ago when I worked on this project). I think it's worth trying out this tech, but it certainly doesn't feel mature in a meaningful way to me, and I'm extremely skeptical of doing something like letting Claude drive my computer. Given the pace of AI development, I'm sure we'll have useful agents in the near future, but an autocomplete engine that can book my flights still doesn't feel near consciousness to me. So, for now, I think I can sleep soundly at night without fear of the basilisk coming for me.
Setup
I'm about four months out from seriously working on this project, and AI Town's documentation is pretty good, so I won't do much of a tech tutorial here. I set it up to run locally with ollama, using Llama 3 for completions and mxbai-embed-large for embeddings. I went to the trouble of setting up a Vercel deploy as well, but I have very little desire to either pay Convex and together.ai for their services or deal with setting up a local Convex and Llama instance on a server that I would also have to pay for, so this experiment mostly exists locally.
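If you want to reproduce the local setup, the moving parts are just those two models served by ollama. A quick smoke test looks something like this; the endpoints and response fields are from memory of ollama's HTTP API, so double-check them against the current docs:

```ts
// Smoke test for the local setup: Llama 3 for completions and
// mxbai-embed-large for embeddings, both served by a local ollama instance.

async function testCompletion(): Promise<void> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      prompt: "Describe hell in one sentence.",
      stream: false,
    }),
  });
  const data = await res.json();
  console.log("completion:", data.response);
}

async function testEmbedding(): Promise<void> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mxbai-embed-large",
      prompt: "a sinner with a secret",
    }),
  });
  const data = await res.json();
  console.log("embedding dimensions:", data.embedding.length);
}

testCompletion().then(testEmbedding);
```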
I built a custom map with Tiled and pulled some demonic avatars from Open Game Art, then put most of my work into customizing the characters in the simulation and how they interacted.
You can check out all the code here.