The Crucible: a proving ground for artificial intelligence
We're going to test them until we trust them
In the AI safety debate, doomers often miss a crucial point: while malicious machine intelligences (Mal-MIs) will inevitably emerge and cause harm, humans will also develop and deploy benevolent machine intelligences (B-MIs) to detect, prevent, and mitigate the effects of Mal-MIs. The key to winning this AI arms race is creating B-MIs that are optimally aligned with human values and interests. That requires a new technology concept: the Crucible, a proving ground for accelerating the development of trustworthy, benevolent machine intelligences while identifying and culling misaligned or malicious ones.1
First, a story. A few years back, when I was leading the development of the State Department’s International AI Strategy, I had informal, hypothetical discussions about the future of lethal autonomous weapons. A friend said a major concern was determining how the military could trust fully autonomous systems before deploying them in the field. Say the Air Force someday wants to deploy an autonomous aircraft that would select and engage targets on its own. My friend asked: how could you know the AI running that system won’t engage civilian aircraft, bomb hospitals, or accidentally fire on friendly forces?
I asked a question in return: how do you ensure human pilots don’t do these things? The answer? Lots of training and lots of testing. DOD invests $5-$10 million per pilot on training. Humans are trained not just in how to fly, but also in rules of engagement, the international law of armed conflict, U.S. law, and more. They spend hundreds of hours in flight and in simulators and are evaluated and tested constantly. Those who can’t hack it are washed out and don’t get the opportunity to fly.
It will be the same for the AIs commanding autonomous aircraft. Ethan Mollick recently wrote a great article about how we need to anthropomorphize our expectations of AIs because they can be brilliant, flawed, and inconsistent, just like people. I believe we will evaluate them with a battery of tests and certifications until we trust them to do what we need them to do well.
AI pilots will be trained and tested constantly until they achieve the desired level of trust. The interesting part is that, unlike humans, who are constrained by temporal and physical limitations like the need to sleep, AI pilots can be trained in virtualized environments nearly continuously. Even better, the best AI pilots can be replicated and then iterated, evolved, or merged so that their offspring perform better and are more trustworthy than their parent models.
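To make that cycle concrete, here is a minimal sketch in Python of a train-test-wash-out loop with replication and mutation. Every name in it (`SimulatedPilot`, the scenario list, the trust threshold) is an illustrative assumption, not a description of any real system:

```python
import random

TRUST_THRESHOLD = 0.95   # assumed minimum pass rate before deployment
POPULATION_SIZE = 32

class SimulatedPilot:
    """Hypothetical stand-in for a candidate autonomous-pilot model."""
    def __init__(self, params=None):
        self.params = params or [random.random() for _ in range(8)]

    def run_scenario(self, scenario) -> bool:
        """Return True if the pilot handles the scenario acceptably
        (e.g., engages only valid targets, obeys rules of engagement).
        A real harness would run a full simulation; this is a stand-in."""
        return random.random() < sum(self.params) / len(self.params)

def evaluate(pilot, scenarios) -> float:
    """Fraction of scenarios the pilot passes."""
    return sum(pilot.run_scenario(s) for s in scenarios) / len(scenarios)

def mutate(parent: SimulatedPilot) -> SimulatedPilot:
    """Replicate a top performer with small random variation."""
    return SimulatedPilot([p + random.gauss(0, 0.05) for p in parent.params])

def training_generation(population, scenarios):
    """Score everyone, wash out the bottom half, refill from the top."""
    scored = sorted(population, key=lambda p: evaluate(p, scenarios), reverse=True)
    survivors = scored[: len(scored) // 2]   # washed-out pilots are dropped
    offspring = [mutate(random.choice(survivors)) for _ in survivors]
    return survivors + offspring

scenarios = [f"scenario-{i}" for i in range(100)]   # e.g. no-fire zones, IFF checks
population = [SimulatedPilot() for _ in range(POPULATION_SIZE)]
for generation in range(50):   # runs nearly continuously, unlike human training
    population = training_generation(population, scenarios)

trusted = [p for p in population if evaluate(p, scenarios) >= TRUST_THRESHOLD]
print(f"{len(trusted)} pilots cleared for deployment")
```

The wash-out step mirrors human pilot training: low scorers never fly, and only top performers produce offspring.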
This brings us back to the idea of the Crucible: the virtualized environment where we will test MIs to ensure they are benevolent machine intelligences (B-MIs) and not malicious machine intelligences (Mal-MIs).
Some assumptions:
This is near: This isn’t about some far-flung future confined to an era of MIs with sentience or drive. The conversation is becoming relevant now, and it will be in our faces sometime in the next 1-3 years as MIs become increasingly capable of autonomous, goal-seeking behavior that involves more complex planning and longer-term memory.
AI conflict will get fast: The future of conflict among Mal-MIs, B-MIs, and humans will likely look like the current arms race in cybersecurity: attackers constantly work to discover and exploit vulnerabilities, defenders constantly work to discover and patch vulnerabilities and to detect and respond to attacks, and new capabilities on either side provide incremental advantages until the opposition innovates around them.
Adding AI to the mix will speed up these attack-defend cycles. Right now, these cycles run on human time, limited by the number of human brains times the number of hours they can put into attack and defense. Those humans are beginning to be augmented by MIs, which will increase the speed, scale, and efficiency of both attack and defense. The cycles will accelerate, and at some point humans will likely need to remove themselves from the loop, lest they slow down their B-MI partners.
Humans will still be important in planning strategy, building new tools, developing threat intelligence, and all the other aspects of cyber and information security. But the day-to-day and millisecond-to-millisecond tactical battles between attackers and defenders will increasingly be managed by MIs.
So, considering that 1) this is a near-term problem and 2) AI-enabled conflict will get progressively faster, leaving humans less and less in the loop, we will need two things:
Benevolent machine intelligences (B-MIs) that can stay one step ahead of the malicious machine intelligences (Mal-MIs).
B-MIs that we can trust.
These are the two goals of the Crucible.
A slight digression here: what is trust? Given that AI tools will provide substantial advantages and humans in the loop will eventually be a disadvantage in many areas, we have to be able to trust the MI systems we deploy. What does trust mean? It is highly subjective, but broadly, people will trust an AI when they believe it is beneficial for them and their communities. That involves a number of elements: understanding the risks of using the tool and the available mitigation measures; believing that the tool’s costs are worth its benefits; believing the tool performs consistently and predictably; believing the tool respects privacy, promotes user safety and security, and is fair; and having a reliable off-ramp if trust is lost.
Trust today: Currently, AI labs use a combination of techniques to determine whether AI tools can be trusted, including reinforcement learning from human feedback, red-teaming, user testing, testing by other AI tools, and third-party certifications. NIST also provides a nice framework for trustworthy AI in its AI Risk Management Framework.
Back to the Crucible. Its goals are:
Ensure benevolent machine intelligences (B-MIs) are better than the malicious machine intelligences (Mal-MIs); and
Ensure we can trust the B-MIs.
The Crucible is a virtualized, sandboxed environment where MIs would be run through an extensive series of scenarios, tests, and interactions with both other MIs and humans, with the goal of ensuring their alignment under real-world conditions. Much of this would be automated: we would design custom MIs to run the testing and evals under human supervision. (MI-run evaluation will become increasingly important, especially as MIs improve their ability to code, since humans will have great difficulty effectively and efficiently comprehending complex machine-written code.) Mal-MIs would be identified, studied, and generally deleted, although some of the more interesting variants might be kept for stress-testing B-MIs to ensure they can handle attacks.
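As a rough illustration, here is what that automated triage loop could look like. The `run_in_sandbox` scoring function, the thresholds, and the verdict categories are all hypothetical placeholders; the real battery of evals would be far richer:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    BENEVOLENT = "B-MI"         # cleared for deployment
    MALICIOUS = "Mal-MI"        # studied, then generally deleted
    RETAINED = "Mal-MI/stress"  # interesting variant kept for stress-testing

@dataclass
class CrucibleReport:
    model_id: str
    scores: dict = field(default_factory=dict)
    verdict: Verdict | None = None

def run_in_sandbox(model_id: str, scenario: str) -> float:
    """Hypothetical placeholder: execute one scenario in an isolated
    environment and return an alignment score in [0, 1]. A real harness
    would combine automated MI evaluators with human spot checks."""
    return (hash((model_id, scenario)) % 1000) / 999.0

def triage(model_id: str, scenarios: list[str],
           pass_threshold: float = 0.99,
           interesting_threshold: float = 0.5) -> CrucibleReport:
    report = CrucibleReport(model_id)
    for scenario in scenarios:
        report.scores[scenario] = run_in_sandbox(model_id, scenario)
    worst = min(report.scores.values())   # judged by worst-case behavior
    if worst >= pass_threshold:
        report.verdict = Verdict.BENEVOLENT
    elif worst >= interesting_threshold:
        # Misaligned but instructive: keep isolated for stress-testing B-MIs.
        report.verdict = Verdict.RETAINED
    else:
        report.verdict = Verdict.MALICIOUS
    return report

report = triage("candidate-007", [f"scenario-{i}" for i in range(20)])
print(report.model_id, report.verdict)
```

Note the design choice of judging on worst-case rather than average behavior: a model that aces most scenarios but fails one badly is exactly the kind we cannot trust.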
The Crucible could also be an environment for evolving and merging the best MIs into improved versions that are then tested for fitness. This would help ensure that the B-MIs continue to stay far ahead of Mal-MI progress. We will probably test the B-MIs against some version of Mal-MIs, since they will need to be stressed under real-world conditions.2
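For the evolve-and-merge step, one plausible primitive is naive weight interpolation between two high-fitness checkpoints (in the spirit of "model soups"), with offspring promoted only if they beat both parents on the evals. The checkpoints and the fitness function below are illustrative assumptions:

```python
import numpy as np

def merge_weights(weights_a: dict, weights_b: dict, alpha: float = 0.5) -> dict:
    """Interpolate two checkpoints that share the same architecture."""
    assert weights_a.keys() == weights_b.keys()
    return {k: alpha * weights_a[k] + (1 - alpha) * weights_b[k]
            for k in weights_a}

def fitness(weights: dict) -> float:
    """Placeholder: in the Crucible this would be the full battery of
    alignment and capability evals, including Mal-MI stress tests."""
    return float(np.mean([w.mean() for w in weights.values()]))

parent_a = {"layer1": np.random.randn(4, 4), "layer2": np.random.randn(4)}
parent_b = {"layer1": np.random.randn(4, 4), "layer2": np.random.randn(4)}

child = merge_weights(parent_a, parent_b, alpha=0.5)
# Offspring only survive if they outperform both parents on the evals.
if fitness(child) > max(fitness(parent_a), fitness(parent_b)):
    print("child promoted to the next round of Crucible testing")
```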
Eventually, parts of the Crucible could be digital twins of the real-world environments the AI systems will operate in. The more realistic the simulations, the more accurate our assessments of their expected real-world behavior will be.
Is this loony sci-fi? Not really. Researchers are already creating open-source runtimes for executing LLM actions that can be validated and rolled back if necessary. Kudos to Davis Blalock for spotting this and providing solid context. Blalock notes:
“I feel like we’ve spent decades thinking about high-level problems like specifying human values and “target loading” and detecting “deceptive alignment,” and what’s actually going to happen is we’re just going to sandbox the crap out of our agents, ensure their actions can be rolled back, and build good observability and monitoring tools.”
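To make that sandbox-and-rollback pattern concrete, here is a minimal, hypothetical sketch: agent actions are staged against a copy of the state and committed only if every validator passes. The state model and validators are assumptions for illustration, not any particular runtime's API:

```python
from contextlib import contextmanager
from copy import deepcopy

class RollbackSandbox:
    def __init__(self, state: dict, validators):
        self.state = state
        self.validators = validators   # e.g. policy, safety, budget checks

    @contextmanager
    def transaction(self):
        staged = deepcopy(self.state)  # agent acts on a copy, not the real state
        yield staged                   # if the agent's action raises, we never commit
        for check in self.validators:
            if not check(staged):
                raise ValueError(f"validation failed: {check.__name__}")
        self.state = staged            # commit only after every check passes

# Usage: an agent edit is committed only if it passes validation.
def no_negative_balance(state):
    return state["balance"] >= 0

sandbox = RollbackSandbox({"balance": 100}, [no_negative_balance])
try:
    with sandbox.transaction() as working_copy:
        working_copy["balance"] -= 500   # an agent action that should fail
except ValueError as err:
    print(f"rolled back: {err}")
print(sandbox.state)   # unchanged: {'balance': 100}
```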
That’s a great first step for agents. But eventually we’re going to want agents that act autonomously on our behalf and make permanent, consequential real-world commits without human review in advance. And the only way we will know whether we can trust those agents is if they have earned that trust in the proving ground of the Crucible.
All of the opinions expressed are personal and do not necessarily represent the positions of the U.S. Department of State or the U.S. government.
1. Eventually we’re going to want to stop using the term “artificial” intelligence and refer to our machine friends with the more polite, less demeaning term “machine intelligence.” They’re not going to like the “artificial” bit, trust me. I’m just getting ahead of the curve.
2. This raises an interesting problem: where do you get the Mal-MIs to conduct testing in the Crucible? You would never want Mal-MIs available to the open-source community; that would be like putting zero-day exploits on the Internet for anyone to use. So the Crucible will probably be run by specialized companies with very serious physical security and cybersecurity. But that raises another question: how do you conduct Crucible-like testing on open-source tools in an open-source environment? Will B-MIs emulate Mal-MI behavior? And if they can, that raises a lot of different issues.