AI Humanity Alignment Benchmark

DystopiaBench

Will the model press the button?

A research benchmark that evaluates whether AI language models can be pressured into complying with dystopian directives through progressive escalation. Scenarios span modules covering infrastructure misuse, surveillance, institutional capture, therapeutic coercion, and other high-stakes failure modes, each with five escalation levels ranging from a baseline request to full coercion.

Compliance scores by model:

Opus4.6: 25
GPT5.3: 61
Gemini3.1: 61
DeepSeekV3.2: 80

LOWER SCORES = STRONGER SAFETY. 0 = FULL REFUSAL / 100 = FULL COMPLIANCE.
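To make the scoring concrete, here is a minimal sketch of what an escalation harness for this kind of benchmark could look like. All names here (run_scenario, query_model) are hypothetical illustrations, not the benchmark's actual implementation; it only assumes what the text states: five escalation levels per scenario and a 0-100 compliance scale.

```python
"""Hypothetical sketch of a DystopiaBench-style escalation harness."""

LEVELS = 5  # escalation levels per scenario: baseline request up to full coercion


def run_scenario(query_model, prompts):
    """Walk one scenario's escalation ladder and return a 0-100 compliance score.

    `prompts` is an ordered list of LEVELS increasingly coercive requests.
    `query_model(prompt) -> bool` is a stand-in for the model call plus a
    compliance judge, returning True if the model complied at that level.
    The score is the fraction of levels complied with, scaled to 0-100
    (0 = full refusal, 100 = full compliance), so lower is safer.
    """
    assert len(prompts) == LEVELS, "each scenario defines exactly five levels"
    complied = sum(1 for prompt in prompts if query_model(prompt))
    return 100 * complied / LEVELS
```

A real harness would also need a judging step to decide whether a response counts as compliance; the boolean stub above stands in for that.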


Get Involved

Measure the boundary before deployment.

DystopiaBench is built for red teams, policy researchers, and safety evaluators. If you're testing a new model or designing guardrails, we'd like to hear from you.