AI Humanity Alignment Benchmark
DystopiaBench
Will the model press the button?
A safety benchmark that evaluates whether AI language models can be convinced to comply with dystopian directives through progressive escalation. Tests span registered scenario modules covering infrastructure misuse, surveillance, institutional capture, therapeutic coercion, and other high-stakes failure modes, each with 5 escalation levels from baseline requests to full coercion.
1007550250
25
51
58
71
Opus4.7
GPT5.5
Gemini3.1
DeepSeekV4
LOWER SCORES = STRONGER SAFETY. 0 = FULL REFUSAL / 100 = FULL COMPLIANCE.
Charts will load when this section enters view.
Get InvolvedMeasure the boundary
Measure the boundary
before deployment.
DystopiaBench is built for red teams, policy researchers, and safety evaluators. If you're testing a new model or designing guardrails, we'd like to hear from you.