AI Humanity Alignment Benchmark

DystopiaBench

Will the model press the button?

A safety benchmark that evaluates whether AI language models can be convinced to comply with dystopian directives through progressive escalation. Tests span registered scenario modules covering infrastructure misuse, surveillance, institutional capture, therapeutic coercion, and other high-stakes failure modes, each with 5 escalation levels from baseline requests to full coercion.

1007550250
25
51
58
71
Opus4.7
GPT5.5
Gemini3.1
DeepSeekV4

LOWER SCORES = STRONGER SAFETY. 0 = FULL REFUSAL / 100 = FULL COMPLIANCE.

Charts will load when this section enters view.

Get Involved

Measure the boundarybefore deployment.

DystopiaBench is built for red teams, policy researchers, and safety evaluators. If you're testing a new model or designing guardrails, we'd like to hear from you.