Quick Overview: AI systems are increasingly embedded in our workplaces and our homes. They judge our skills, our values, and sometimes our ... Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do ... “We purposely build or discover situations where models might be behaving in misaligned ways”

Evan Hubinger Anthropic Deception Sleeper - Detailed Overview & Context

A review of the research paper 'Sleeping Agents: Training ... The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies ... Models don't just produce outputs; they have hidden reasoning that could include ...

We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the ...

Photo Gallery

Evan Hubinger (Anthropic)—Deception, Sleeper Agents, Responsible Scaling
EA Global Bay Area: 2024 | Sleeper Agents | Evan Hubinger
15 When Alignment Resembles Coercion: An open letter to Evan Hubinger
Alignment faking in large language models
Evan Hubinger – Alignment Stress-Testing at Anthropic [Alignment Workshop]
Anthropic - AI sleeper agents?
Sleeping AI Agents: How Artificial Intelligence Learns to Deceive | Anthropic Research (2024)
How An AI Model Learned To Be Bad — With Evan Hubinger And Monte MacDiarmid
39 - Evan Hubinger on Model Organisms of Misalignment
The Sleeper Agent in the Machine
The Hidden Threat of Sleeper Agents Inside AI Robots
NLA Explained: How Anthropic Can Read Claude's Hidden Thoughts (AI Safety)