Quick Overview: AI systems are increasingly embedded in our workplaces and our homes. They judge our skills, our values, and sometimes our ... Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do ... We purposely build or discover situations where models might be behaving in misaligned ways.
Evan Hubinger Anthropic Deception Sleeper - Detailed Overview & Context
A review of the research paper 'Sleeper Agents: Training ... The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies ... Models don't just produce outputs; they have hidden reasoning that could include ...
We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the ...