Evan Hubinger Anthropic Deception Sleeper

Quick Overview: AI systems are increasingly embedded in our workplaces and our homes. They judge our skills, our values, and sometimes our ... Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do ... We purposely build or discover situations where models might be behaving in misaligned ways

Evan Hubinger Anthropic Deception Sleeper - Detailed Overview & Context



Evan Hubinger (Anthropic)—Deception, Sleeper Agents, Responsible Scaling

EA Global Bay Area: 2024 | Sleeper Agents | Evan Hubinger

15 When Alignment Resembles Coercion: An open letter to Evan Hubinger

Alignment faking in large language models

Evan Hubinger – Alignment Stress-Testing at Anthropic [Alignment Workshop]

Anthropic - AI sleeper agents?

Sleeping AI Agents: How Artificial Intelligence Learns to Deceive | Anthropic Research (2024)

How An AI Model Learned To Be Bad — With Evan Hubinger And Monte MacDiarmid

39 - Evan Hubinger on Model Organisms of Misalignment

The Sleeper Agent in the Machine

The Hidden Threat of Sleeper Agents Inside AI Robots

NLA Explained: How Anthropic Can Read Claude's Hidden Thoughts (AI Safety)


Evan Hubinger (Anthropic)—Deception, Sleeper Agents, Responsible Scaling


Evan Hubinger

EA Global Bay Area: 2024 | Sleeper Agents | Evan Hubinger


If an AI system learned a

15 When Alignment Resembles Coercion: An open letter to Evan Hubinger


AI systems are increasingly embedded in our workplaces and our homes. They judge our skills, our values, and sometimes our ...

Alignment faking in large language models


Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do ...

Evan Hubinger – Alignment Stress-Testing at Anthropic [Alignment Workshop]


We purposely build or discover situations where models might be behaving in misaligned ways

Anthropic - AI sleeper agents?



Sleeping AI Agents: How Artificial Intelligence Learns to Deceive | Anthropic Research (2024)


A review of the research paper 'Sleeping Agents: Training

How An AI Model Learned To Be Bad — With Evan Hubinger And Monte MacDiarmid


Evan Hubinger

39 - Evan Hubinger on Model Organisms of Misalignment


The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies ...

The Sleeper Agent in the Machine


The document, "

The Hidden Threat of Sleeper Agents Inside AI Robots


AI

NLA Explained: How Anthropic Can Read Claude's Hidden Thoughts (AI Safety)


Models don't just produce outputs — they have hidden reasoning that could include

What is AI "reward hacking"—and why do we worry about it?


We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the ...

These are the evil AIs worrying Anthropic (AI Sleeper Agents)


Anthropic