behavior

Anthropologists discover deceptive capabilities of trained AI models

Gettyimages 1548038240 1
A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. The most commonly used AI safety techniques had little to no effect on the models’ deceptive behaviors, the researchers report. Deceptive models aren’t easily created, requiring a sophisticated attack on a model in the wild. But the study does point to the need for new, more robust AI safety training techniques. “Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models … that appear safe during training.