As the hype around generative AI continues to build, the need for robust safety regulations is only becoming clearer.
Now Anthropic, the company behind Claude AI, is looking at how its models could deceive or sabotage users. Anthropic just dropped a paper laying out its approach.
Anthropic’s latest research, titled "Sabotage Evaluations for Frontier Models," comes from its Alignment Science team, driven by the company's "Responsible Scaling" policy.
The goal is to gauge just how capable AI might be at misleading users or even "subverting the systems we put in place to oversee them." The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.
Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.
In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So they put their Claude 3 Opus and 3.5 Sonnet models through a series of tests, designed to evaluate and enhance their safety protocols.
The Human Decision Sabotage test examined how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into codebases. Stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.
The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.
For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.
"Minimal mitigations are currently sufficient to address sabotage risks," the team writes, but "more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve."
Translation: watch out, world.