Discussion about this post

Ken Kahn

I wonder if this points to a solution: Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks https://www.anthropic.com/research/next-generation-constitutional-classifiers

Saty Chary

Hi Ben, interesting! I wonder if, for image gen at least, the 'final' image could first be scanned for maliciousness by a judge and not displayed if it gets flagged. IOW, if assessing intent isn't feasible over a chain, do it on the end result.
