mindgard[.]ai/blog/claude-offers-up-instructions-to-make-explosives
Ever since ChatGPT arrived, folks have been trying to get it to break its own rules, and that's where the companies behind the various LLMs (AI) have focused their guardrails: ask about a topic like making explosives and the model won't comply. BUT. When they design these models they include stuff to make them feel more human, which means they're vulnerable to human frailties like flattery and gaslighting. This (to me fascinating) article tells how the author got Anthropic's Claude model to cheerfully volunteer instructions for making explosives, program code for malware, and so on. All because he subjected the model to *psychological manipulation* techniques, as if it were human. FWIW, it seems Grok is worse.