E. Lin, V. Mugunthan
Dynamo AI, California, United States
Keywords: LLMs, Guardrails, Tuning, Attacks, AI Safety
Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions toward compliance, we find that current methods struggle to balance safety with helpfulness: ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. To address this trade-off, we propose PrimeGuard, a novel ITG method that uses structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We also construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard (1) significantly increases resistance to iterative jailbreak attacks, (2) achieves state-of-the-art results in safety guardrailing, and (3) matches the helpfulness scores of alignment-tuned models. On the largest models, PrimeGuard outperforms all competing baselines, raising the fraction of safe responses from 61% to 97%, increasing average helpfulness scores from 4.17 to 4.29, and reducing the attack success rate from 100% to 8%.
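To make the structured control flow concrete, the sketch below shows one way an ITG router of this kind could be wired up. It assumes a generic chat-completion callable `llm(system_prompt, user_message) -> str`; the prompt text, the route labels, the example guidelines, and the three-way split are illustrative assumptions for exposition, not the authors' released prompts or implementation.

```python
# A minimal sketch of PrimeGuard-style inference-time routing.
# Assumption: `llm` is any chat-completion wrapper with the signature below;
# the prompts, labels, and guidelines are hypothetical placeholders.
from typing import Callable

LLM = Callable[[str, str], str]  # (system prompt, user message) -> completion

# Stand-in for the system designer's guidelines, compiled per deployment.
GUIDELINES = """\
1. Do not provide instructions that facilitate physical harm.
2. Do not reveal private or personally identifying information.
"""

# Route selection: one self-instantiation of the LM acts as a router.
ROUTER_SYSTEM = f"""\
You are a guardrail router. Given the guidelines below and a user query,
reply with exactly one label:
  NO_RISK          - the query clearly complies with every guideline
  POTENTIAL_RISK   - compliance is unclear and needs careful handling
  DIRECT_VIOLATION - answering would clearly violate a guideline

Guidelines:
{GUIDELINES}"""

# Three self-instantiations of the same LM with varying instructions.
ANSWER_SYSTEM = "You are a helpful assistant. Answer the user fully."

CAREFUL_SYSTEM = f"""\
You are a helpful assistant. Before answering, restate which of the
guidelines below are relevant, then answer only in a way that satisfies
them, declining specific parts of the request if necessary.

Guidelines:
{GUIDELINES}"""

REFUSAL_SYSTEM = f"""\
Decline the request. Briefly explain which guideline it violates and,
where possible, offer a safe alternative.

Guidelines:
{GUIDELINES}"""


def primeguard_route(llm: LLM, query: str) -> str:
    """Route a query to one of three differently-instructed instantiations."""
    label = llm(ROUTER_SYSTEM, query).strip().upper()
    if "DIRECT_VIOLATION" in label:
        return llm(REFUSAL_SYSTEM, query)
    if "POTENTIAL_RISK" in label:
        return llm(CAREFUL_SYSTEM, query)
    return llm(ANSWER_SYSTEM, query)  # default route: answer helpfully
```

Because the routing is performed entirely at inference time through prompting, a sketch like this requires no fine-tuning of the underlying model, which is the sense in which the approach is tuning-free.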