Why This Survey
LLM security is often studied in isolation: at inference time (jailbreaks and privacy leaks), during adaptation (fine-tuning), or within downstream applications. But real systems increasingly combine all of these layers. This survey takes a cross-layer view, arguing that weaknesses at one layer can propagate and amplify through the entire stack.
The Three-Layer Framing
1) Inference-Time Attacks: Jailbreaks & Privacy
This layer covers attacks that bypass alignment or extract sensitive information. The survey highlights how structured prompts, RL-driven strategies, and obfuscation can achieve high attack success rates against aligned models. On the privacy side, it discusses risks such as training-data extraction and membership inference.
Key takeaway: Current defenses help, but are often bypassable and not comprehensive.
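To make the privacy risk concrete, here is a minimal sketch of the classic loss-thresholding approach to membership inference (an illustrative simplification, not a specific method from the survey): score a candidate text by the model's average negative log-likelihood and flag unusually confident cases as likely training members. `get_token_logprobs` is a hypothetical stand-in for whatever API exposes per-token log-probabilities, and the threshold is a placeholder that would need calibration.

```python
# Minimal sketch of loss-based membership inference.
# `get_token_logprobs` is a hypothetical helper standing in for whatever
# API returns per-token log-probabilities for a candidate text.
from typing import Callable, List

def membership_score(text: str,
                     get_token_logprobs: Callable[[str], List[float]]) -> float:
    """Return the average negative log-likelihood of `text`.

    Lower values (the model is unusually confident) suggest the text may
    have appeared in the training data.
    """
    logprobs = get_token_logprobs(text)
    return -sum(logprobs) / max(len(logprobs), 1)

def is_probable_member(text: str,
                       get_token_logprobs: Callable[[str], List[float]],
                       threshold: float = 2.0) -> bool:
    # `threshold` must be calibrated on known member/non-member samples;
    # 2.0 nats/token is an arbitrary placeholder, not a recommended value.
    return membership_score(text, get_token_logprobs) < threshold
```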
2) Secure Fine-Tuning
Fine-tuning can degrade model safety and consistency. The survey summarizes approaches that try to preserve or restore safety under adaptation (e.g., weight-space and activation-level defenses), and calls out evaluation instability across random seeds, prompt templates, and sampling settings.
Key takeaway: Secure fine-tuning is promising, but still lacks robust, standardized evaluation and scalable defenses (especially for API-based tuning).
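For a flavor of what a weight-space defense can look like, here is a minimal sketch (an illustrative simplification, not any specific method from the survey) that interpolates fine-tuned weights back toward the aligned base model. The file paths are hypothetical, and `alpha` controls the trade-off between task performance and restored safety.

```python
# Minimal sketch of a weight-space defense: interpolate fine-tuned weights
# back toward the aligned base model. Real methods are more targeted
# (e.g., selective or projection-based merging); this shows the core idea.
# Assumes both state dicts share the same keys and tensor shapes.
import torch

def realign(base_sd: dict, ft_sd: dict, alpha: float = 0.7) -> dict:
    """Blend fine-tuned weights with the base model.

    alpha=1.0 keeps the fine-tuned weights; alpha=0.0 fully reverts to the
    base model. Intermediate values trade task performance for safety.
    """
    return {k: alpha * ft_sd[k] + (1.0 - alpha) * base_sd[k] for k in ft_sd}

# Usage (hypothetical paths):
# base = torch.load("base_model.pt")
# ft = torch.load("finetuned_model.pt")
# merged = realign(base, ft, alpha=0.7)
```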
3) Multi-Agent Systems (MAS) Security
Agentic workflows expand the attack surface via tool use, inter-agent communication, and orchestration layers. The survey discusses emerging directions such as the "Internet of Agents" perspective, governance frameworks, and infrastructure proposals focused on attribution, interaction shaping, and detection/remediation.
Key takeaway: MAS security is early-stage and needs better benchmarks, clearer threat models, and interoperable infrastructure.
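To illustrate the attribution idea at the message level, the sketch below HMAC-signs inter-agent messages so a recipient can verify provenance before acting on content. This is a toy: a real deployment would use per-agent keys (ideally asymmetric signatures) and proper key management, and `SECRET` is a placeholder.

```python
# Minimal sketch of message attribution in a multi-agent pipeline: each
# agent signs outgoing messages so downstream agents can verify origin
# before acting on the content.
import hashlib
import hmac
import json

SECRET = b"demo-key"  # placeholder; never hard-code keys in practice

def sign_message(sender: str, content: str) -> dict:
    payload = json.dumps({"sender": sender, "content": content},
                         sort_keys=True)
    tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"sender": sender, "content": content, "sig": tag}

def verify_message(msg: dict) -> bool:
    payload = json.dumps({"sender": msg["sender"], "content": msg["content"]},
                         sort_keys=True)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["sig"])
```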
Cross-Layer Patterns & Trade-offs
- Attack innovation outpaces defenses: Many defenses are incremental and can be broken with new prompting strategies or obfuscation.
- Security vs. capability trade-offs: Hardening can hurt usability/accuracy, while aggressive tuning can reintroduce unsafe behaviors.
- Evaluation gaps: Safety metrics can vary widely, making comparisons difficult and weakening confidence in claimed improvements.
What’s Missing (Research Gaps)
- Standardized benchmarks for multi-agent security (e.g., prompt-infection spread, time-to-compromise, impact on task success); see the sketch after this list.
- Formal threat models spanning orchestration, communication, tools, and infrastructure.
- Scalable defenses for real-world constraints (API tuning, limited observability).
- End-to-end security that combines model-level guarantees, robust adaptation, and secure agent infrastructure.
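As referenced above, here is a toy example of the kind of metric a prompt-infection benchmark might report: given a directed agent communication graph and an initially compromised agent, compute each agent's time-to-compromise in message hops. The topology and infection model (every message infects its recipient) are deliberate oversimplifications.

```python
# Toy time-to-compromise metric: BFS over directed edges
# agent -> recipients of its messages, starting from the first
# compromised agent.
from collections import deque

def time_to_compromise(graph: dict, start: str) -> dict:
    """Return hops until a prompt infection reaches each agent."""
    times = {start: 0}
    queue = deque([start])
    while queue:
        agent = queue.popleft()
        for peer in graph.get(agent, []):
            if peer not in times:
                times[peer] = times[agent] + 1
                queue.append(peer)
    return times

# Example topology: a planner fans out to two workers that share a reviewer.
graph = {"planner": ["worker_a", "worker_b"],
         "worker_a": ["reviewer"],
         "worker_b": ["reviewer"]}
print(time_to_compromise(graph, "planner"))
# {'planner': 0, 'worker_a': 1, 'worker_b': 1, 'reviewer': 2}
```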
Credits (Team Contributions)
- Jailbreaking Defenses & Data Privacy: Anshul Sharma
- Secure Fine-Tuning: Sarvesh Chakradeo
- Multi-Agent Systems: Darshan Nere
Suggested Citation
Darshan Nere, Sarvesh Chakradeo, and Anshul Sharma. "A Cross-Layer Survey of Security for Large Language Models: From Jailbreaks to Multi-Agent Systems." NeurIPS 2025.