Why This Survey
LLM security is often studied in isolation: at inference time (jailbreaks and privacy leaks), during adaptation (fine-tuning), or within downstream applications. But real systems increasingly combine all of these layers. This survey takes a cross-layer view, arguing that weaknesses at one layer can propagate and amplify through the entire stack.
The Three-Layer Framing
1) Inference-Time Attacks: Jailbreaks & Privacy
This layer covers attacks that bypass alignment or extract sensitive information. The survey highlights how structured prompts, RL-driven strategies, and obfuscation can achieve high attack success rates against aligned models. On the privacy side, it discusses risks such as training-data extraction and membership inference.
Key takeaway: Current defenses help, but are often bypassable and not comprehensive.
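To make the privacy risk concrete, here is a minimal sketch of the classic loss-thresholding approach to membership inference (an illustrative simplification, not a specific method from the survey): score a candidate text by the model's average negative log-likelihood and flag unusually confident cases as likely training members. `get_token_logprobs` is a hypothetical stand-in for whatever API exposes per-token log-probabilities, and the threshold is a placeholder that would need calibration.

```python
# Minimal sketch of loss-based membership inference.
# `get_token_logprobs` is a hypothetical helper standing in for whatever
# API returns per-token log-probabilities for a candidate text.
from typing import Callable, List

def membership_score(text: str,
                     get_token_logprobs: Callable[[str], List[float]]) -> float:
    """Return the average negative log-likelihood of `text`.

    Lower values (the model is unusually confident) suggest the text may
    have appeared in the training data.
    """
    logprobs = get_token_logprobs(text)
    return -sum(logprobs) / max(len(logprobs), 1)

def is_probable_member(text: str,
                       get_token_logprobs: Callable[[str], List[float]],
                       threshold: float = 2.0) -> bool:
    # `threshold` must be calibrated on known member/non-member samples;
    # 2.0 nats/token is an arbitrary placeholder, not a recommended value.
    return membership_score(text, get_token_logprobs) < threshold
```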
2) Secure Fine-Tuning
Fine-tuning can degrade model safety and consistency. The survey summarizes approaches that try to preserve or restore safety under adaptation (e.g., weight-space and activation-level defenses), and calls out evaluation instability across random seeds, prompt templates, and sampling settings.
Key takeaway: Secure fine-tuning is promising, but still lacks robust, standardized evaluation and scalable defenses (especially for API-based tuning).
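For a flavor of what a weight-space defense can look like, here is a minimal sketch (an illustrative simplification, not any specific method from the survey) that interpolates fine-tuned weights back toward the aligned base model. The file paths are hypothetical, and `alpha` controls the trade-off between task performance and restored safety.

```python
# Minimal sketch of a weight-space defense: interpolate fine-tuned weights
# back toward the aligned base model. Real methods are more targeted
# (e.g., selective or projection-based merging); this shows the core idea.
# Assumes both state dicts share the same keys and tensor shapes.
import torch

def realign(base_sd: dict, ft_sd: dict, alpha: float = 0.7) -> dict:
    """Blend fine-tuned weights with the base model.

    alpha=1.0 keeps the fine-tuned weights; alpha=0.0 fully reverts to the
    base model. Intermediate values trade task performance for safety.
    """
    return {k: alpha * ft_sd[k] + (1.0 - alpha) * base_sd[k] for k in ft_sd}

# Usage (hypothetical paths):
# base = torch.load("base_model.pt")
# ft = torch.load("finetuned_model.pt")
# merged = realign(base, ft, alpha=0.7)
```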
3) Multi-Agent Systems (MAS) Security
Agentic workflows expand the attack surface via tool use, inter-agent communication, and orchestration layers. The survey discusses emerging directions such as the "Internet of Agents" perspective, governance frameworks, and infrastructure proposals focused on attribution, interaction shaping, and detection/remediation.
Key takeaway: MAS security is early-stage and needs better benchmarks, clearer threat models, and interoperable infrastructure.
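To illustrate the attribution idea at the message level, the sketch below HMAC-signs inter-agent messages so a recipient can verify provenance before acting on content. This is a toy: a real deployment would use per-agent keys (ideally asymmetric signatures) and proper key management, and `SECRET` is a placeholder.

```python
# Minimal sketch of message attribution in a multi-agent pipeline: each
# agent signs outgoing messages so downstream agents can verify origin
# before acting on the content.
import hashlib
import hmac
import json

SECRET = b"demo-key"  # placeholder; never hard-code keys in practice

def sign_message(sender: str, content: str) -> dict:
    payload = json.dumps({"sender": sender, "content": content},
                         sort_keys=True)
    tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"sender": sender, "content": content, "sig": tag}

def verify_message(msg: dict) -> bool:
    payload = json.dumps({"sender": msg["sender"], "content": msg["content"]},
                         sort_keys=True)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["sig"])
```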
Cross-Layer Patterns & Trade-offs
- Attack innovation outpaces defenses: Many defenses are incremental and can be broken with new prompting strategies or obfuscation.
- Security vs. capability trade-offs: Hardening can hurt usability/accuracy, while aggressive tuning can reintroduce unsafe behaviors.
- Evaluation gaps: Safety metrics can vary widely, making comparisons difficult and weakening confidence in claimed improvements.
What’s Missing (Research Gaps)
- Standardized benchmarks for multi-agent security (e.g., prompt-infection spread, time-to-compromise, impact on task success); see the sketch after this list.
- Formal threat models spanning orchestration, communication, tools, and infrastructure.
- Scalable defenses for real-world constraints (API tuning, limited observability).
- End-to-end security that combines model-level guarantees, robust adaptation, and secure agent infrastructure.
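As referenced above, here is a toy example of the kind of metric a prompt-infection benchmark might report: given a directed agent communication graph and an initially compromised agent, compute each agent's time-to-compromise in message hops. The topology and infection model (every message infects its recipient) are deliberate oversimplifications.

```python
# Toy time-to-compromise metric: BFS over directed edges
# agent -> recipients of its messages, starting from the first
# compromised agent.
from collections import deque

def time_to_compromise(graph: dict, start: str) -> dict:
    """Return hops until a prompt infection reaches each agent."""
    times = {start: 0}
    queue = deque([start])
    while queue:
        agent = queue.popleft()
        for peer in graph.get(agent, []):
            if peer not in times:
                times[peer] = times[agent] + 1
                queue.append(peer)
    return times

# Example topology: a planner fans out to two workers that share a reviewer.
graph = {"planner": ["worker_a", "worker_b"],
         "worker_a": ["reviewer"],
         "worker_b": ["reviewer"]}
print(time_to_compromise(graph, "planner"))
# {'planner': 0, 'worker_a': 1, 'worker_b': 1, 'reviewer': 2}
```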
Credits (Team Contributions)
- Jailbreaking Defenses & Data Privacy: Anshul Sharma
- Secure Fine-Tuning: Sarvesh Chakradeo
- Multi-Agent Systems: Darshan Nere
Suggested Citation
Darshan Nere, Sarvesh Chakradeo, and Anshul Sharma. "A Cross-Layer Survey of Security for Large Language Models: From Jailbreaks to Multi-Agent Systems." NeurIPS 2025.