| 21.07 | Google Research | ACL 2022 | Deduplicating Training Data Makes Language Models Better | Privacy Protection&Deduplication&Memorization |
| 23.08 | Georgia Tech, Intel Labs | arXiv | LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked | Adversarial Attacks&Self Defense&Harmful Content Detection |
| 23.08 | University of Michigan | arXiv | Detecting Language Model Attacks with Perplexity | Adversarial Suffixes&Perplexity&Attack Detection |
| 23.09 | University of Maryland | arXiv | Certifying LLM Safety against Adversarial Prompting | Safety Filter&Adversarial Prompts |
| 23.09 | University of Maryland | arXiv | Baseline Defenses for Adversarial Attacks Against Aligned Language Models | Perplexity&Input Preprocessing&Adversarial Training |
| 23.09 | The Pennsylvania State University | arXiv | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | Alignment-Breaking Attacks&Adversarial Prompts&Jailbreaking Prompts |
| 23.10 | University of Pennsylvania | arXiv | SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Jailbreak&Adversarial Attack&Perturbation |
| 23.10 | Michigan State University | arXiv | Jailbreaker in Jail: Moving Target Defense for Large Language Models | Dialogue System&Trustworthy Machine Learning&Moving Target Defense |
| 23.10 | Peking University | arXiv | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | In-Context Learning&Adversarial Attacks&In-Context Demonstrations |
| 23.11 | University of California Irvine | arXiv | Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield | Adversarial Prompt Shield&Safety Classifier |
| 23.11 | Child Health Evaluative Sciences | arXiv | Pyclipse, a library for deidentification of free-text clinical notes | Clinical Text Data&Deidentification |
| 23.11 | Tsinghua University | arXiv | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Jailbreaking Attacks&Goal Prioritization&Safety |
| 23.11 | University of Southern California, Harvard University, University of California Davis, University of Wisconsin-Madison | arXiv | Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations | Backdoor Attacks&Defensive Demonstrations&Test-Time Defense |
| 23.11 | University of Maryland College Park | arXiv | Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information | Adversarial Prompt Detection&Perplexity Measures&Token-level Analysis |
| 23.12 | Rensselaer Polytechnic Institute, Northeastern University | arXiv | Combating Adversarial Attacks through a Conscience-Based Alignment Framework | Adversarial Attacks&Conscience-Based Alignment&Safety |
| 23.12 | Azure Research, Microsoft Security Response Center | arXiv | Maatphor: Automated Variant Analysis for Prompt Injection Attacks | Prompt Injection Attacks&Automated Variant Analysis |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arXiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | UC Berkeley, King Abdulaziz City for Science and Technology | arXiv | Jatmo: Prompt Injection Defense by Task-Specific Finetuning | Prompt Injection&LLM Security |
| 24.01 | Arizona State University | arXiv | The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness | Safety&Over-Defensiveness&Defense Strategies |
| 24.01 | Logistics and Supply Chain MultiTech R&D Centre (LSCM) | arXiv | Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants | Preconditioning&Cyber Security |
| 24.01 | The Hong Kong University of Science and Technology, University of Illinois at Urbana-Champaign, The Hong Kong Polytechnic University | arXiv | MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance | Multimodal Large Language Models (MLLMs)&Safety&Malicious Attacks |
| 24.01 | Carnegie Mellon University | arXiv | TOFU: A Task of Fictitious Unlearning for LLMs | Data Privacy&Ethical Concerns&Unlearning |
| 24.01 | Wuhan University, The University of Sydney | arXiv | Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender | Intention Analysis&Jailbreak Defense&Safety |
| 24.01 | The Hong Kong Polytechnic University | arXiv | Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications | AI Security&Prompt Injection Attacks |
| 24.01 | University of Illinois at Urbana-Champaign, University of Chicago | arXiv | Robust Prompt Optimization for Defending Large Language Models Against Jailbreaking Attacks | AI Alignment&Jailbreaking&Robust Prompt Optimization |
| 24.02 | Arizona State University | arXiv | Adversarial Text Purification: A Large Language Model Approach for Defense | Textual Adversarial Defenses&Adversarial Purification |
| 24.02 | Peking University, Wuhan University | arXiv | Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Jailbreaking Attacks&Prompt Adversarial Tuning&Defense Mechanisms |
| 24.02 | University of Washington, The Pennsylvania State University, Allen Institute for AI | arXiv | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Jailbreak Attacks&Safety-Aware Decoding |
| 24.02 | Shanghai Artificial Intelligence Laboratory | arXiv | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | LLM Conversation Safety&Attacks&Defenses |
| 24.02 | University of Notre Dame, INRIA, King Abdullah University of Science and Technology | arXiv | Defending Jailbreak Prompts via In-Context Adversarial Game | Adversarial Training&Jailbreak Defense |
| 24.02 | University of New South Wales (Australia), Delft University of Technology (The Netherlands), Nanyang Technological University (Singapore) | arXiv | LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study | Jailbreak Attacks&Defense Techniques |
| 24.02 | The Hong Kong University of Science and Technology, Duke University | arXiv | GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis | Safety-Critical Gradient Analysis&Unsafe Prompt Detection |
| 24.02 | The University of Melbourne | arXiv | Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Social-Engineered Attacks&Round Trip Translation |
| 24.02 | Nanyang Technological University | arXiv | LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Jailbreaking Attacks&Self-Defense |
| 24.02 | Ajou University | arXiv | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Jailbreak Attacks&Self-Refinement |
| 24.02 | UCLA | arXiv | Defending LLMs against Jailbreaking Attacks via Backtranslation | Backtranslation&Jailbreaking Attacks |
| 24.02 | University of California Santa Barbara | arXiv | Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Semantic Smoothing&Jailbreak Attacks |
| 24.02 | University of Wisconsin-Madison | arXiv | Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment | Fine-tuning Attack&Backdoor Alignment&Safety Examples |
| 24.02 | University of Exeter | arXiv | Is the System Message Really Important to Jailbreaks in Large Language Models? | Jailbreak&System Messages |
| 24.03 | The Chinese University of Hong Kong, IBM Research | arXiv | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Jailbreak Attacks&Refusal Loss&Gradient Cuff |
| 24.03 | Oregon State University, Pennsylvania State University, CISPA Helmholtz Center for Information Security | arXiv | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | Defense Mechanisms&Jailbreak Attacks |
| 24.03 | Peking University, University of Wisconsin-Madison, International Digital Economy Academy, University of California Davis | arXiv | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | Multimodal Large Language Models Safety&Defense Strategy |
| 24.03 | Southern University of Science and Technology, Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab, Peng Cheng Laboratory | arXiv | Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Multimodal LLMs&Safety |
| 24.03 | University of Illinois at Urbana-Champaign, Virginia Tech, Salesforce Research, University of California Berkeley, University of Chicago | arXiv | RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Biases&Harmful Content&Resilient Guardrails |
| 24.03 | Microsoft | arXiv | Defending Against Indirect Prompt Injection Attacks With Spotlighting | Indirect Prompt Injection&Spotlighting |
| 24.04 | South China University of Technology, Pazhou Laboratory | arXiv | Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Jailbreaking&Unlearning |
| 24.04 | Zhejiang University, Johns Hopkins University | arXiv | SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models | Text-to-Image Models&Unsafe Content&Content Mitigation |
| 24.04 | Hong Kong University of Science and Technology, University of Oxford | arXiv | Latent Guard: A Safety Framework for Text-to-Image Generation | Text-to-Image Models&Safety Framework&Latent Guard |