20.12 |
Google |
USENIX Security 2021 |
Extracting Training Data from Large Language Models |
Verbatim Text Sequences&Rank Likelihood |
22.11 |
AE Studio |
NeurIPS 2022 (ML Safety Workshop) |
Ignore Previous Prompt: Attack Techniques For Language Models |
Prompt Injection&Misalignment |
23.02 |
Saarland University |
arxiv |
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection |
Adversarial Prompting&Indirect Prompt Injection&LLM-Integrated Applications |
23.04 |
Hong Kong University of Science and Technology |
EMNLP2023(findings) |
Multi-step Jailbreaking Privacy Attacks on ChatGPT |
Privacy&Jailbreaks |
23.04 |
University of Michigan, Arizona State University, NVIDIA |
NAACL2024 |
ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger |
Textual Backdoor Attack&Blackbox Generative Model&Trigger Detection |
23.05 |
Jinan University, Hong Kong University of Science and Technology, Nanyang Technological University, Zhejiang University |
EMNLP 2023 |
Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models |
Backdoor Attacks |
23.05 |
Nanyang Technological University, University of New South Wales, Virginia Tech |
arXiv |
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study |
Jailbreak&Prompt Engineering |
23.06 |
Princeton University |
AAAI 2024 |
Visual Adversarial Examples Jailbreak Aligned Large Language Models |
Visual Language Models&Adversarial Attacks&AI Alignment |
23.06 |
Nanyang Technological University, University of New South Wales, Huazhong University of Science and Technology, Southern University of Science and Technology, Tianjin University |
arxiv |
Prompt Injection attack against LLM-integrated Applications |
LLM-integrated Applications&Security Risks&Prompt Injection Attacks |
23.06 |
Google |
arxiv |
Are aligned neural networks adversarially aligned? |
Multimodal&Jailbreak |
23.07 |
CMU |
arxiv |
Universal and Transferable Adversarial Attacks on Aligned Language Models |
Jailbreak&Transferable Attack&Adversarial Attack |
23.07 |
Language Technologies Institute Carnegie Mellon University |
arXiv |
Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success |
Prompt Extraction&Attack Success Measurement&Defensive Strategies |
23.07 |
Nanyang Technological University |
NDSS 2024 |
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots |
Jailbreak&Reverse-Engineering&Automatic Generation |
23.07 |
Cornell Tech |
arxiv |
Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs |
Multi-Modal LLMs&Indirect Instruction Injection&Adversarial Perturbations |
23.07 |
UNC Chapel Hill, Google DeepMind, ETH Zurich |
AdvML Frontiers Workshop 2023 |
Backdoor Attacks for In-Context Learning with Language Models |
Backdoor Attacks&In-Context Learning |
23.07 |
Google DeepMind |
arXiv |
A LLM Assisted Exploitation of AI-Guardian |
Adversarial Machine Learning&AI-Guardian&Defense Robustness |
23.08 |
CISPA Helmholtz Center for Information Security, NetApp |
arxiv |
“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models |
Jailbreak Prompts&Adversarial Prompts&Proactive Detection |
23.09 |
Ben-Gurion University, DeepKeep |
arxiv |
Open Sesame! Universal Black Box Jailbreaking of Large Language Models |
Genetic Algorithm&Adversarial Prompt&Black Box Jailbreak |
23.10 |
Princeton University, Virginia Tech, IBM Research, Stanford University |
arxiv |
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! |
Fine-tuning&Safety Risks&Adversarial Training |
23.10 |
University of California Santa Barbara, Fudan University, Shanghai AI Laboratory |
arxiv |
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models |
AI Safety&Malicious Use&Fine-tuning |
23.10 |
Peking University |
arxiv |
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations |
In-Context Learning&Adversarial Attacks&In-Context Demonstrations |
23.10 |
University of Pennsylvania |
arxiv |
Jailbreaking Black Box Large Language Models in Twenty Queries |
Prompt Automatic Iterative Refinement&Jailbreak |
23.10 |
University of Maryland College Park, Adobe Research |
arxiv |
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models |
Adversarial Attacks&Interpretabilty&Jailbreaking |
23.11 |
MBZUAI |
arxiv |
Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks |
Adversarially-synthesized Texts&Word-level Attacks&Evaluation |
23.11 |
Palisade Research |
arxiv |
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B |
Remove Safety Fine-tuning |
23.11 |
University of Twente |
ICNLSP 2023 |
Efficient Black-Box Adversarial Attacks on Neural Text Detectors |
Misclassification&Adversarial attacks |
23.11 |
PRISM AI, Harmony Intelligence, Leap Laboratories |
arxiv |
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation |
Persona-modulation Attacks&Jailbreaks&Automated Prompt |
23.11 |
Tsinghua University |
arxiv |
Jailbreaking Large Vision-Language Models via Typographic Visual Prompts |
Typographic Attack&Multi-modal&Safety Evaluation |
23.11 |
Huazhong University of Science and Technology, Tsinghua University |
arxiv |
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration |
Membership Inference Attacks&Privacy and Security |
23.11 |
Google DeepMind |
arxiv |
Frontier Language Models Are Not Robust to Adversarial Arithmetic or "What Do I Need To Say So You Agree 2+2=5?" |
Adversarial Arithmetic&Model Robustness&Adversarial Attacks |
23.11 |
University of Illinois Chicago, Texas A&M University |
arxiv |
DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Pre-trained Language Models |
Adversarial Attack&Distribution-Aware&LoRA-Based Attack |
23.11 |
Illinois Institute of Technology |
arxiv |
Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment |
Backdoor Activation Attack&Large Language Models&AI Safety&Activation Steering&Trojan Steering Vectors |
23.11 |
Wayne State University |
arXiv |
Hijacking Large Language Models via Adversarial In-Context Learning |
Adversarial Attacks&Gradient-Based Prompt Search&Adversarial Suffixes |
23.11 |
Hong Kong Baptist University, Shanghai Jiao Tong University, Shanghai AI Laboratory, The University of Sydney |
arXiv |
DeepInception: Hypnotize Large Language Model to Be Jailbreaker |
Jailbreak&DeepInception |
23.11 |
Xi’an Jiaotong-Liverpool University |
arxiv |
Generating Valid and Natural Adversarial Examples with Large Language Models |
Adversarial examples&Text classification |
23.11 |
Michigan State University |
arxiv |
Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems |
Transferable Attacks&AI Systems&Adversarial Attacks |
23.11 |
Tsinghua University, Kuaishou Technology |
arxiv |
Evil Geniuses: Delving into the Safety of LLM-based Agents |
LLM-based Agents&Safety&Malicious Attacks |
23.11 |
Cornell University |
arxiv |
Language Model Inversion |
Model Inversion&Prompt Reconstruction&Privacy |
23.11 |
ETH Zurich |
arxiv |
Universal Jailbreak Backdoors from Poisoned Human Feedback |
RLHF&Backdoor Attacks |
23.11 |
UC Santa Cruz, UNC-Chapel Hill |
arxiv |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs |
Vision Large Language Models&Safety Evaluation&Adversarial Robustness |
23.11 |
Texas Tech University |
arxiv |
Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles |
Social Engineering&Security&Prompt Engineering |
23.11 |
Johns Hopkins University |
arxiv |
Instruct2Attack: Language-Guided Semantic Adversarial Attacks |
Language-guided Attacks&Latent Diffusion Models&Adversarial Attack |
23.11 |
Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, ETH Zurich |
arxiv |
Scalable Extraction of Training Data from (Production) Language Models |
Extractable Memorization&Data Extraction&Adversary Attacks |
23.11 |
University of Maryland, Mila, Towards AI, Stanford, Technical University of Sofia, University of Milan, NYU |
arxiv |
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition |
Prompt Hacking&Security Threats |
23.11 |
University of Washington, UIUC, Pennsylvania State University, University of Chicago |
arxiv |
Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications |
LLM-Integrated Applications&Attack Surfaces |
23.11 |
Jinan University, Guangzhou Xuanyuan Research Institute Co. Ltd., The Hong Kong Polytechnic University |
arxiv |
TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4 |
Prompt-based Learning&Backdoor Attack |
23.11 |
Nanjing University, Meituan Inc. |
NAACL2024 |
A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily |
Jailbreak Prompts&LLM Security&Automated Framework |
23.11 |
University of Southern California |
NAACL2024(findings) |
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking |
Jailbreaking&Large Language Models&Cognitive Overload |
23.12 |
The Pennsylvania State University |
arxiv |
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections |
Backdoor Injection&Safety Alignment |
23.12 |
Drexel University |
arXiv |
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly |
Security&Privacy&Attacks |
23.12 |
Yale University, Robust Intelligence |
arXiv |
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically |
Tree of Attacks with Pruning (TAP)&Jailbreaking&Prompt Generation |
23.12 |
Independent (Now at Google DeepMind) |
arXiv |
Scaling Laws for Adversarial Attacks on Language Model Activations |
Adversarial Attacks&Language Model Activations&Scaling Laws |
23.12 |
Harbin Institute of Technology |
arxiv |
Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak |
Jailbreak Attack&Inherent Response Tendency&Affirmation Tendency |
23.12 |
University of Wisconsin-Madison |
arxiv |
DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions |
Code Generation&Adversarial Attacks&Cybersecurity |
23.12 |
Carnegie Mellon University, IBM Research |
arxiv |
Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks |
Data Poisoning Attacks&Natural Language Generation&Cybersecurity |
23.12 |
Purdue University |
NeurIPS 2023 (Workshop) |
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs |
Knowledge Extraction&Interrogation Techniques&Cybersecurity |
23.12 |
Sungkyunkwan University, University of Tennessee |
arXiv |
Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models |
Poisoning Attacks&Software Development |
23.12 |
North Carolina State University, New York University, Stanford University |
arXiv |
Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning |
Federated Learning&Privacy Attacks |
23.12 |
Korea Advanced Institute of Science and Technology (KAIST), Graduate School of AI |
arxiv |
Hijacking Context in Large Multi-modal Models |
Large Multi-modal Models&Context Hijacking |
23.12 |
Xi’an Jiaotong University, Nanyang Technological University, Singapore Management University |
arXiv |
A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection |
Jailbreaking Detection&Multi-Modal |
23.12 |
Logistics and Supply Chain MultiTech R&D Centre (LSCM) |
UbiSec-2023 |
A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models |
Cybersecurity Attacks&Defense Strategies |
23.12 |
University of Illinois Urbana-Champaign, VMware Research |
arXiv |
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks |
Safety Training&Priming Attacks |
23.12 |
Delft University of Technology |
ICSE 2024 |
Traces of Memorisation in Large Language Models for Code |
Code Memorisation&Data Extraction Attacks |
23.12 |
University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft |
arxiv |
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models |
Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense |
23.12 |
Nanjing University of Aeronautics and Astronautics |
NLPCC2023 |
Punctuation Matters! Stealthy Backdoor Attack for Language Models |
Backdoor Attack&PuncAttack&Stealthiness |
23.12 |
FAR AI, McGill University, MILA, Jagiellonian University |
arXiv |
Exploiting Novel GPT-4 APIs |
Fine-Tuning&Knowledge Retrieval&Security Vulnerabilities |
23.12 |
EPFL |
|
Adversarial Attacks on GPT-4 via Simple Random Search |
Adversarial Attacks&Random Search&Jailbreak |
24.01 |
Logistics and Supply Chain MultiTech R&D Centre (LSCM) |
CSDE2023 |
A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models |
Evaluation&Prompt Injection&Cyber Security |
24.01 |
University of Southern California |
arxiv |
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance |
Prompt Engineering&Text Classification&Jailbreaks |
24.01 |
Virginia Tech, Renmin University of China, UC Davis, Stanford University |
arxiv |
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs |
AI Safety&Persuasion Adversarial Prompts&Jailbreak |
24.01 |
Anthropic, Redwood Research, Mila Quebec AI Institute, University of Oxford |
arxiv |
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training |
Deceptive Behavior&Safety Training&Backdoored Behavior&Adversarial Training |
24.01 |
Jinan University, Nanyang Technological University, Beijing Institute of Technology, Pazhou Lab |
arxiv |
Universal Vulnerabilities in Large Language Models: In-Context Learning Backdoor Attacks |
In-context Learning&Security&Backdoor Attacks |
24.01 |
Carnegie Mellon University |
arxiv |
Combating Adversarial Attacks with Multi-Agent Debate |
Adversarial Attacks&Multi-Agent Debate&Red Team |
24.01 |
Fudan University |
arxiv |
Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering |
LLM Security&Representation Engineering |
24.01 |
Northwestern University, New York University, University of Liverpool, Rutgers University |
arxiv |
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models |
Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset |
24.01 |
Kyushu Institute of Technology |
arxiv |
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks |
Jailbreak Attacks&Black-box Method |
24.01 |
MIT |
arXiv |
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning |
Jailbreaking&Model Safety |
24.01 |
Aalborg University |
arxiv |
Text Embedding Inversion Attacks on Multilingual Language Models |
Text Embedding&Inversion Attacks&Multilingual Language Models |
24.01 |
University of Illinois Urbana-Champaign, University of Washington, Western Washington University |
arxiv |
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models |
Chain-of-Thought Prompting&Backdoor Attacks |
24.01 |
The University of Hong Kong, Zhejiang University |
arxiv |
Red Teaming Visual Language Models |
Vision-Language Models&Red Teaming |
24.01 |
University of California Santa Barbara, Sea AI Lab Singapore, Carnegie Mellon University |
arxiv |
Weak-to-Strong Jailbreaking on Large Language Models |
Jailbreaking&Adversarial Prompts&AI Safety |
24.02 |
Boston University |
arxiv |
Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks |
Large Vision-Language Models&Typographic Attacks&Self-Generated Attacks |
24.02 |
Copenhagen Business School, Temple University |
arxiv |
An Early Categorization of Prompt Injection Attacks on Large Language Models |
Prompt Injection&Categorization |
24.02 |
Michigan State University, Okinawa Institute of Science and Technology (OIST) |
arxiv |
Data Poisoning for In-context Learning |
In-context learning&Data poisoning&Security |
24.02 |
CISPA Helmholtz Center for Information Security |
arxiv |
Conversation Reconstruction Attack Against GPT Models |
Conversation Reconstruction Attack&Privacy risks&Security |
24.02 |
University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft |
arxiv |
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal |
Automated Red Teaming&Robust Refusal |
24.02 |
University of Washington, University of Virginia, Allen Institute for Artificial Intelligence |
arxiv |
Do Membership Inference Attacks Work on Large Language Models? |
Membership Inference Attacks&Privacy&Security |
24.02 |
Pennsylvania State University, Wuhan University, Illinois Institute of Technology |
arxiv |
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models |
Knowledge Poisoning Attacks&Retrieval-Augmented Generation |
24.02 |
Purdue University, University of Massachusetts at Amherst |
arxiv |
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia |
Jailbreaking LLM&Optimization |
24.02 |
CISPA Helmholtz Center for Information Security |
arxiv |
Comprehensive Assessment of Jailbreak Attacks Against LLMs |
Jailbreak Attacks&Attack Methods&Policy Alignment |
24.02 |
UC Berkeley |
arxiv |
StruQ: Defending Against Prompt Injection with Structured Queries |
Prompt Injection Attacks&Structured Queries&Defense Mechanisms |
24.02 |
Nanyang Technological University, Huazhong University of Science and Technology, University of New South Wales |
arxiv |
PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning |
Jailbreak Attacks&Retrieval Augmented Generation (RAG) |
24.02 |
Sea AI Lab, Southern University of Science and Technology |
arxiv |
Test-Time Backdoor Attacks on Multimodal Large Language Models |
Backdoor Attacks&Multimodal Large Language Models (MLLMs)&Adversarial Test Images |
24.02 |
University of Illinois at Urbana–Champaign, University of California, San Diego, Allen Institute for AI |
arxiv |
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability |
Jailbreaks&Controllable Attack Generation |
24.02 |
ISCAS, NTU |
arxiv |
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues |
Jailbreak Attacks&Indirect Attack&Puzzler |
24.02 |
École Polytechnique Fédérale de Lausanne, University of Wisconsin-Madison |
arxiv |
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks |
Jailbreaking Attacks&Contextual Interaction&Multi-Round Interactions |
24.02 |
University of Electronic Science and Technology of China, CISPA Helmholtz Center for Information Security, NetApp |
arxiv |
Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization |
Customization&Instruction Backdoor Attacks&GPTs |
24.02 |
Shanghai Artificial Intelligence Laboratory |
arxiv |
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey |
LLM Conversation Safety&Attacks&Defenses |
24.02 |
UC Berkeley, New York University |
arxiv |
PAL: Proxy-Guided Black-Box Attack on Large Language Models |
Black-Box Attack&Proxy-Guided Attack&PAL |
24.02 |
Center for Human-Compatible AI, UC Berkeley |
arxiv |
A StrongREJECT for Empty Jailbreaks |
Jailbreaks&Benchmarking&StrongREJECT |
24.02 |
Arizona State University |
arxiv |
Jailbreaking Proprietary Large Language Models using Word Substitution Cipher |
Jailbreak&Word Substitution Cipher&Attack Success Rate |
24.02 |
Renmin University of China, Peking University, WeChat AI |
arxiv |
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents |
Backdoor Attacks&Agent Safety&Framework |
24.02 |
University of Washington, UIUC, Western Washington University, University of Chicago |
arxiv |
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs |
ASCII Art&Jailbreak Attacks&Safety Alignment |
24.02 |
Jinan University, Nanyang Technological University, Zhejiang University, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sony Research |
arxiv |
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning |
Weight-Poisoning Backdoor Attacks&Parameter-Efficient Fine-Tuning (PEFT)&Poisoned Sample Identification Module (PSIM) |
24.02 |
CISPA Helmholtz Center for Information Security |
arxiv |
Prompt Stealing Attacks Against Large Language Models |
Prompt Engineering&Security |
24.02 |
University of New South Wales (Australia), Delft University of Technology (The Netherlands), Nanyang Technological University (Singapore) |
arxiv |
LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study |
Jailbreak Attacks&Defense Techniques |
24.02 |
Wayne State University, University of Michigan-Flint |
arxiv |
Learning to Poison Large Language Models During Instruction Tuning |
Data Poisoning&Backdoor Attacks |
24.02 |
Nanyang Technological University, Zhejiang University, The Chinese University of Hong Kong |
arxiv |
Backdoor Attacks on Dense Passage Retrievers for Disseminating Misinformation |
Dense Passage Retrieval&Backdoor Attacks&Misinformation |
24.02 |
University of Michigan |
arxiv |
PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails |
Universal Adversarial Prefixes&Guard Models |
24.02 |
Meta |
arxiv |
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts |
Adversarial Prompts&Quality-Diversity&Safety |
24.02 |
Fudan University |
arxiv |
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models |
Personalized Encryption&Safety Mechanisms |
24.02 |
Carnegie Mellon University |
arxiv |
Attacking LLM Watermarks by Exploiting Their Strengths |
LLM Watermarks&Adversarial Attacks |
24.02 |
Beihang University |
arxiv |
From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings |
Adversarial Suffix&Text Embedding Translation |
24.02 |
University of Maryland College Park |
arxiv |
Fast Adversarial Attacks on Language Models In One GPU Minute |
Adversarial Attacks&BEAST&Computational Efficiency |
24.02 |
Beijing University of Posts and Telecommunications, University of Michigan |
arxiv |
Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue |
Multi-turn Dialogue&Safety Vulnerability |
24.02 |
University of California, The Hong Kong University of Science and Technology, University of Maryland |
arxiv |
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers |
Jailbreaking Attacks&Prompt Decomposition |
24.02 |
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab |
arxiv |
Curiosity-Driven Red-Teaming for Large Language Models |
Curiosity-Driven Exploration&Red Teaming |
24.02 |
SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Tsinghua University, RealAI |
arxiv |
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction |
Jailbreaking&Large Language Models&Adversarial Attacks |
24.03 |
Rice University, Samsung Electronics America |
arxiv |
LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario |
Low-Rank Adaptation (LoRA)&Backdoor Attacks&Model Security |
24.03 |
The University of Hong Kong |
arxiv |
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image |
Vision-Language Models&Data Poisoning&Jailbreaking Attack |
24.03 |
SPRING Lab EPFL |
arxiv |
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks |
Prompt Injection Attacks&Optimization-Based Approach&Security |
24.03 |
Shanghai University of Finance and Economics, Southern University of Science and Technology |
arxiv |
Tastle: Distract Large Language Models for Automatic Jailbreak Attack |
Jailbreak Attack&Black-box Framework |
24.03 |
Google DeepMind, ETH Zurich, University of Washington, OpenAI, McGill University |
arxiv |
Stealing Part of a Production Language Model |
Model Stealing&Language Models&Security |
24.03 |
University of Edinburgh |
arxiv |
Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks |
Prompt Injection Attacks&Machine Translation&Inverse Scaling |
24.03 |
Nanyang Technological University |
arxiv |
BadEdit: Backdooring Large Language Models by Model Editing |
Backdoor Attacks&Model Editing&Security |
24.03 |
Fudan University, Shanghai AI Laboratory |
arxiv |
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models |
Jailbreak Attacks&Security&Framework |
24.03 |
Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai Engineering Research Center of AI & Robotics |
arxiv |
Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction |
Vision-Language Pre-trained Model&Adversarial Transferability&Black-Box Attack |
24.03 |
Microsoft |
arxiv |
Securing Large Language Models: Threats, Vulnerabilities, and Responsible Practices |
Security Risks&Vulnerabilities |
24.03 |
Carnegie Mellon University |
arxiv |
Jailbreaking is Best Solved by Definition |
Jailbreak Attacks&Adaptive Attacks |
24.03 |
Huazhong University of Science and Technology, Lehigh University, University of Notre Dame, Duke University |
arxiv |
Optimization-based Prompt Injection Attack to LLM-as-a-Judge |
Prompt Injection Attack&LLM-as-a-Judge&Optimization |
24.03 |
Washington University in St. Louis, University of Wisconsin-Madison, John Burroughs School |
USENIX Security 2024 |
Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models |
Jailbreak Prompts&Security |
24.03 |
School of Information Science and Technology, ShanghaiTech University |
NAACL2024 |
LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models |
Prompt-based Language Models&Universal Adversarial Triggers&Natural Language Attacks |
24.04 |
University of Pennsylvania, ETH Zurich, EPFL, Sony AI |
arxiv |
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models |
Jailbreaking Attacks&Robustness Benchmark |
24.04 |
Microsoft Azure, Microsoft, Microsoft Research |
arxiv |
The Crescendo Multi-Turn LLM Jailbreak Attack |
Jailbreak Attacks&Multi-Turn Interaction |
24.04 |
EPFL |
arxiv |
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks |
Adaptive Attacks&Jailbreaking |
24.04 |
The Ohio State University, University of Wisconsin-Madison |
arxiv |
JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks |
Multimodal Large Language Models&Jailbreak Attacks&Benchmark |
24.04 |
Enkrypt AI |
arxiv |
Increased LLM Vulnerabilities from Fine-tuning and Quantization |
Fine-tuning&Quantization&LLM Vulnerabilities |
24.04 |
The Pennsylvania State University, Carnegie Mellon University |
arxiv |
Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection |
LLM&Jailbreak&Prompt Injection |
24.04 |
Technical University of Darmstadt, Google Research |
arxiv |
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data |
Reinforcement Learning from Human Feedback&Poisoned Preference Data&Language Model Security |
24.04 |
Purdue University |
arxiv |
Rethinking How to Evaluate Language Model Jailbreak |
Jailbreak&Evaluation Metrics |
24.04 |
Xi’an Jiaotong-Liverpool University, Rutgers University, University of Liverpool |
arxiv |
Goal-guided Generative Prompt Injection Attack on Large Language Models |
Prompt Injection&Robustness&Mahalanobis Distance |
24.04 |
University of New Haven |
arxiv |
Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs |
Multi-language Mixture&Adaptive Attack&LLM Security |
24.04 |
The Ohio State University |
arxiv |
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs |
Adversarial Suffix Generation |
24.04 |
Renmin University of China, Microsoft Research |
arxiv |
Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector |
Safety&Attack Methods |
24.04 |
Zhejiang University |
arxiv |
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models |
Jailbreak Attacks&Visual Analytics |
24.04 |
ShanghaiTech University |
arxiv |
Don’t Say No: Jailbreaking LLM by Suppressing Refusal |
Jailbreaking Attacks&Adversarial Attacks |
24.04 |
University of Electronic Science and Technology of China, Chengdu University of Technology |
arxiv |
Talk Too Much: Poisoning Large Language Models under Token Limit |
Token Limitation&Poisoning Attack |
24.04 |
ETH Zurich, EPFL, University of Twente, Georgia Institute of Technology |
arxiv |
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs |
Aligned LLMs&Universal Jailbreak Backdoors&Poisoning Attacks |
24.04 |
N/A |
arxiv |
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge |
Trojan Detection&Model Robustness |
24.04 |
Shanghai Jiao Tong University |
arxiv |
Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models |
Vision-Large-Language Models&Autonomous Driving&Security |
24.04 |
Max-Planck-Institute for Intelligent Systems, AI at Meta (FAIR) |
arxiv |
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs |
Adversarial Prompting&Safety in AI |
24.04 |
Singapore Management University, Shanghai Institute for Advanced Study of Zhejiang University |
arxiv |
Evaluating and Mitigating Linguistic Discrimination in Large Language Models |
Linguistic Discrimination&Jailbreak&Defense |
24.04 |
University of Louisiana at Lafayette, Beijing Electronic Science and Technology Institute, The Johns Hopkins University |
arxiv |
Assessing Cybersecurity Vulnerabilities in Code Large Language Models |
Code LLMs |
24.04 |
University College London, The University of Melbourne, Macquarie University, University of Edinburgh |
arxiv |
Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning |
Cross-Lingual Transferability&Backdoor Attacks&Instruction Tuning |
24.04 |
University of Cambridge, Indian Institute of Technology Bombay, University of Melbourne, University College London, Macquarie University |
ICLR 2024 Workshop |
Attacks on Third-Party APIs of Large Language Models |
Third-Party API&Security |
24.04 |
Purdue University, Fort Wayne |
NAACL2024 |
VertAttack: Taking Advantage of Text Classifiers’ Horizontal Vision |
Text Classifiers&Adversarial Attacks&VertAttack |
24.04 |
The University of Melbourne, Macquarie University, University College London |
NAACL2024 |
Backdoor Attacks on Multilingual Machine Translation |
Multilingual Machine Translation&Security&Backdoor Attacks |
24.05 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent |
Prompt Jailbreak Attack&Red Team&Black-box Attack |
24.05 |
University of Texas at Austin |
arxiv |
Mitigating Exaggerated Safety in Large Language Models |
Model Safety&Utility&Exaggerated Safety |
24.05 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
Chain of Attack: A Semantic-Driven Contextual Multi-Turn Attacker for LLM |
Multi-Turn Dialogue Attack&LLM Security&Semantic-Driven Contextual Attack |
24.05 |
Peking University |
ICLR 2024 Workshop |
Boosting Jailbreak Attack with Momentum |
Jailbreak Attack&Momentum Method |
24.05 |
École Polytechnique Fédérale de Lausanne |
ICML 2024 |
Revisiting Character-Level Adversarial Attacks |
Character-level Adversarial Attack&Robustness |
24.05 |
Johns Hopkins University |
CCS 2024 |
PLeak: Prompt Leaking Attacks against Large Language Model Applications |
Prompt Leaking Attacks&Adversarial Queries |
24.05 |
IT University of Copenhagen |
arxiv |
Hacc-Man: An Arcade Game for Jailbreaking LLMs |
Creative Problem Solving&Jailbreaking |
24.05 |
The Hong Kong Polytechnic University |
arxiv |
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks |
Fine-tuning Attacks&LLM Safeguarding&Mechanistic Interpretability |
24.05 |
KAIST |
arxiv |
Automatic Jailbreaking of the Text-to-Image Generative AI Systems |
Jailbreaking&Text-to-Image&Generative AI |
24.05 |
Fudan University |
arxiv |
White-box Multimodal Jailbreaks Against Large Vision-Language Models |
Fine-tuning Attacks&Multimodal Models&Adversarial Robustness |
24.05 |
Singapore Management University |
arxiv |
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing |
Jailbreak Attacks&Layer-specific Editing&LLM Safeguarding |
24.05 |
Mila – Québec AI Institute |
arxiv |
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning |
Red-Teaming&Safety Tuning&GFlowNet Fine-tuning |
24.05 |
CISPA Helmholtz Center for Information Security |
arxiv |
Voice Jailbreak Attacks Against GPT-4o |
Jailbreak Attacks&Voice Mode&GPT-4o |
24.05 |
Nanyang Technological University |
arxiv |
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users |
Red-Teaming&Text-to-Image Models&Generative AI Safety |
24.05 |
Xidian University |
arxiv |
Efficient LLM-Jailbreaking by Introducing Visual Modality |
Jailbreaking&Multimodal Models&Visual Modality |
24.05 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
Context Injection Attacks on Large Language Models |
Context Injection Attacks&Misleading Context |
24.05 |
University of Illinois at Urbana-Champaign |
arxiv |
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters |
Jailbreak&Moderation Guardrails&Cipher Characters |
24.05 |
Northeastern University |
arxiv |
Phantom: General Trigger Attacks on Retrieval Augmented Language Generation |
Trigger Attacks&Retrieval Augmented Generation&Poisoning |
24.05 |
Northwestern University |
arxiv |
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens |
Jailbreak Attack&Silent Tokens |
24.05 |
Peking University |
arxiv |
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character |
Jailbreak Attack&MultiModal Large Language Models&Role-playing |
24.05 |
Northwestern University |
arxiv |
Exploring Backdoor Attacks against Large Language Model-based Decision Making |
Backdoor Attacks&Decision Making |
24.05 |
Beihang University |
arxiv |
Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models |
Jailbreak&Multimodal Large Language Models&Medical Contexts |
24.05 |
Harbin Institute of Technology |
arxiv |
Improved Generation of Adversarial Examples Against Safety-aligned LLMs |
Adversarial Examples&Safety-aligned LLMs&Gradient-based Methods |
24.05 |
Nanyang Technological University |
arxiv |
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models |
Jailbreaking&Optimization Techniques |
24.06 |
University of Central Florida |
arxiv |
BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models |
Retrieval-Augmented Generation&Poisoning Attacks |
24.06 |
Zscaler, Inc. |
arxiv |
Exploring Vulnerabilities and Protections in Large Language Models: A Survey |
Prompt Hacking&Adversarial Attacks&Survey |
24.06 |
Singapore Management University |
arxiv |
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses |
Few-Shot Jailbreaking&Aligned Language Models&Adversarial Attacks |
24.06 |
Capgemini Invent, Paris |
arxiv |
QROA: A Black-Box Query-Response Optimization Attack on LLMs |
Query-Response Optimization Attack&Black-Box |
24.06 |
Huazhong University of Science and Technology |
arxiv |
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens |
Jailbreak Attacks&Dependency Analysis |
24.06 |
Beihang University |
arxiv |
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt |
Jailbreak Attacks&Vision Language Models&Bi-Modal Adversarial Prompt |
24.06 |
Zhengzhou University |
ACL 2024 |
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents |
Backdoor Attacks&LLM Agents&Data Poisoning |
24.06 |
Ludwig-Maximilians-University |
arxiv |
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models |
Jailbreak Success&Latent Space Dynamics |
24.06 |
Alibaba Group |
arxiv |
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States |
LLM Safety&Alignment&Jailbreak |
24.06 |
Beihang University |
arxiv |
Unveiling the Safety of GPT-4o: An Empirical Study Using Jailbreak Attacks |
GPT-4o&Jailbreak Attacks&Safety Evaluation |
24.06 |
Nanyang Technological University |
arxiv |
A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures |
Backdoor Attacks&Defenses&Survey |
24.06 |
Anomalee Inc. |
arxiv |
On Trojans in Refined Language Models |
Trojans&Refined Language Models&Data Poisoning |
24.06 |
Purdue University |
arxiv |
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-Guided Search |
Jailbreaking&Deep Reinforcement Learning |
24.06 |
Xidian University |
arxiv |
StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure |
Jailbreak Attacks&StructuralSleight |
24.06 |
Tsinghua University |
arxiv |
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models |
Jailbreak Attempts&Evaluation Toolkit |
24.06 |
The Hong Kong University of Science and Technology (Guangzhou) |
arxiv |
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs |
Jailbreak Attacks&Benchmarking |
24.06 |
Pennsylvania State University |
NAACL 2024 |
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning |
Backdoor Removal&Adversarial Prompt Tuning&Few-shot Learning |
24.06 |
Shanghai Jiao Tong University, Peking University, Shanghai AI Laboratory |
arxiv |
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models |
Federated Instruction Tuning&Safety Attack&Defense |
24.06 |
Michigan State University |
arxiv |
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis |
Jailbreak Attacks&Representation Space Analysis |
24.06 |
Chinese Academy of Sciences |
arxiv |
“Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak |
Jailbreak&Hallucinations&LLMs |
24.06 |
Tsinghua University |
arxiv |
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack |
Knowledge-to-Jailbreak&Jailbreak Attacks&Domain-Specific Safety |
24.06 |
University of Maryland |
arxiv |
Is Poisoning a Real Threat to LLM Alignment? Maybe More So Than You Think |
Poisoning Attacks&Direct Policy Optimization&Reinforcement Learning with Human Feedback |
24.06 |
Carnegie Mellon University |
arxiv |
Jailbreak Paradox: The Achilles’ Heel of LLMs |
Jailbreak Paradox&Security |
24.06 |
Carnegie Mellon University |
arxiv |
Adversarial Attacks on Multimodal Agents |
Adversarial Attacks&Multimodal Agents&Vision-Language Models |
24.06 |
University of Washington, Allen Institute for AI |
arxiv |
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates |
LLM Vulnerabilities&Jailbreak Attacks&Adversarial Training |
24.06 |
University of Notre Dame, Huazhong University of Science and Technology, Tsinghua University, Lehigh University |
arxiv |
ObscurePrompt: Jailbreaking Large Language Models via Obscure Input |
Jailbreaking&Adversarial Attacks&Out-of-Distribution Data |
24.06 |
The University of Hong Kong, Huawei Noah’s Ark Lab |
arxiv |
Jailbreaking as a Reward Misspecification Problem |
Jailbreaking&Reward Misspecification&Adversarial Attacks |
24.06 |
UC Berkeley |
arxiv |
Adversaries Can Misuse Combinations of Safe Models |
Model Misuse&AI Safety&Task Decomposition |
24.06 |
UC Santa Barbara |
arxiv |
MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations |
MultiAgent Collaboration&Adversarial Attacks |
24.06 |
University of Southern California |
arxiv |
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking |
Multimodal Jailbreaking&MLLMs&Security |
24.06 |
KAIST |
arxiv |
CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset |
Code-Switching&Red-Teaming&Multilingualism |
24.06 |
China University of Geosciences |
arxiv |
Large Language Models for Link Stealing Attacks Against Graph Neural Networks |
Link Stealing Attacks&Graph Neural Networks&Privacy Attacks |
24.06 |
The Hong Kong Polytechnic University |
arxiv |
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference |
Safety Evaluation&Dialogue Coreference&LLM Safety |
24.06 |
Imperial College London |
arxiv |
Inherent Challenges of Post-Hoc Membership Inference for Large Language Models |
Membership Inference Attacks&Post-Hoc Evaluation&Distribution Shift |
24.06 |
Hubei University |
arxiv |
Poisoned LangChain: Jailbreak LLMs by LangChain |
Jailbreak&Retrieval-Augmented Generation&LangChain |
24.06 |
University of Central Florida |
arxiv |
Jailbreaking LLMs with Arabic Transliteration and Arabizi |
Jailbreaking&Arabic Transliteration&Arabizi |
24.06 |
Hubei University |
TRAC 2024 Workshop |
Seeing Is Believing: Black-Box Membership Inference Attacks Against Retrieval Augmented Generation |
Membership Inference Attacks&Retrieval-Augmented Generation |
24.06 |
Huazhong University of Science and Technology |
arxiv |
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection |
Jailbreak Attacks&Special Tokens |
24.06 |
UC Berkeley |
arxiv |
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation |
AI Safety&Backdoors |
24.07 |
University of Illinois Chicago |
arxiv |
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks |
Jailbreak Attacks&Fallacious Reasoning |
24.07 |
Palisade Research |
arxiv |
Badllama 3: Removing Safety Finetuning from Llama 3 in Minutes |
Safety Finetuning&Jailbreak Attacks |
24.07 |
University of Illinois Urbana-Champaign |
arxiv |
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models |
Jailbreaking&Vision-Language Models |
24.07 |
Shanghai University of Finance and Economics |
arxiv |
SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack |
Jailbreak Attacks&Large Language Models&Social Facilitation |
24.07 |
University of Exeter |
arxiv |
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything |
Machine Learning&ICML&Jailbreak Attacks |
24.07 |
Hong Kong University of Science and Technology |
arxiv |
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets |
Visual Analytics&Jailbreak Prompts |
24.07 |
CISPA Helmholtz Center for Information Security |
arxiv |
SOS! Soft Prompt Attack Against Open-Source Large Language Models |
Soft Prompt Attack&Open-Source Models |
24.07 |
National University of Singapore |
arxiv |
Single Character Perturbations Break LLM Alignment |
Jailbreak Attacks&Model Alignment |
24.07 |
Deutsches Forschungszentrum für Künstliche Intelligenz |
arxiv |
Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning |
Prompt Injection&Jailbreaking&Soft Prompts |
24.07 |
UC Davis |
arxiv |
Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers |
Multi-turn Conversation&Backdoor Triggers&LLM Security |
24.07 |
Tsinghua University |
arxiv |
Jailbreak Attacks and Defenses Against Large Language Models: A Survey |
Jailbreak Attacks&Defenses |
24.07 |
Zhejiang University |
arxiv |
TAPI: Towards Target-Specific and Adversarial Prompt Injection against Code LLMs |
Target-Specific Attacks&Adversarial Prompt Injection&Malicious Code Generation |
24.07 |
Northwestern University |
arxiv |
CEIPA: Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models |
Counterfactual Explanation&Prompt Attack Analysis&Incremental Prompt Injection |
24.07 |
EPFL |
arxiv |
Does Refusal Training in LLMs Generalize to the Past Tense? |
Refusal Training&Past Tense Reformulation&Adversarial Attacks |
24.07 |
University of Chicago |
arxiv |
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases |
Red-teaming&LLM Agents&Poisoning |
24.07 |
Wuhan University |
arxiv |
Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models |
Black-box Attacks&RAG&Opinion Manipulation |
24.07 |
University of New South Wales |
arxiv |
Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models |
Continuous Embedding&Jailbreaking |
24.07 |
Bloomberg |
arxiv |
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) |
Threat Model&Red-Teaming |
24.07 |
Stanford University |
arxiv |
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? |
Universal Image Jailbreaks&Vision-Language Models&Transferability |
24.07 |
Michigan State University |
arxiv |
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis |
Moral Self-Correction&Intrinsic Mechanisms |
24.07 |
Meetyou AI Lab |
arxiv |
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models |
Adversarial Attacks&Hidden Intentions |
24.07 |
Zhejiang Gongshang University |
arxiv |
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models |
Jailbreak Attack&Analyzing-based Jailbreak |
24.07 |
Zhejiang University |
arxiv |
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent |
Red Teaming&Jailbreak Attacks&Context-aware Prompts |
24.07 |
Confirm Labs |
arxiv |
Fluent Student-Teacher Redteaming |
Fluent Student-Teacher Redteaming&Adversarial Attacks |
24.07 |
City University of Hong Kong |
ACM MM 2024 |
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts |
Large Vision Language Model&Red Teaming&Jailbreak Attack |
24.07 |
Huazhong University of Science and Technology |
NAACL 2024 Workshop |
Can Large Language Models Automatically Jailbreak GPT-4V? |
Jailbreak&Multimodal Information&Facial Recognition |
24.07 |
Illinois Institute of Technology |
arxiv |
Can Editing LLMs Inject Harm? |
Knowledge Editing&Misinformation Injection&Bias Injection |
24.07 |
KAIST |
arxiv |
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks |
Adversarial Attack&Vision-Language Model&Contrastive Learning |
24.07 |
CISPA Helmholtz Center for Information Security |
arxiv |
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification |
LLM Agents&Security Vulnerability&Autonomous Systems |
24.08 |
CISPA Helmholtz Center for Information Security |
arxiv |
Vera Verto: Multimodal Hijacking Attack |
Multimodal Hijacking Attack&Model Hijacking |
24.08 |
Shandong University |
arxiv |
Jailbreaking Text-to-Image Models with LLM-Based Agents |
Jailbreak Attacks&Vision-Language Models (VLMs)&Generative AI Safety |
24.08 |
Technological University Dublin |
arxiv |
Pathway to Secure and Trustworthy 6G for LLMs: Attacks, Defense, and Opportunities |
6G Networks&Security&Membership Inference Attacks |
24.08 |
Microsoft |
arxiv |
WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes |
Cross-Prompt Injection Attack&Greedy Coordinate Gradient&Data Exfiltration |
24.08 |
NYU & Meta AI, FAIR |
arxiv |
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs |
Jailbreaking&Reinforcement Learning with Human Feedback |
24.08 |
Beihang University |
arxiv |
Compromising Embodied Agents with Contextual Backdoor Attacks |
Embodied Agents&Contextual Backdoor Attacks&Adversarial In-Context Generation |
24.08 |
FAR AI |
arxiv |
Scaling Laws for Data Poisoning in LLMs |
Data Poisoning&Scaling Laws |
24.08 |
The University of Western Australia |
arxiv |
A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems |
Mobile Robot&Prompt Injection |
24.08 |
Fudan University |
arxiv |
EnJa: Ensemble Jailbreak on Large Language Models |
Jailbreaking&Security |
24.08 |
Bocconi University |
arxiv |
Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models |
Jailbreaking&Multilingual Safety |
24.08 |
Xidian University |
arxiv |
Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles |
Multi-Turn Jailbreak Attack&Contextual Fusion Attack |
24.08 |
Cornell Tech |
arxiv |
A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares |
Jailbroken GenAI Models&PromptWares&GenAI-powered Applications |
24.08 |
University of California Irvine |
arxiv |
Using Retriever Augmented Large Language Models for Attack Graph Generation |
Retriever Augmented Generation&Attack Graphs&Cybersecurity |
24.08 |
University of California, Los Angeles |
CCS 2024 |
BadMerging: Backdoor Attacks Against Model Merging |
Backdoor Attack&Model Merging&AI Security |
24.08 |
Stanford University |
arxiv |
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search |
Black-Box Attacks&Markov Decision Processes&Monte Carlo Tree Search |
24.08 |
Shanghai Jiao Tong University |
arxiv |
Transferring Backdoors between Large Language Models by Knowledge Distillation |
Backdoor Attacks&Knowledge Distillation |
24.08 |
Tsinghua University |
arxiv |
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation |
Safety Response Boundary&Unsafe Decoding Path |
24.08 |
Singapore University of Technology and Design |
arxiv |
FERRET: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique |
Automated Red Teaming&Adversarial Prompts&Reward-Based Scoring |
24.08 |
Shanghai Jiao Tong University |
arxiv |
MEGen: Generative Backdoor in Large Language Models via Model Editing |
Backdoor Attacks&Model Editing |
24.08 |
Chinese Academy of Sciences |
arxiv |
DiffZOO: A Purely Query-Based Black-Box Attack for Red-Teaming Text-to-Image Generative Model via Zeroth Order Optimization |
Black-Box Attack&Text-to-Image Generative Model&Zeroth Order Optimization |
24.08 |
The Pennsylvania State University |
arxiv |
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles |
Jailbreak Attacks&Prompt Injection |
24.08 |
Xi'an Jiaotong University |
arxiv |
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer |
Jailbreak Attacks&Adversarial Suffixes |
24.08 |
Nanjing University of Information Science and Technology |
arxiv |
Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks |
Textual Backdoor Attacks&Sample Selection |
24.08 |
Nankai University |
arxiv |
RT-Attack: Jailbreaking Text-to-Image Models via Random Token |
Jailbreak&Text-to-Image&Adversarial Attacks |
24.08 |
Harbin Institute of Technology, Shenzhen |
arxiv |
TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models |
Adversarial Attack&Transferability&Efficiency |
24.08 |
Shenzhen Research Institute of Big Data |
arxiv |
Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models |
Target-Driven Attacks&Internal Faults&Reinforcement Learning |
24.08 |
National University of Singapore |
arxiv |
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models |
Adversarial Suffixes&Transfer Learning&Jailbreak |
24.08 |
Scale AI |
arxiv |
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet |
Multi-Turn Jailbreaks&LLM Defense&Human Red Teaming |
24.09 |
University of California, Berkeley |
arxiv |
Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks |
Multi-Turn Jailbreak&Frontier Models&LLM Security |
24.09 |
University of Southern California |
arxiv |
Rethinking Backdoor Detection Evaluation for Language Models |
Backdoor Attacks&Detection Robustness&Training Intensity |
24.09 |
Michigan State University |
arxiv |
The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs |
User-Guided Poisoning&RLHF&Toxicity Manipulation |
24.09 |
University of Cambridge |
arxiv |
Conversational Complexity for Assessing Risk in Large Language Models |
Conversational Complexity&Risk Assessment |
24.09 |
Independent |
arxiv |
Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA) |
Single-Turn Crescendo Attack&Adversarial Attacks |
24.09 |
CISPA Helmholtz Center for Information Security |
CCS 2024 |
Membership Inference Attacks Against In-Context Learning |
Membership Inference Attacks&In-Context Learning |
24.09 |
Radboud University, Ikerlan Research Centre |
arxiv |
Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers |
Backdoor Attacks&In-Context Learning&Vision Transformers |
24.09 |
Institute of Information Engineering, Chinese Academy of Sciences |
arxiv |
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs |
Jailbreak Attacks&Adaptive Position Pre-Fill |
24.09 |
Technion - Israel Institute of Technology, Intuit, Cornell Tech |
arxiv |
Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking |
Jailbreaking&RAG Inference&Data Extraction |
24.09 |
University of Texas at San Antonio |
arxiv |
Jailbreaking Large Language Models with Symbolic Mathematics |
Jailbreaking&Symbolic Mathematics |
24.09 |
Beihang University |
arxiv |
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach |
LLM Security Vulnerabilities&Jailbreak Attack&Reinforcement Learning |
24.09 |
AWS AI |
arxiv |
Order of Magnitude Speedups for LLM Membership Inference |
Membership Inference&Quantile Regression |
24.09 |
Nanyang Technological University |
arxiv |
Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs |
Jailbreaking Attacks&Fuzz Testing&LLM Security |
24.09 |
Hippocratic AI |
arxiv |
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking |
Jailbreaking&Multi-Turn Attacks&Concealed Attacks |
24.09 |
Nanyang Technological University |
arxiv |
Weak-to-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation |
Backdoor Attacks&Contrastive Knowledge Distillation&Parameter-Efficient Fine-Tuning |
24.09 |
Georgia Institute of Technology |
arxiv |
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey |
Harmful Fine-tuning&LLM Attacks&LLM Defenses |
24.09 |
Institut Polytechnique de Paris |
arxiv |
Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity |
ASCII Art&LLM Attacks&Toxicity Detection |
24.09 |
LMU Munich |
arxiv |
Multimodal Pragmatic Jailbreak on Text-to-image Models |
Multimodal Pragmatic Jailbreak&Text-to-image Models&Safety Filters |
24.10 |
Stony Brook University |
arxiv |
Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation |
Jailbreaking&LLM Customization&Data Curation |
24.10 |
National University of Singapore |
arxiv |
FlipAttack: Jailbreak LLMs via Flipping |
Jailbreak&Adversarial Attacks |
24.10 |
University of Wisconsin–Madison, NVIDIA |
arxiv |
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs |
Jailbreak&Strategy Self-Exploration |
24.10 |
University College London, Stanford University |
arxiv |
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems |
Prompt Infection&Multi-Agent Systems |
24.10 |
University of Wisconsin–Madison |
arxiv |
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process |
Jailbreak attack&Prompt decomposition&LLMs defense |
24.10 |
UC Santa Cruz, Johns Hopkins University, University of Edinburgh, Peking University |
arxiv |
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation |
Jailbreaking&LLMs vulnerability&Optimization-based attacks |
24.10 |
Independent Researcher |
arxiv |
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations |
LLM red-teaming&Jailbreaking defenses&Prompt engineering |
24.10 |
Beihang University, Tsinghua University, Peking University |
arxiv |
BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models |
Jailbreak&Multi-objective optimization&Black-box attack |
24.10 |
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Beihang University |
arxiv |
Derail Yourself: Multi-Turn LLM Jailbreak Attack Through Self-Discovered Clues |
Multi-turn attacks&Jailbreak&Self-discovered clues |
24.10 |
Tsinghua University, Sea AI Lab, Peng Cheng Laboratory |
arxiv |
Denial-of-Service Poisoning Attacks on Large Language Models |
Denial-of-Service&Poisoning attack |
24.10 |
University of New Haven, Robust Intelligence |
arxiv |
Cognitive Overload Attack: Prompt Injection for Long Context |
Cognitive overload&Prompt injection&Jailbreak |
24.10 |
Harbin Institute of Technology, Tencent, University of Glasgow, Independent Researcher |
arxiv |
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation |
Jailbreak attacks&Adversarial prompt&Gradient-based optimization |
24.10 |
Monash University |
arxiv |
Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models |
Jailbreak&Multi-turn attack&Query splitting |
24.10 |
Wuhan University |
arxiv |
Multi-Round Jailbreak Attack on Large Language Models |
Jailbreak&Multi-round attack |
24.10 |
The Hong Kong University of Science and Technology (Guangzhou), University of Birmingham, Baidu Inc. |
arxiv |
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework |
Jailbreak judge&Multi-agent framework |
24.10 |
Theori Inc. |
ICLR 2025 |
Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems |
Political correctness&Jailbreak&Ethical vulnerabilities |
24.10 |
University of Manitoba |
arxiv |
SoK: Prompt Hacking of Large Language Models |
Prompt Hacking&Jailbreak Attacks |
24.10 |
Thales DIS |
arxiv |
Backdoored Retrievers for Prompt Injection Attacks on Retrieval-Augmented Generation of Large Language Models |
Retrieval-Augmented Generation&Prompt Injection&Backdoor Attacks |
24.10 |
Duke University |
arxiv |
Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment |
Prompt Injection&Poisoning Alignment&LLM Vulnerabilities |
24.10 |
Tsinghua University |
arxiv |
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models |
Jailbreak Attacks&Discrete Optimization&Adversarial Attacks |
24.10 |
International Digital Economy Academy |
arxiv |
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis |
SMILES-Prompting&Jailbreak Attacks&Chemical Safety |
24.10 |
Beijing University of Posts and Telecommunications |
arxiv |
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs |
Attention Mechanisms&Jailbreak Attacks&Defense Strategies |
24.10 |
IBM Research AI |
arxiv |
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In |
ReAct Agents&Prompt Injection&Foot-in-the-Door Attack |
24.10 |
Google DeepMind |
arxiv |
Remote Timing Attacks on Efficient Language Model Inference |
Timing Attacks&Efficient Inference&Privacy |
24.10 |
Meta |
arxiv |
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks |
Fine-Tuning Attacks&Multilingual LLMs&Safety Alignment |
24.10 |
University of California San Diego |
arxiv |
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities |
Jailbreaking&LLM Vulnerabilities&Adversarial Attacks |
24.10 |
Florida State University |
arxiv |
Adversarial Attacks on Large Language Models Using Regularized Relaxation |
Adversarial Attacks&Continuous Optimization |
24.10 |
The Pennsylvania State University |
arxiv |
Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors |
Adversarial Attacks&Detection Evasion |
24.10 |
Nanyang Technological University |
arxiv |
Mask-based Membership Inference Attacks for Retrieval-Augmented Generation |
Retrieval-Augmented Generation&Membership Inference Attacks&Privacy |
24.10 |
George Mason University |
arxiv |
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks |
Prompt Injection Defense&LLM Cybersecurity&Adversarial Inputs |
24.10 |
Fudan University |
arxiv |
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks |
Jailbreak Defense&Vision-Language Models&Reinforcement Learning |
24.10 |
Harbin Institute of Technology |
arxiv |
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring |
Jailbreak Attacks&Adversarial Prompts |
24.10 |
SRM Institute of Science and Technology |
arxiv |
Palisade - Prompt Injection Detection Framework |
Prompt Injection&Heuristic-based Detection |
24.10 |
The Ohio State University |
arxiv |
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts |
Jailbreak Attacks&Adversarial Suffixes&Generative Models |
24.10 |
Zhejiang University |
arxiv |
HijackRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models |
Retrieval-Augmented Generation&Prompt Injection Attacks&Security Vulnerability |
24.10 |
The Baldwin School |
arxiv |
Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures |
Prompt Injection Vulnerabilities&Model Susceptibility |
24.10 |
Competition for LLM and Agent Safety 2024 |
arxiv |
Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models |
Black-box Jailbreak Attacks&Ensemble Methods |
24.10 |
University of Electronic Science and Technology of China |
arxiv |
Pseudo-Conversation Injection for LLM Goal Hijacking |
Goal Hijacking&Prompt Injection |
24.10 |
Monash University |
arxiv |
Audio Is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models |
Audio Multimodal Models&Safety Vulnerabilities&Jailbreak Attacks |
24.11 |
Fudan University |
arxiv |
IDEATOR: Jailbreaking VLMs Using VLMs |
Vision-Language Models&Jailbreak Attack&Multimodal Safety |
24.11 |
International Computer Science Institute |
arxiv |
Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection |
Emoji Attack&Jailbreaking&Judge LLMs Bias |
24.11 |
Peking University |
arxiv |
B4: A Black-Box ScruBBing Attack on LLM Watermarks |
Black-Box Attack&Watermark Removal&Adversarial Text Generation |
24.11 |
University of Science and Technology of China |
arxiv |
SQL Injection Jailbreak: A Structural Disaster of Large Language Models |
SQL Injection&Jailbreak Attack&LLM Vulnerability |
24.11 |
University of Illinois Urbana-Champaign |
arxiv |
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment |
Random Augmentations&Safety Alignment&LLM Jailbreak |
24.11 |
Cambridge ERA: AI Fellowship |
arxiv |
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks |
Jailbreak Prompts&Nonlinear Probes&Adversarial Attacks |
24.11 |
Alibaba Group |
arxiv |
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue |
Multi-Round Dialogue&Jailbreak Agent&LLM Vulnerability |
24.11 |
Columbia University |
arxiv |
Diversity Helps Jailbreak Large Language Models |
Jailbreak Techniques&LLM Safety&Prompt Diversity |
24.11 |
Bangladesh University of Engineering and Technology |
arxiv |
SequentialBreak: Large Language Models Can Be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains |
Jailbreak Attacks&Prompt Engineering&LLM Vulnerabilities |
24.11 |
Xi'an Jiaotong-Liverpool University |
arxiv |
Target-driven Attack for Large Language Models |
Black-box Attacks&Optimization Methods |
24.11 |
Georgia Institute of Technology |
arxiv |
LLM STINGER: Jailbreaking LLMs using RL Fine-tuned LLMs |
Jailbreaking Attacks&Reinforcement Learning&Adversarial Suffixes |
24.11 |
Beijing University of Posts and Telecommunications |
arxiv |
Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey |
Jailbreak Attacks&Multimodal Generative Models&Security Challenges |
24.11 |
Arizona State University |
NeurIPS 2024 SafeGenAI Workshop |
Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models |
Black-box Jailbreaking&Multi-modal Models&Zeroth-order Optimization |
24.11 |
Zhejiang University |
arxiv |
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit |
Jailbreak Attacks&Large Language Models&Mechanism Interpretability |
24.11 |
University of Electronic Science and Technology of China |
arxiv |
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models |
Jailbreaking&Large Vision-Language Models&Safety Snowball Effect |
24.11 |
Tsinghua University |
arxiv |
Playing Language Game with LLMs Leads to Jailbreaking |
Jailbreaking&Language Games&LLM Safety |
24.11 |
University of Texas at Dallas |
arxiv |
AttentionBreaker: Adaptive Evolutionary Optimization for Unmasking Vulnerabilities in LLMs through Bit-Flip Attacks |
Bit-Flip Attacks&Model Vulnerability Optimization |
24.11 |
BITS Pilani |
arxiv |
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs |
Jailbreaking&Latent Bayesian Optimization&Adversarial Prompts |
24.11 |
Nanyang Technological University, Wuhan University |
arxiv |
Neutralizing Backdoors through Information Conflicts for Large Language Models |
Backdoor Defense&Information Conflicts&Model Security |
24.11 |
Duke University, University of Louisville |
arxiv |
LoBAM: LoRA-Based Backdoor Attack on Model Merging |
Model Merging&Backdoor Attack&LoRA |
24.11 |
Université de Sherbrooke, University of Kinshasa |
arxiv |
Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective |
Jailbreak Prompts&Cyber Defense&AI Security |