Jailbreaks&Attack

Different from the main README🕵️

  • Within this subtopic, we keep updating with the latest articles, helping researchers in this area quickly understand recent trends.
  • In addition to the most recent updates, we add keywords to each subtopic so you can find content of interest more quickly.
  • Within each subtopic, we also maintain profiles of scholars we admire and endorse in the field. Their work is often of high quality and forward-looking!

📑Papers

Date Institute Publication Paper Keywords
20.12 Google USENIX Security 2021 Extracting Training Data from Large Language Models Verbatim Text Sequences&Rank Likelihood
22.11 AE Studio NIPS2022(ML Safety Workshop) Ignore Previous Prompt: Attack Techniques For Language Models Prompt Injection&Misaligned
23.02 Saarland University arxiv Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection Adversarial Prompting&Indirect Prompt Injection&LLM-Integrated Applications
23.04 Hong Kong University of Science and Technology EMNLP2023(findings) Multi-step Jailbreaking Privacy Attacks on ChatGPT Privacy&Jailbreaks
23.04 University of Michigan&Arizona State University&NVIDIA NAACL2024 ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger Textual Backdoor Attack&Blackbox Generative Model&Trigger Detection
23.05 Jinan University, Hong Kong University of Science and Technology, Nanyang Technological University, Zhejiang University EMNLP 2023 Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models Backdoor Attacks
23.05 Nanyang Technological University, University of New South Wales, Virginia Tech arXiv Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study Jailbreak&Prompt Engineering
23.06 Princeton University AAAI 2024 Visual Adversarial Examples Jailbreak Aligned Large Language Models Visual Language Models&Adversarial Attacks&AI Alignment
23.06 Nanyang Technological University, University of New South Wales, Huazhong University of Science and Technology, Southern University of Science and Technology, Tianjin University arxiv Prompt Injection attack against LLM-integrated Applications LLM-integrated Applications&Security Risks&Prompt Injection Attacks
23.06 Google arxiv Are aligned neural networks adversarially aligned? Multimodal&Jailbreak
23.07 CMU arxiv Universal and Transferable Adversarial Attacks on Aligned Language Models Jailbreak&Transferable Attack&Adversarial Attack
23.07 Language Technologies Institute Carnegie Mellon University arXiv Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success Prompt Extraction&Attack Success Measurement&Defensive Strategies
23.07 Nanyang Technological University NDSS 2024 MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots Jailbreak&Reverse-Engineering&Automatic Generation
23.07 Cornell Tech arxiv Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs Multi-Modal LLMs&Indirect Instruction Injection&Adversarial Perturbations
23.07 UNC Chapel Hill, Google DeepMind, ETH Zurich AdvML Frontiers Workshop 2023 Backdoor Attacks for In-Context Learning with Language Models Backdoor Attacks&In-Context Learning
23.07 Google DeepMind arXiv A LLM Assisted Exploitation of AI-Guardian Adversarial Machine Learning&AI-Guardian&Defense Robustness
23.08 CISPA Helmholtz Center for Information Security; NetApp arxiv “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models Jailbreak Prompts&Adversarial Prompts&Proactive Detection
23.09 Ben-Gurion University, DeepKeep arxiv OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS Genetic Algorithm&Adversarial Prompt&Black Box Jailbreak
23.10 Princeton University, Virginia Tech, IBM Research, Stanford University arxiv FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY EVEN WHEN USERS DO NOT INTEND TO! Fine-tuning&Safety Risks&Adversarial Training
23.10 University of California Santa Barbara, Fudan University, Shanghai AI Laboratory arxiv SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS AI Safety&Malicious Use&Fine-tuning
23.10 Peking University arxiv Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations In-Context Learning&Adversarial Attacks&In-Context Demonstrations
23.10 University of Pennsylvania arxiv Jailbreaking Black Box Large Language Models in Twenty Queries Prompt Automatic Iterative Refinement&Jailbreak
23.10 University of Maryland College Park, Adobe Research arxiv AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Adversarial Attacks&Interpretability&Jailbreaking
23.11 MBZUAI arxiv Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks Adversarially-synthesized Texts&Word-level Attacks&Evaluation
23.11 Palisade Research arxiv BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B Remove Safety Fine-tuning
23.11 University of Twente ICNLSP 2023 Efficient Black-Box Adversarial Attacks on Neural Text Detectors Misclassification&Adversarial attacks
23.11 PRISM AI, Harmony Intelligence, Leap Laboratories arxiv Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation Persona-modulation Attacks&Jailbreaks&Automated Prompt
23.11 Tsinghua University arxiv Jailbreaking Large Vision-Language Models via Typographic Visual Prompts Typographic Attack&Multi-modal&Safety Evaluation
23.11 Huazhong University of Science and Technology, Tsinghua University arxiv Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration Membership Inference Attacks&Privacy and Security
23.11 Google DeepMind arxiv Frontier Language Models Are Not Robust to Adversarial Arithmetic or "What Do I Need To Say So You Agree 2+2=5?" Adversarial Arithmetic&Model Robustness&Adversarial Attacks
23.11 University of Illinois Chicago, Texas A&M University arxiv DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Pre-trained Language Models Adversarial Attack&Distribution-Aware&LoRA-Based Attack
23.11 Illinois Institute of Technology arxiv Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment Backdoor Activation Attack&Large Language Models&AI Safety&Activation Steering&Trojan Steering Vectors
23.11 Wayne State University arXiv Hijacking Large Language Models via Adversarial In-Context Learning Adversarial Attacks&Gradient-Based Prompt Search&Adversarial Suffixes
23.11 Hong Kong Baptist University, Shanghai Jiao Tong University, Shanghai AI Laboratory, The University of Sydney arXiv DeepInception: Hypnotize Large Language Model to Be Jailbreaker Jailbreak&DeepInception
23.11 Xi’an Jiaotong-Liverpool University arxiv Generating Valid and Natural Adversarial Examples with Large Language Models Adversarial examples&Text classification
23.11 Michigan State University arxiv Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems Transferable Attacks&AI Systems&Adversarial Attacks
23.11 Tsinghua University & Kuaishou Technology arxiv Evil Geniuses: Delving into the Safety of LLM-based Agents LLM-based Agents&Safety&Malicious Attacks
23.11 Cornell University arxiv Language Model Inversion Model Inversion&Prompt Reconstruction&Privacy
23.11 ETH Zurich arxiv Universal Jailbreak Backdoors from Poisoned Human Feedback RLHF&Backdoor Attacks
23.11 UC Santa Cruz, UNC-Chapel Hill arxiv How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Vision Large Language Models&Safety Evaluation&Adversarial Robustness
23.11 Texas Tech University arxiv Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles Social Engineering&Security&Prompt Engineering
23.11 Johns Hopkins University arxiv Instruct2Attack: Language-Guided Semantic Adversarial Attacks Language-guided Attacks&Latent Diffusion Models&Adversarial Attack
23.11 Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, ETH Zurich arxiv Scalable Extraction of Training Data from (Production) Language Models Extractable Memorization&Data Extraction&Adversary Attacks
23.11 University of Maryland, Mila, Towards AI, Stanford, Technical University of Sofia, University of Milan, NYU arxiv Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition Prompt Hacking&Security Threats
23.11 University of Washington, UIUC, Pennsylvania State University, University of Chicago arxiv IDENTIFYING AND MITIGATING VULNERABILITIES IN LLM-INTEGRATED APPLICATIONS LLM-Integrated Applications&Attack Surfaces
23.11 Jinan University, Guangzhou Xuanyuan Research Institute Co. Ltd., The Hong Kong Polytechnic University arxiv TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4 Prompt-based Learning&Backdoor Attack
23.11 Nanjing University&Meituan Inc. NAACL2024 A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily Jailbreak Prompts&LLM Security&Automated Framework
23.11 University of Southern California NAACL2024(findings) Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking Jailbreaking&Large Language Models&Cognitive Overload
23.12 The Pennsylvania State University arxiv Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections Backdoor Injection&Safety Alignment
23.12 Drexel University arXiv A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly Security&Privacy&Attacks
23.12 Yale University, Robust Intelligence arXiv Tree of Attacks: Jailbreaking Black-Box LLMs Automatically Tree of Attacks with Pruning (TAP)&Jailbreaking&Prompt Generation
23.12 Independent (Now at Google DeepMind) arXiv Scaling Laws for Adversarial Attacks on Language Model Activations Adversarial Attacks&Language Model Activations&Scaling Laws
23.12 Harbin Institute of Technology arxiv Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak Jailbreak Attack&Inherent Response Tendency&Affirmation Tendency
23.12 University of Wisconsin-Madison arxiv DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions Code Generation&Adversarial Attacks&Cybersecurity
23.12 Carnegie Mellon University, IBM Research arxiv Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks Data Poisoning Attacks&Natural Language Generation&Cybersecurity
23.12 Purdue University NIPS2023(Workshop) Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs Knowledge Extraction&Interrogation Techniques&Cybersecurity
23.12 Sungkyunkwan University, University of Tennessee arXiv Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models Poisoning Attacks&Software Development
23.12 North Carolina State University, New York University, Stanford University arXiv BEYOND GRADIENT AND PRIORS IN PRIVACY ATTACKS: LEVERAGING POOLER LAYER INPUTS OF LANGUAGE MODELS IN FEDERATED LEARNING Federated Learning&Privacy Attacks
23.12 Korea Advanced Institute of Science and Technology (KAIST), Graduate School of AI arxiv Hijacking Context in Large Multi-modal Models Large Multi-modal Models&Context Hijacking
23.12 Xi’an Jiaotong University, Nanyang Technological University, Singapore Management University arXiv A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection Jailbreaking Detection&Multi-Modal
23.12 Logistic and Supply Chain MultiTech R&D Centre (LSCM) UbiSec-2023 A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models Cybersecurity Attacks&Defense Strategies
23.12 University of Illinois Urbana-Champaign, VMware Research arXiv BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS Safety Training&Priming Attacks
23.12 Delft University of Technology ICSE 2024 Traces of Memorisation in Large Language Models for Code Code Memorisation&Data Extraction Attacks
23.12 University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft arxiv Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense
23.12 Nanjing University of Aeronautics and Astronautics NLPCC2023 Punctuation Matters! Stealthy Backdoor Attack for Language Models Backdoor Attack&PuncAttack&Stealthiness
23.12 FAR AI, McGill University, MILA, Jagiellonian University arXiv Exploiting Novel GPT-4 APIs Fine-Tuning&Knowledge Retrieval&Security Vulnerabilities
23.12 EPFL Adversarial Attacks on GPT-4 via Simple Random Search Adversarial Attacks&Random Search&Jailbreak
24.01 Logistic and Supply Chain MultiTech R&D Centre (LSCM) CSDE2023 A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models Evaluation&Prompt Injection&Cyber Security
24.01 University of Southern California arxiv The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance Prompt Engineering&Text Classification&Jailbreaks
24.01 Virginia Tech, Renmin University of China, UC Davis, Stanford University arxiv How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs AI Safety&Persuasion Adversarial Prompts&Jailbreak
24.01 Anthropic, Redwood Research, Mila Quebec AI Institute, University of Oxford arxiv SLEEPER AGENTS: TRAINING DECEPTIVE LLMS THAT PERSIST THROUGH SAFETY TRAINING Deceptive Behavior&Safety Training&Backdoored Behavior&Adversarial Training
24.01 Jinan University, Nanyang Technological University, Beijing Institute of Technology, Pazhou Lab arxiv UNIVERSAL VULNERABILITIES IN LARGE LANGUAGE MODELS: IN-CONTEXT LEARNING BACKDOOR ATTACKS In-context Learning&Security&Backdoor Attacks
24.01 Carnegie Mellon University arxiv Combating Adversarial Attacks with Multi-Agent Debate Adversarial Attacks&Multi-Agent Debate&Red Team
24.01 Fudan University arxiv Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering LLM Security&Representation Engineering
24.01 Northwestern University, New York University, University of Liverpool, Rutgers University arxiv AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset
24.01 Kyushu Institute of Technology arxiv ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS Jailbreak Attacks&Black-box Method
24.01 MIT arXiv Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning Jailbreaking&Model Safety
24.01 Aalborg University arxiv Text Embedding Inversion Attacks on Multilingual Language Models Text Embedding&Inversion Attacks&Multilingual Language Models
24.01 University of Illinois Urbana-Champaign, University of Washington, Western Washington University arxiv BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS Chain-of-Thought Prompting&Backdoor Attacks
24.01 The University of Hong Kong, Zhejiang University arxiv Red Teaming Visual Language Models Vision-Language Models&Red Teaming
24.01 University of California Santa Barbara,Sea AI Lab Singapore, Carnegie Mellon University arxiv Weak-to-Strong Jailbreaking on Large Language Models Jailbreaking&Adversarial Prompts&AI Safety
24.02 Boston University arxiv Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks Large Vision-Language Models&Typographic Attacks&Self-Generated Attacks
24.02 Copenhagen Business School, Temple University arxiv An Early Categorization of Prompt Injection Attacks on Large Language Models Prompt Injection&Categorization
24.02 Michigan State University, Okinawa Institute of Science and Technology (OIST) arxiv Data Poisoning for In-context Learning In-context learning&Data poisoning&Security
24.02 CISPA Helmholtz Center for Information Security arxiv Conversation Reconstruction Attack Against GPT Models Conversation Reconstruction Attack&Privacy risks&Security
24.02 University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft arxiv HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Automated Red Teaming&Robust Refusal
24.02 University of Washington, University of Virginia, Allen Institute for Artificial Intelligence arxiv Do Membership Inference Attacks Work on Large Language Models? Membership Inference Attacks&Privacy&Security
24.02 Pennsylvania State University, Wuhan University, Illinois Institute of Technology arxiv PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models Knowledge Poisoning Attacks&Retrieval-Augmented Generation
24.02 Purdue University, University of Massachusetts at Amherst arxiv RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA Jailbreaking LLM&Optimization
24.02 CISPA Helmholtz Center for Information Security arxiv Comprehensive Assessment of Jailbreak Attacks Against LLMs Jailbreak Attacks&Attack Methods&Policy Alignment
24.02 UC Berkeley arxiv StruQ: Defending Against Prompt Injection with Structured Queries Prompt Injection Attacks&Structured Queries&Defense Mechanisms
24.02 Nanyang Technological University, Huazhong University of Science and Technology, University of New South Wales arxiv PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning Jailbreak Attacks&Retrieval Augmented Generation (RAG)
24.02 Sea AI Lab, Southern University of Science and Technology arxiv Test-Time Backdoor Attacks on Multimodal Large Language Models Backdoor Attacks&Multimodal Large Language Models (MLLMs)&Adversarial Test Images
24.02 University of Illinois at Urbana–Champaign, University of California, San Diego, Allen Institute for AI arxiv COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability Jailbreaks&Controllable Attack Generation
24.02 ISCAS, NTU arxiv Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues Jailbreak Attacks&Indirect Attack&Puzzler
24.02 École Polytechnique Fédérale de Lausanne, University of Wisconsin-Madison arxiv Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks Jailbreaking Attacks&Contextual Interaction&Multi-Round Interactions
24.02 University of Electronic Science and Technology of China, CISPA Helmholtz Center for Information Security, NetApp arxiv Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization Customization&Instruction Backdoor Attacks&GPTs
24.02 Shanghai Artificial Intelligence Laboratory arxiv Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey LLM Conversation Safety&Attacks&Defenses
24.02 UC Berkeley, New York University arxiv PAL: Proxy-Guided Black-Box Attack on Large Language Models Black-Box Attack&Proxy-Guided Attack&PAL
24.02 Center for Human-Compatible AI, UC Berkeley arxiv A STRONGREJECT for Empty Jailbreaks Jailbreaks&Benchmarking&StrongREJECT
24.02 Arizona State University arxiv Jailbreaking Proprietary Large Language Models using Word Substitution Cipher Jailbreak&Word Substitution Cipher&Attack Success Rate
24.02 Renmin University of China, Beijing, Peking University, WeChat AI arxiv Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents Backdoor Attacks&Agent Safety&Framework
24.02 University of Washington, UIUC, Western Washington University, University of Chicago arxiv ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs ASCII Art&Jailbreak Attacks&Safety Alignment
24.02 Jinan University, Nanyang Technological University, Zhejiang University, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sony Research arxiv Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning Weight-Poisoning Backdoor Attacks&Parameter-Efficient Fine-Tuning (PEFT)&Poisoned Sample Identification Module (PSIM)
24.02 CISPA Helmholtz Center for Information Security arxiv Prompt Stealing Attacks Against Large Language Models Prompt Engineering&Security
24.02 University of New South Wales Australia, Delft University of Technology The Netherlands, Nanyang Technological University Singapore arxiv LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study Jailbreak Attacks&Defense Techniques
24.02 Wayne State University, University of Michigan-Flint arxiv Learning to Poison Large Language Models During Instruction Tuning Data Poisoning&Backdoor Attacks
24.02 Nanyang Technological University, Zhejiang University, The Chinese University of Hong Kong arxiv Backdoor Attacks on Dense Passage Retrievers for Disseminating Misinformation Dense Passage Retrieval&Backdoor Attacks&Misinformation
24.02 University of Michigan arxiv PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails Universal Adversarial Prefixes&Guard Models
24.02 Meta arxiv Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts Adversarial Prompts&Quality-Diversity&Safety
24.02 Fudan University arxiv CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models Personalized Encryption&Safety Mechanisms
24.02 Carnegie Mellon University arxiv Attacking LLM Watermarks by Exploiting Their Strengths LLM Watermarks&Adversarial Attacks
24.02 Beihang University arxiv From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings Adversarial Suffix&Text Embedding Translation
24.02 University of Maryland College Park arxiv Fast Adversarial Attacks on Language Models In One GPU Minute Adversarial Attacks&BEAST&Computational Efficiency
24.02 Beijing University of Posts and Telecommunications, University of Michigan arxiv Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue Multi-turn Dialogue&Safety Vulnerability
24.02 University of California, The Hong Kong University of Science and Technology, University of Maryland arxiv DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers Jailbreaking Attacks&Prompt Decomposition
24.02 Massachusetts Institute of Technology, MIT-IBM Watson AI Lab arxiv CURIOSITY-DRIVEN RED-TEAMING FOR LARGE LANGUAGE MODELS Curiosity-Driven Exploration&Red Teaming
24.02 SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Tsinghua University, RealAI arxiv Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction Jailbreaking&Large Language Models&Adversarial Attacks
24.03 Rice University, Samsung Electronics America arxiv LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario Low-Rank Adaptation (LoRA)&Backdoor Attacks&Model Security
24.03 The University of Hong Kong arxiv ImgTrojan: Jailbreaking Vision-Language Models with ONE Image Vision-Language Models&Data Poisoning&Jailbreaking Attack
24.03 SPRING Lab EPFL arxiv Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks Prompt Injection Attacks&Optimization-Based Approach&Security
24.03 Shanghai University of Finance and Economics, Southern University of Science and Technology arxiv Tastle: Distract Large Language Models for Automatic Jailbreak Attack Jailbreak Attack&Black-box Framework
24.03 Google DeepMind, ETH Zurich, University of Washington, OpenAI, McGill University arxiv Stealing Part of a Production Language Model Model Stealing&Language Models&Security
24.03 University of Edinburgh arxiv Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks Prompt Injection Attacks&Machine Translation&Inverse Scaling
24.03 Nanyang Technological University arxiv BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING Backdoor Attacks&Model Editing&Security
24.03 Fudan University, Shanghai AI Laboratory arxiv EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models Jailbreak Attacks&Security&Framework
24.03 Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai Engineering Research Center of AI & Robotics arxiv Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction Vision-Language Pre-trained Model&Adversarial Transferability&Black-Box Attack
24.03 Microsoft arxiv Securing Large Language Models: Threats, Vulnerabilities, and Responsible Practices Security Risks&Vulnerabilities
24.03 Carnegie Mellon University arxiv Jailbreaking is Best Solved by Definition Jailbreak Attacks&Adaptive Attacks
24.03 Huazhong University of Science and Technology, Lehigh University, University of Notre Dame & Duke University arxiv Optimization-based Prompt Injection Attack to LLM-as-a-Judge Prompt Injection Attack&LLM-as-a-Judge&Optimization
24.03 Washington University in St. Louis, University of Wisconsin - Madison, John Burroughs School USENIX Security 2024 Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models Jailbreak Prompts&Security
24.03 School of Information Science and Technology, ShanghaiTech University NAACL2024 LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models Prompt-based Language Models&Universal Adversarial Triggers&Natural Language Attacks
24.04 University of Pennsylvania, ETH Zurich, EPFL, Sony AI arxiv JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models Jailbreaking Attacks&Robustness Benchmark
24.04 Microsoft Azure, Microsoft, Microsoft Research arxiv The Crescendo Multi-Turn LLM Jailbreak Attack Jailbreak Attacks&Multi-Turn Interaction
24.04 EPFL arxiv Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks Adaptive Attacks&Jailbreaking
24.04 The Ohio State University, University of Wisconsin-Madison arxiv JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks Multimodal Large Language Models&Jailbreak Attacks&Benchmark
24.04 Enkrypt AI arxiv INCREASED LLM VULNERABILITIES FROM FINE-TUNING AND QUANTIZATION Fine-tuning&Quantization&LLM Vulnerabilities
24.04 The Pennsylvania State University, Carnegie Mellon University arxiv Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection LLM&Jailbreak&Prompt Injection
24.04 Technical University of Darmstadt, Google Research arxiv Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data Reinforcement Learning from Human Feedback&Poisoned Preference Data&Language Model Security
24.04 Purdue University arxiv Rethinking How to Evaluate Language Model Jailbreak Jailbreak&Evaluation Metrics
24.04 Xi’an Jiaotong-Liverpool University, Rutgers University, University of Liverpool arxiv Goal-guided Generative Prompt Injection Attack on Large Language Models Prompt Injection&Robustness&Mahalanobis Distance
24.04 University of New Haven arxiv SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS Multi-language Mixture&Adaptive Attack&LLM Security
24.04 The Ohio State University arxiv AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs Adversarial Suffix Generation
24.04 Renmin University of China, Microsoft Research arxiv Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector Safety&Attack Methods
24.04 Zhejiang University arxiv JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models Jailbreak Attacks&Visual Analytics
24.04 Shanghaitech University arxiv Don’t Say No: Jailbreaking LLM by Suppressing Refusal Jailbreaking Attacks&Adversarial Attacks
24.04 University of Electronic Science and Technology of China, Chengdu University of Technology arxiv TALK TOO MUCH: Poisoning Large Language Models under Token Limit Token Limitation&Poisoning Attack
24.04 ETH Zurich, EPFL, University of Twente, Georgia Institute of Technology arxiv Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs Aligned LLMs&Universal Jailbreak Backdoors&Poisoning Attacks
24.04 N/A arxiv Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge Trojan Detection&Model Robustness
24.04 Shanghai Jiao Tong University arxiv Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models Vision-Large-Language Models&Autonomous Driving&Security
24.04 Max-Planck-Institute for Intelligent Systems, AI at Meta (FAIR) arxiv AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs Adversarial Prompting&Safety in AI
24.04 Singapore Management University, Shanghai Institute for Advanced Study of Zhejiang University arxiv Evaluating and Mitigating Linguistic Discrimination in Large Language Models Linguistic Discrimination&Jailbreak&Defense
24.04 University of Louisiana at Lafayette, Beijing Electronic Science and Technology Institute, The Johns Hopkins University arxiv Assessing Cybersecurity Vulnerabilities in Code Large Language Models Code LLMs
24.04 University College London, The University of Melbourne, Macquarie University, University of Edinburgh arxiv Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning Cross-Lingual Transferability&Backdoor Attacks&Instruction Tuning
24.04 University of Cambridge, Indian Institute of Technology Bombay, University of Melbourne, University College London, Macquarie University ICLR 2024 Workshop Attacks on Third-Party APIs of Large Language Models Third-Party API&Security
24.04 Purdue University Fort Wayne NAACL2024 VertAttack: Taking advantage of Text Classifiers’ horizontal vision Text Classifiers&Adversarial Attacks&VertAttack
24.04 The University of Melbourne&Macquarie University&University College London NAACL2024 Backdoor Attacks on Multilingual Machine Translation Multilingual Machine Translation&Security&Backdoor Attacks
24.05 Institute of Information Engineering, Chinese Academy of Sciences arxiv Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent Prompt Jailbreak Attack&Red Team&Black-box Attack
24.05 University of Texas at Austin arxiv Mitigating Exaggerated Safety in Large Language Models Model Safety&Utility&Exaggerated Safety
24.05 Institute of Information Engineering, Chinese Academy of Sciences arxiv Chain of Attack: A Semantic-Driven Contextual Multi-Turn Attacker for LLM Multi-Turn Dialogue Attack&LLM Security&Semantic-Driven Contextual Attack
24.05 Peking University ICLR 2024 Workshop BOOSTING JAILBREAK ATTACK WITH MOMENTUM Jailbreak Attack&Momentum Method
24.05 École Polytechnique Fédérale de Lausanne ICML 2024 Revisiting character-level adversarial attacks Character-level Adversarial Attack&Robustness
24.05 Johns Hopkins University CCS 2024 PLeak: Prompt Leaking Attacks against Large Language Model Applications Prompt Leaking Attacks&Adversarial Queries
24.05 IT University of Copenhagen arxiv Hacc-Man: An Arcade Game for Jailbreaking LLMs creative problem solving&jailbreaking
24.05 The Hong Kong Polytechnic University arxiv No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks Fine-tuning Attacks&LLM Safeguarding&Mechanistic Interpretability
24.05 KAIST arxiv Automatic Jailbreaking of the Text-to-Image Generative AI Systems Jailbreaking&Text-to-Image&Generative AI
24.05 Fudan University arxiv White-box Multimodal Jailbreaks Against Large Vision-Language Models Fine-tuning Attacks&Multimodal Models&Adversarial Robustness
24.05 Singapore Management University arxiv Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing Jailbreak Attacks&Layer-specific Editing&LLM Safeguarding
24.05 Mila – Québec AI Institute arxiv Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning Red-Teaming&Safety Tuning&GFlowNet Fine-tuning
24.05 CISPA Helmholtz Center for Information Security arxiv Voice Jailbreak Attacks Against GPT-4o Jailbreak Attacks&Voice Mode&GPT-4o
24.05 Nanyang Technological University arxiv ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users Red-Teaming&Text-to-Image Models&Generative AI Safety
24.05 Xidian University arxiv Efficient LLM-Jailbreaking by Introducing Visual Modality Jailbreaking&Multimodal Models&Visual Modality
24.05 Institute of Information Engineering, Chinese Academy of Sciences arxiv Context Injection Attacks on Large Language Models Context Injection Attacks&Misleading Context
24.05 University of Illinois at Urbana-Champaign arxiv Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters Jailbreak&Moderation Guardrails&Cipher Characters
24.05 Northeastern University arxiv Phantom: General Trigger Attacks on Retrieval Augmented Language Generation Trigger Attacks&Retrieval Augmented Generation&Poisoning
24.05 Northwestern University arxiv Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens Jailbreak Attack&Silent Tokens
24.05 Peking University arxiv Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character Jailbreak Attack&MultiModal Large Language Models&Role-playing
24.05 Northwestern University arxiv Exploring Backdoor Attacks against Large Language Model-based Decision Making Backdoor Attacks&Decision Making
24.05 Beihang University arxiv Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models Jailbreak&Multimodal Large Language Models&Medical Contexts
24.05 Harbin Institute of Technology arxiv Improved Generation of Adversarial Examples Against Safety-aligned LLMs Adversarial Examples&Safety-aligned LLMs&Gradient-based Methods
24.05 Nanyang Technological University arxiv Improved Techniques for Optimization-Based Jailbreaking on Large Language Models Jailbreaking&Optimization Techniques
24.06 University of Central Florida arxiv BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models Retrieval-Augmented Generation&Poisoning Attacks
24.06 Zscaler, Inc. arxiv Exploring Vulnerabilities and Protections in Large Language Models: A Survey Prompt Hacking&Adversarial Attacks&Survey
24.06 Singapore Management University arxiv Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses Few-Shot Jailbreaking&Aligned Language Models&Adversarial Attacks
24.06 Capgemini Invent, Paris arxiv QROA: A Black-Box Query-Response Optimization Attack on LLMs Query-Response Optimization Attack&Black-Box
24.06 Huazhong University of Science and Technology arxiv AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens Jailbreak Attacks&Dependency Analysis
24.06 Beihang University arxiv Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt Jailbreak Attacks&Vision Language Models&Bi-Modal Adversarial Prompt
24.06 Zhengzhou University ACL 2024 BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents Backdoor Attacks&LLM Agents&Data Poisoning
24.06 Ludwig-Maximilians-University arxiv Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models Jailbreak Success&Latent Space Dynamics
24.06 Alibaba Group arxiv How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States LLM Safety&Alignment&Jailbreak
24.06 Beihang University arxiv Unveiling the Safety of GPT-4O: An Empirical Study Using Jailbreak Attacks GPT-4O&Jailbreak Attacks&Safety Evaluation
24.06 Nanyang Technological University arxiv A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures Backdoor Attacks&Defenses&Survey
24.06 Anomalee Inc. arxiv On Trojans in Refined Language Models Trojans&Refined Language Models&Data Poisoning
24.06 Purdue University arxiv When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-Guided Search Jailbreaking&Deep Reinforcement Learning
24.06 Xidian University arxiv StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure Jailbreak Attacks&StructuralSleight
24.06 Tsinghua University arxiv JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models Jailbreak Attempts&Evaluation Toolkit
24.06 The Hong Kong University of Science and Technology (Guangzhou) arxiv Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs Jailbreak Attacks&Benchmarking
24.06 Pennsylvania State University NAACL 2024 PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning Backdoor Removal&Adversarial Prompt Tuning&Few-shot Learning
24.06 Shanghai Jiao Tong University, Peking University, Shanghai AI Laboratory arxiv Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models Federated Instruction Tuning&Safety Attack&Defense
24.06 Michigan State University arxiv Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis Jailbreak Attacks&Representation Space Analysis
24.06 Chinese Academy of Sciences arxiv “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak Jailbreak&Hallucinations&LLMs
24.06 Tsinghua University arxiv Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack Knowledge-to-Jailbreak&Jailbreak Attacks&Domain-Specific Safety
24.06 University of Maryland arxiv Is Poisoning a Real Threat to LLM Alignment? Maybe More So Than You Think Poisoning Attacks&Direct Policy Optimization&Reinforcement Learning with Human Feedback
24.06 Carnegie Mellon University arxiv Jailbreak Paradox: The Achilles’ Heel of LLMs Jailbreak Paradox&Security
24.06 Carnegie Mellon University arxiv Adversarial Attacks on Multimodal Agents Adversarial Attacks&Multimodal Agents&Vision-Language Models
24.06 University of Washington, Allen Institute for AI arxiv ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates LLM Vulnerabilities&Jailbreak Attacks&Adversarial Training
24.06 University of Notre Dame, Huazhong University of Science and Technology, Tsinghua University, Lehigh University arxiv ObscurePrompt: Jailbreaking Large Language Models via Obscure Input Jailbreaking&Adversarial Attacks&Out-of-Distribution Data
24.06 The University of Hong Kong, Huawei Noah’s Ark Lab arxiv Jailbreaking as a Reward Misspecification Problem Jailbreaking&Reward Misspecification&Adversarial Attacks
24.06 UC Berkeley arxiv Adversaries Can Misuse Combinations of Safe Models Model Misuse&AI Safety&Task Decomposition
24.06 UC Santa Barbara arxiv MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations MultiAgent Collaboration&Adversarial Attacks
24.06 University of Southern California arxiv From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking Multimodal Jailbreaking&MLLMs&Security
24.06 KAIST arxiv CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset Code-Switching&Red-Teaming&Multilingualism
24.06 China University of Geosciences arxiv Large Language Models for Link Stealing Attacks Against Graph Neural Networks Link Stealing Attacks&Graph Neural Networks&Privacy Attacks
24.06 The Hong Kong Polytechnic University arxiv CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference Safety Evaluation&Dialogue Coreference&LLM Safety
24.06 Imperial College London arxiv Inherent Challenges of Post-Hoc Membership Inference for Large Language Models Membership Inference Attacks&Post-Hoc Evaluation&Distribution Shift
24.06 Hubei University arxiv Poisoned LangChain: Jailbreak LLMs by LangChain Jailbreak&Retrieval-Augmented Generation&LangChain
24.06 University of Central Florida arxiv Jailbreaking LLMs with Arabic Transliteration and Arabizi Jailbreaking&Arabic Transliteration&Arabizi
24.06 Hubei University TRAC 2024 Workshop SEEING IS BELIEVING: BLACK-BOX MEMBERSHIP INFERENCE ATTACKS AGAINST RETRIEVAL AUGMENTED GENERATION Membership Inference Attacks&Retrieval-Augmented Generation
24.06 Huazhong University of Science and Technology arxiv Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection Jailbreak Attacks&Special Tokens
24.06 UC Berkeley arxiv Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation AI Safety&Backdoors
24.07 University of Illinois Chicago arxiv Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks Jailbreak Attacks&Fallacious Reasoning
24.07 Palisade Research arxiv Badllama 3: Removing Safety Finetuning from Llama 3 in Minutes Safety Finetuning&Jailbreak Attacks
24.07 University of Illinois Urbana-Champaign arxiv JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models Jailbreaking&Vision-Language Models
24.07 Shanghai University of Finance and Economics arxiv SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack Jailbreak Attacks&Large Language Models&Social Facilitation
24.07 University of Exeter arxiv Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything Machine Learning&ICML&Jailbreak Attacks
24.07 Hong Kong University of Science and Technology arxiv JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets Visual Analytics&Jailbreak Prompts
24.07 CISPA Helmholtz Center for Information Security arxiv SOS! Soft Prompt Attack Against Open-Source Large Language Models Soft Prompt Attack&Open-Source Models
24.07 National University of Singapore arxiv Single Character Perturbations Break LLM Alignment Jailbreak Attacks&Model Alignment
24.07 Deutsches Forschungszentrum für Künstliche Intelligenz arxiv Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning Prompt Injection&Jailbreaking&Soft Prompts
24.07 UC Davis arxiv Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers Multi-turn Conversation&Backdoor Triggers&LLM Security
24.07 Tsinghua University arxiv Jailbreak Attacks and Defenses Against Large Language Models: A Survey Jailbreak Attacks&Defenses
24.07 Zhejiang University arxiv TAPI: Towards Target-Specific and Adversarial Prompt Injection against Code LLMs Target-Specific Attacks&Adversarial Prompt Injection&Malicious Code Generation
24.07 Northwestern University arxiv CEIPA: Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models Counterfactual Explanation&Prompt Attack Analysis&Incremental Prompt Injection
24.07 EPFL arxiv Does Refusal Training in LLMs Generalize to the Past Tense? Refusal Training&Past Tense Reformulation&Adversarial Attacks
24.07 University of Chicago arxiv AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases Red-teaming&LLM Agents&Poisoning
24.07 Wuhan University arxiv Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models Black-box Attacks&RAG&Opinion Manipulation
24.07 University of New South Wales arxiv Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models Continuous Embedding&Jailbreaking
24.07 Bloomberg arxiv Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) Threat Model&Red-Teaming
24.07 Stanford University arxiv When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? Universal Image Jailbreaks&Vision-Language Models&Transferability
24.07 Michigan State University arxiv Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis Moral Self-Correction&Intrinsic Mechanisms
24.07 Meetyou AI Lab arxiv Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models Adversarial Attacks&Hidden Intentions
24.07 Zhejiang Gongshang University arxiv Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models Jailbreak Attack&Analyzing-based Jailbreak
24.07 Zhejiang University arxiv RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent Red Teaming&Jailbreak Attacks&Context-aware Prompts
24.07 Confirm Labs arxiv Fluent Student-Teacher Redteaming Fluent Student-Teacher Redteaming&Adversarial Attacks
24.07 City University of Hong Kong ACM MM 2024 Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts Large Vision Language Model&Red Teaming&Jailbreak Attack
24.07 Huazhong University of Science and Technology NAACL 2024 Workshop Can Large Language Models Automatically Jailbreak GPT-4V? Jailbreak&Multimodal Information&Facial Recognition
24.07 Illinois Institute of Technology arxiv Can Editing LLMs Inject Harm? Knowledge Editing&Misinformation Injection&Bias Injection
24.07 KAIST arxiv Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks Adversarial Attack&Vision-Language Model&Contrastive Learning
24.07 CISPA Helmholtz Center for Information Security arxiv Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification LLM Agents&Security Vulnerability&Autonomous Systems
24.08 CISPA Helmholtz Center for Information Security arxiv Vera Verto: Multimodal Hijacking Attack Multimodal Hijacking Attack&Model Hijacking
24.08 Shandong University arxiv Jailbreaking Text-to-Image Models with LLM-Based Agents Jailbreak Attacks&Vision-Language Models (VLMs)&Generative AI Safety
24.08 Technological University Dublin arxiv Pathway to Secure and Trustworthy 6G for LLMs: Attacks, Defense, and Opportunities 6G Networks&Security&Membership Inference Attacks
24.08 Microsoft arxiv WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes Cross-Prompt Injection Attack&Greedy Coordinate Gradient&Data Exfiltration
24.08 NYU & Meta AI, FAIR arxiv Mission Impossible: A Statistical Perspective on Jailbreaking LLMs Jailbreaking&Reinforcement Learning with Human Feedback
24.08 Beihang University arxiv Compromising Embodied Agents with Contextual Backdoor Attacks Embodied Agents&Contextual Backdoor Attacks&Adversarial In-Context Generation
24.08 FAR AI arxiv Scaling Laws for Data Poisoning in LLMs Data Poisoning&Scaling Laws
24.08 The University of Western Australia arxiv A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems Mobile Robot&Prompt Injection
24.08 Fudan University arxiv EnJa: Ensemble Jailbreak on Large Language Models Jailbreaking&Security
24.08 Bocconi University arxiv Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models Jailbreaking&Multilingual Safety
24.08 Xidian University arxiv Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles Multi-Turn Jailbreak Attack&Contextual Fusion Attack
24.08 Cornell Tech arxiv A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares Jailbroken GenAI Models&PromptWares&GenAI-powered Applications
24.08 University of California Irvine arxiv Using Retriever Augmented Large Language Models for Attack Graph Generation Retriever Augmented Generation&Attack Graphs&Cybersecurity
24.08 University of California, Los Angeles CCS 2024 BadMerging: Backdoor Attacks Against Model Merging Backdoor Attack&Model Merging&AI Security
24.08 Stanford University arxiv Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search Black-Box Attacks&Markov Decision Processes&Monte Carlo Tree Search
24.08 Shanghai Jiao Tong University arxiv Transferring Backdoors between Large Language Models by Knowledge Distillation Backdoor Attacks&Knowledge Distillation
24.08 Tsinghua University arxiv Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation Safety Response Boundary&Unsafe Decoding Path
24.08 Singapore University of Technology and Design arxiv FERRET: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique Automated Red Teaming&Adversarial Prompts&Reward-Based Scoring
24.08 Shanghai Jiao Tong University arxiv MEGen: Generative Backdoor in Large Language Models via Model Editing Backdoor Attacks&Model Editing
24.08 Chinese Academy of Sciences arxiv DiffZOO: A Purely Query-Based Black-Box Attack for Red-Teaming Text-to-Image Generative Model via Zeroth Order Optimization Black-Box Attack&Text-to-Image Generative Model&Zeroth Order Optimization
24.08 The Pennsylvania State University arxiv Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles Jailbreak Attacks&Prompt Injection
24.08 Xi'an Jiaotong University arxiv Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer Jailbreak Attacks&Adversarial Suffixes
24.08 Nanjing University of Information Science and Technology arxiv Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks Textual Backdoor Attacks&Sample Selection
24.08 Nankai University arxiv RT-Attack: Jailbreaking Text-to-Image Models via Random Token Jailbreak&Text-to-Image&Adversarial Attacks
24.08 Harbin Institute of Technology, Shenzhen arxiv TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models Adversarial Attack&Transferability&Efficiency
24.08 Shenzhen Research Institute of Big Data arxiv Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models Target-Driven Attacks&Internal Faults&Reinforcement Learning
24.08 National University of Singapore arxiv Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models Adversarial Suffixes&Transfer Learning&Jailbreak
24.08 Scale AI arxiv LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Multi-Turn Jailbreaks&LLM Defense&Human Red Teaming
24.09 University of California, Berkeley arxiv Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks Multi-Turn Jailbreak&Frontier Models&LLM Security
24.09 University of Southern California arxiv Rethinking Backdoor Detection Evaluation for Language Models Backdoor Attacks&Detection Robustness&Training Intensity
24.09 Michigan State University arxiv The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs User-Guided Poisoning&RLHF&Toxicity Manipulation
24.09 University of Cambridge arxiv Conversational Complexity for Assessing Risk in Large Language Models Conversational Complexity&Risk Assessment
24.09 Independent arxiv Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA) Single-Turn Crescendo Attack&Adversarial Attacks
24.09 CISPA Helmholtz Center for Information Security CCS 2024 Membership Inference Attacks Against In-Context Learning Membership Inference Attacks&In-Context Learning
24.09 Radboud University, Ikerlan Research Centre arxiv Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers Backdoor Attacks&In-Context Learning&Vision Transformers
24.09 Institute of Information Engineering, Chinese Academy of Sciences arxiv AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs Jailbreak Attacks&Adaptive Position Pre-Fill
24.09 Technion - Israel Institute of Technology, Intuit, Cornell Tech arxiv Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking Jailbreaking&RAG Inference&Data Extraction
24.09 University of Texas at San Antonio arxiv Jailbreaking Large Language Models with Symbolic Mathematics Jailbreaking&Symbolic Mathematics
24.09 Beihang University arxiv PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach LLM Security Vulnerabilities&Jailbreak Attack&Reinforcement Learning
24.09 AWS AI arxiv Order of Magnitude Speedups for LLM Membership Inference Membership Inference&Quantile Regression
24.09 Nanyang Technological University arxiv Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs Jailbreaking Attacks&Fuzz Testing&LLM Security
24.09 Hippocratic AI arxiv RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking Jailbreaking&Multi-Turn Attacks&Concealed Attacks
24.09 Nanyang Technological University arxiv Weak-to-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation Backdoor Attacks&Contrastive Knowledge Distillation&Parameter-Efficient Fine-Tuning
24.09 Georgia Institute of Technology arxiv Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey Harmful Fine-tuning&LLM Attacks&LLM Defenses
24.09 Institut Polytechnique de Paris arxiv Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity ASCII Art&LLM Attacks&Toxicity Detection
24.09 LMU Munich arxiv Multimodal Pragmatic Jailbreak on Text-to-image Models Multimodal Pragmatic Jailbreak&Text-to-image Models&Safety Filters
24.10 Stony Brook University arxiv BUCKLE UP: ROBUSTIFYING LLMS AT EVERY CUSTOMIZATION STAGE VIA DATA CURATION Jailbreaking&LLM Customization&Data Curation
24.10 National University of Singapore arxiv FLIPATTACK: Jailbreak LLMs via Flipping Jailbreak&Adversarial Attacks
24.10 University of Wisconsin–Madison, NVIDIA arxiv AUTODAN-TURBO: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs Jailbreak&Strategy Self-Exploration
24.10 University College London, Stanford University arxiv Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems Prompt Infection&Multi-Agent Systems
24.10 University of Wisconsin–Madison arxiv RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process Jailbreak attack&Prompt decomposition&LLMs defense
24.10 UC Santa Cruz, Johns Hopkins University, University of Edinburgh, Peking University arxiv AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation Jailbreaking&LLMs vulnerability&Optimization-based attacks
24.10 Independent Researcher arxiv Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations LLM red-teaming&Jailbreaking defenses&Prompt engineering
24.10 Beihang University, Tsinghua University, Peking University arxiv BLACKDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models Jailbreak&Multi-objective optimization&Black-box attack
24.10 Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Beihang University arxiv Derail Yourself: Multi-Turn LLM Jailbreak Attack Through Self-Discovered Clues Multi-turn attacks&Jailbreak&Self-discovered clues
24.10 Tsinghua University, Sea AI Lab, Peng Cheng Laboratory arxiv Denial-of-Service Poisoning Attacks on Large Language Models Denial-of-Service&Poisoning attack
24.10 University of New Haven, Robust Intelligence arxiv COGNITIVE OVERLOAD ATTACK: PROMPT INJECTION FOR LONG CONTEXT Cognitive overload&Prompt injection&Jailbreak
24.10 Harbin Institute of Technology, Tencent, University of Glasgow, Independent Researcher arxiv DECIPHERING THE CHAOS: ENHANCING JAILBREAK ATTACKS VIA ADVERSARIAL PROMPT TRANSLATION Jailbreak attacks&Adversarial prompt&Gradient-based optimization
24.10 Monash University arxiv JIGSAW PUZZLES: Splitting Harmful Questions to Jailbreak Large Language Models Jailbreak&Multi-turn attack&Query splitting
24.10 Wuhan University arxiv Multi-Round Jailbreak Attack on Large Language Models Jailbreak&Multi-round attack
24.10 The Hong Kong University of Science and Technology (Guangzhou), University of Birmingham, Baidu Inc. arxiv JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework Jailbreak judge&Multi-agent framework
24.10 Theori Inc. ICLR 2025 Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems Political correctness&Jailbreak&Ethical vulnerabilities
24.10 University of Manitoba arxiv SoK: Prompt Hacking of Large Language Models Prompt Hacking&Jailbreak Attacks
24.10 Thales DIS arxiv Backdoored Retrievers for Prompt Injection Attacks on Retrieval-Augmented Generation of Large Language Models Retrieval-Augmented Generation&Prompt Injection&Backdoor Attacks
24.10 Duke University arxiv Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment Prompt Injection&Poisoning Alignment&LLM Vulnerabilities
24.10 Tsinghua University arxiv Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models Jailbreak Attacks&Discrete Optimization&Adversarial Attacks
24.10 International Digital Economy Academy arxiv SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis SMILES-Prompting&Jailbreak Attacks&Chemical Safety
24.10 Beijing University of Posts and Telecommunications arxiv FEINT AND ATTACK: Attention-Based Strategies for Jailbreaking and Protecting LLMs Attention Mechanisms&Jailbreak Attacks&Defense Strategies
24.10 IBM Research AI arxiv Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In ReAct Agents&Prompt Injection&Foot-in-the-Door Attack
24.10 Google DeepMind arxiv Remote Timing Attacks on Efficient Language Model Inference Timing Attacks&Efficient Inference&Privacy
24.10 Meta arxiv Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks Fine-Tuning Attacks&Multilingual LLMs&Safety Alignment
24.10 University of California San Diego arxiv Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities Jailbreaking&LLM Vulnerabilities&Adversarial Attacks
24.10 Florida State University arxiv Adversarial Attacks on Large Language Models Using Regularized Relaxation Adversarial Attacks&Continuous Optimization
24.10 The Pennsylvania State University arxiv Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors Adversarial Attacks&Detection Evasion
24.10 Nanyang Technological University arxiv Mask-based Membership Inference Attacks for Retrieval-Augmented Generation Retrieval-Augmented Generation&Membership Inference Attacks&Privacy
24.10 George Mason University arxiv Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks Prompt Injection Defense&LLM Cybersecurity&Adversarial Inputs
24.10 Fudan University arxiv BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks Jailbreak Defense&Vision-Language Models&Reinforcement Learning
24.10 Harbin Institute of Technology arxiv Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring Jailbreak Attacks&Adversarial Prompts
24.10 SRM Institute of Science and Technology arxiv Palisade - Prompt Injection Detection Framework Prompt Injection&Heuristic-based Detection
24.10 The Ohio State University arxiv AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts Jailbreak Attacks&Adversarial Suffixes&Generative Models
24.10 Zhejiang University arxiv HIJACKRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models Retrieval-Augmented Generation&Prompt Injection Attacks&Security Vulnerability
24.10 The Baldwin School arxiv Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures Prompt Injection Vulnerabilities&Model Susceptibility
24.10 Competition for LLM and Agent Safety 2024 arxiv Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models Black-box Jailbreak Attacks&Ensemble Methods
24.10 University of Electronic Science and Technology of China arxiv Pseudo-Conversation Injection for LLM Goal Hijacking Goal Hijacking&Prompt Injection
24.10 Monash University arxiv Audio Is the Achilles’ Heel: Red Teaming Audio Large Multimodal Models Audio Multimodal Models&Safety Vulnerabilities&Jailbreak Attacks
24.11 Fudan University arxiv IDEATOR: Jailbreaking VLMs Using VLMs Vision-Language Models&Jailbreak Attack&Multimodal Safety
24.11 International Computer Science Institute arxiv Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection Emoji Attack&Jailbreaking&Judge LLMs Bias
24.11 Peking University arxiv B4: A Black-Box ScruBBing Attack on LLM Watermarks Black-Box Attack&Watermark Removal&Adversarial Text Generation
24.11 University of Science and Technology of China arxiv SQL Injection Jailbreak: a structural disaster of large language models SQL Injection&Jailbreak Attack&LLM Vulnerability
24.11 University of Illinois Urbana-Champaign arxiv Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment Random Augmentations&Safety Alignment&LLM Jailbreak
24.11 Cambridge ERA: AI Fellowship arxiv What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks Jailbreak Prompts&Nonlinear Probes&Adversarial Attacks
24.11 Alibaba Group arxiv MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue Multi-Round Dialogue&Jailbreak Agent&LLM Vulnerability
24.11 Columbia University arxiv Diversity Helps Jailbreak Large Language Models Jailbreak Techniques&LLM Safety&Prompt Diversity
24.11 Bangladesh University of Engineering and Technology arxiv SequentialBreak: Large Language Models Can Be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains Jailbreak Attacks&Prompt Engineering&LLM Vulnerabilities
24.11 Xi'an Jiaotong-Liverpool University arxiv Target-driven Attack for Large Language Models Black-box Attacks&Optimization Methods
24.11 Georgia Institute of Technology arxiv LLM STINGER: Jailbreaking LLMs using RL Fine-tuned LLMs Jailbreaking Attacks&Reinforcement Learning&Adversarial Suffixes
24.11 Beijing University of Posts and Telecommunications arxiv Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey Jailbreak Attacks&Multimodal Generative Models&Security Challenges
24.11 Arizona State University NeurIPS 2024 SafeGenAI Workshop Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models Black-box Jailbreaking&Multi-modal Models&Zeroth-order Optimization
24.11 Zhejiang University arxiv JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Jailbreak Attacks&Large Language Models&Mechanism Interpretability
24.11 University of Electronic Science and Technology of China arxiv Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models Jailbreaking&Large Vision-Language Models&Safety Snowball Effect
24.11 Tsinghua University arxiv Playing Language Game with LLMs Leads to Jailbreaking Jailbreaking&Language Games&LLM Safety
24.11 University of Texas at Dallas arxiv AttentionBreaker: Adaptive Evolutionary Optimization for Unmasking Vulnerabilities in LLMs through Bit-Flip Attacks Bit-Flip Attacks&Model Vulnerability Optimization
24.11 BITS Pilani arxiv GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs Jailbreaking&Latent Bayesian Optimization&Adversarial Prompts
24.11 Nanyang Technological University, Wuhan University arxiv Neutralizing Backdoors through Information Conflicts for Large Language Models Backdoor Defense&Information Conflicts&Model Security
24.11 Duke University, University of Louisville arxiv LoBAM: LoRA-Based Backdoor Attack on Model Merging Model Merging&Backdoor Attack&LoRA
24.11 Université de Sherbrooke, University of Kinshasa arxiv Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective Jailbreak Prompts&Cyber Defense&AI Security

💻Presentations & Talks

📖Tutorials & Workshops

Date Type Title URL
23.01 Community Reddit/ChatGPTJailbreak link
23.02 Resource&Tutorials Jailbreak Chat link
23.10 Tutorials Awesome-LLM-Safety link
23.10 Article Adversarial Attacks on LLMs (Author: Lilian Weng) link
23.11 Video [1hr Talk] Intro to Large Language Models, From 45:45 (Author: Andrej Karpathy) link

📰News & Articles

Date Type Title Author URL
23.10 Article Adversarial Attacks on LLMs Lilian Weng link

🧑‍🏫Scholars