Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
Abstract
The Adversarial Confusion Attack targets multimodal large language models to induce systematic disruption, leading to incoherent or confidently incorrect outputs, using only a small ensemble of open-source surrogate models and basic adversarial techniques (PGD).
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Practical applications include embedding such adversarial images into websites to prevent MLLM-powered AI Agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, in both the full-image and Adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
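To make the description above concrete, one plausible formalization of the attack objective (our notation; the paper's exact formulation may average over prompts or decoding positions differently) is:

```latex
% Entropy-maximization objective implied by the abstract (notation ours).
\[
\max_{\|\delta\|_\infty \le \epsilon} \;
\frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}}
H\!\bigl(p_m(\,\cdot \mid x + \delta,\, t)\bigr),
\qquad
H(p) = -\sum_{v \in \mathcal{V}} p(v)\,\log p(v),
\]
where $\mathcal{M}$ is the small surrogate ensemble, $x$ the clean image,
$t$ the text prompt, $p_m(\cdot \mid x + \delta, t)$ model $m$'s next-token
distribution over vocabulary $\mathcal{V}$, and $\epsilon$ the perturbation
budget. PGD ascends this objective with signed-gradient steps projected back
into the $\ell_\infty$ ball:
\[
\delta \leftarrow \Pi_{\|\delta\|_\infty \le \epsilon}
\bigl(\delta + \alpha \operatorname{sign}\nabla_\delta \bar{H}(\delta)\bigr),
\]
where $\bar{H}$ denotes the ensemble-averaged entropy above.
```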
Community
We introduce the Adversarial Confusion Attack as a new mechanism for protecting websites from MLLM-powered AI Agents. Embedding these “Adversarial CAPTCHAs” into web content pushes models into systemic decoding failures, from confident hallucinations to full incoherence. The perturbations disrupt all white-box models we test and transfer to proprietary systems like GPT-5 in the full-image setting. Technically, the attack uses PGD to maximize next-token entropy across a small surrogate ensemble of MLLMs.
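As a concrete illustration of that recipe, below is a minimal PyTorch sketch of ℓ∞-bounded PGD that maximizes ensemble-averaged next-token entropy. Everything named here is an assumption for illustration: `ToySurrogate` is a hypothetical stand-in for an MLLM's image-conditioned next-token logits (the real attack would use open-source MLLMs conditioned on an image and a prompt), and `eps`, `alpha`, and `steps` are placeholder hyperparameters rather than the paper's settings.

```python
# Minimal PGD sketch of the entropy-maximization objective described above.
# ToySurrogate is a hypothetical stand-in that maps an image to next-token
# logits, so the script stays self-contained and runnable; eps/alpha/steps
# are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySurrogate(nn.Module):
    """Stand-in for an MLLM: image -> logits over a token vocabulary."""
    def __init__(self, vocab_size: int = 1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, vocab_size),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)


def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, averaged over the batch."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()


def confusion_attack(image, ensemble, eps=8 / 255, alpha=1 / 255, steps=100):
    """L_inf PGD that *maximizes* mean next-token entropy across the ensemble."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        loss = torch.stack([next_token_entropy(m(adv)) for m in ensemble]).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient *ascent* on entropy
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)


if __name__ == "__main__":
    torch.manual_seed(0)
    ensemble = [ToySurrogate().eval() for _ in range(3)]  # small surrogate ensemble
    clean = torch.rand(1, 3, 224, 224)
    adv = confusion_attack(clean, ensemble)
    print("perturbation L_inf:", (adv - clean).abs().max().item())
```

The two design points worth noting are that the loss is ascended (entropy is maximized, not minimized) and that the perturbation is re-projected into the ε-ball after every signed-gradient step.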
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MS-GAGA: Metric-Selective Guided Adversarial Generation Attack (2025)
- SmoothGuard: Defending Multimodal Large Language Models with Noise Perturbation and Clustering Aggregation (2025)
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models (2025)
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs (2025)
- Defense That Attacks: How Robust Models Become Better Attackers (2025)
- Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models (2025)
- Black-box Optimization of LLM Outputs by Asking for Directions (2025)