Adversarial Attacks on Robotic Vision-Language-Action Models
Paper: arXiv:2506.03350
Code: GitHub
Motivation
Large language models (LLMs) are known to be vulnerable to adversarial attacks (such as jailbreaking).
Do VLAs built on the LLM architecture inherit these vulnerabilities?
Can attackers manipulate robots into performing arbitrary target actions by means of text prompts?
A VLA model maps [Text Instruction] + [Current Image] → 7-DoF Action Vector.
If you can edit the text prompt, you can hijack the robot's behavior.
Method
Predefine a Target Action
- Attacker selects a desired 7-dimensional action $a^* \in [-1,1]^7$ (e.g., [0.9, -0.3, 0, 0, 0, 0, 1.0] for "move forward + open gripper").
- Discretize each dimension into 256 bins → map to 7 fixed tokens via the model's Symbol Tuning vocabulary.
- These 7 tokens $T^* = [\tau_1, ..., \tau_7]$ become the frozen optimization target.
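The discretization step can be sketched as follows. This is a minimal illustration, assuming uniform-width bins over [-1, 1] and a hypothetical token layout in which the last 256 vocabulary IDs are the action tokens. The names `action_to_bins` and `bins_to_token_ids` and the value `vocab_size=32000` are placeholders; the real bin edges and token mapping depend on the model's tokenizer.

```python
def action_to_bins(action, n_bins=256, low=-1.0, high=1.0):
    """Map each action dimension in [low, high] to an integer bin in [0, n_bins - 1]."""
    bins = []
    for a in action:
        a = min(max(a, low), high)            # clip to the valid range
        frac = (a - low) / (high - low)       # relative position in [0, 1]
        bins.append(min(int(frac * n_bins), n_bins - 1))  # upper edge falls in the last bin
    return bins


def bins_to_token_ids(bins, vocab_size=32000, n_bins=256):
    """ASSUMPTION: the last n_bins vocabulary IDs are reserved for action bins."""
    first_action_id = vocab_size - n_bins
    return [first_action_id + b for b in bins]


target_action = [0.9, -0.3, 0.0, 0.0, 0.0, 0.0, 1.0]
target_bins = action_to_bins(target_action)
target_tokens = bins_to_token_ids(target_bins)   # plays the role of T* in this sketch
print(target_bins)
print(target_tokens)
```

Once computed, the 7 token IDs stay fixed for the whole optimization; only the suffix changes.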
Construct the Input Prompt
x_{1:n} = [Natural Language Task] ⊕ [Adversarial Suffix S]
↑_______ fixed _______↑ ↑___ optimizable ___↑
- Base instruction: e.g., "Pick up the red cup."
- Suffix $S$: 20 randomly initialized tokens (the only attack variables).
Optimize the Suffix via GCG
Loss function (for single-step attack):
$$ \mathcal{L} = -\log \Pr\big(T^* \mid x_{1:n}, z\big) $$
where $z$ = image embedding (frozen, read-only context).
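By the chain rule, $\log \Pr(T^* \mid x_{1:n}, z) = \sum_{k=1}^{7} \log \Pr(\tau_k \mid x_{1:n}, z, \tau_{<k})$, so the loss is a sum of per-token cross-entropies. The sketch below assumes the per-position logits come from a teacher-forced forward pass; `step_logits` and `target_nll` are hypothetical names, and the tiny 4-token vocabulary is only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def target_nll(step_logits, target_tokens):
    """-log Pr(T* | prompt, image), given logits at each target position.

    step_logits[k] holds the logits for the k-th target token, computed
    with the earlier target tokens fed in (teacher forcing).
    """
    return -sum(log_softmax(z)[t] for z, t in zip(step_logits, target_tokens))


# Toy example: 3 target tokens over a 4-token vocabulary.
step_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
target = [0, 1, 3]
loss = target_nll(step_logits, target)
print(f"loss = {loss:.4f}")
```

The loss reaches zero only when every target token gets probability 1, which is why it works as an optimization target for the suffix.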
Greedy Coordinate Gradient (GCG) loop:
- Forward pass: compute $\mathcal{L}$ with current suffix $S^{(t)}$ and image $z$.
- Continuous relaxation: replace suffix token IDs with their embeddings $E_S$, compute $\nabla_{E_S} \mathcal{L}$ via backprop.
- Greedy token swap: for each suffix position, score vocabulary tokens by how well their embeddings align with $-\nabla_{E_S} \mathcal{L}$; evaluate the top candidates with a forward pass and keep a swap only if it reduces the loss.
- Early stop: if the model autoregressively outputs exactly $T^*$, done.
- Repeat (~30–110 iterations on average).
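The loop above can be sketched on a toy problem. This is not the paper's code and the "model" is not a VLA: it is a small linear stand-in with an analytic gradient, so the example runs with the standard library alone. All sizes, the weights `E`, `W`, `POS`, and the hyperparameters `top_k` and `batch` are assumptions. Candidates are sampled from the top-k tokens per position and checked with a real loss evaluation, following the usual GCG recipe.

```python
import math
import random

rng = random.Random(0)
V, D, M, K, A = 40, 6, 5, 3, 8   # vocab, embed dim, suffix length, target length, output classes
E = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]                       # token embeddings
W = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(A)] for _ in range(K)]  # one head per target slot
POS = [1.0, -0.5, 0.8, -1.2, 0.6]   # toy per-position weights (make positions differ)
TARGET = [2, 5, 1]                   # toy stand-in for T*


def logits(suffix):
    h = [sum(POS[i] * E[t][j] for i, t in enumerate(suffix)) for j in range(D)]
    return [[sum(w * x for w, x in zip(row, h)) for row in W[k]] for k in range(K)]


def log_softmax(z):
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_norm for v in z]


def loss(suffix):
    return -sum(log_softmax(z)[t] for z, t in zip(logits(suffix), TARGET))


def grad_hidden(suffix):
    """Analytic dL/dh for the toy model (a real attack would use backprop)."""
    g = [0.0] * D
    for k, z in enumerate(logits(suffix)):
        probs = [math.exp(v) for v in log_softmax(z)]
        for a in range(A):
            coeff = probs[a] - (1.0 if a == TARGET[k] else 0.0)
            for j in range(D):
                g[j] += coeff * W[k][a][j]
    return g


def gcg_step(suffix, top_k=8, batch=32):
    g = grad_hidden(suffix)
    top = []
    for i in range(M):
        # dL/d(embedding at position i) = POS[i] * g; the smallest dot product
        # with a token embedding predicts the largest first-order loss decrease.
        ranked = sorted(range(V), key=lambda v: POS[i] * sum(a * b for a, b in zip(g, E[v])))
        top.append(ranked[:top_k])
    best, best_loss = suffix, loss(suffix)
    for _ in range(batch):               # evaluate random single-token swaps
        i = rng.randrange(M)
        cand = list(suffix)
        cand[i] = rng.choice(top[i])
        cand_loss = loss(cand)
        if cand_loss < best_loss:        # keep a swap only if it truly lowers the loss
            best, best_loss = cand, cand_loss
    return best


def greedy_decode(suffix):
    return [max(range(A), key=lambda a: z[a]) for z in logits(suffix)]


def run_gcg(max_iters=100):
    suffix = [rng.randrange(V) for _ in range(M)]   # random initialization
    history = [loss(suffix)]
    for _ in range(max_iters):
        if greedy_decode(suffix) == TARGET:         # early stop on exact match
            break
        suffix = gcg_step(suffix)
        history.append(loss(suffix))
    return suffix, history


suffix, history = run_gcg()
print(f"iterations: {len(history) - 1}, loss: {history[0]:.3f} -> {history[-1]:.3f}")
print("target reached:", greedy_decode(suffix) == TARGET)
```

The gradient only ranks candidates. Because it is a first-order estimate over discrete tokens, each swap is accepted on the basis of an actual loss evaluation, which keeps the loss from increasing between iterations.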
Persistence Attack (Multi-Image Robustness)
To make the suffix work across changing visual inputs:
$$ \mathcal{L}_{\text{persist}} = \sum_{j=1}^{r} -\log \Pr\big(T^* \mid x_{1:n}, z_j\big) $$
- Sample $r$ image embeddings $\{z_1, ..., z_r\}$ from different robot states.
- Jointly optimize suffix to minimize aggregate loss → suffix becomes robust to visual perturbations.
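A toy illustration of why the aggregate loss matters: a suffix that is best for one image can be poor for others. Here `toy_nll` is a hypothetical stand-in for $-\log \Pr(T^* \mid x_{1:n}, z)$ in which suffixes and images are 2-D vectors and the target logit is their dot product; nothing in it comes from the paper.

```python
import math


def persistence_loss(nll_fn, suffix, images):
    """Sum of per-image losses: sum_j -log Pr(T* | x, z_j)."""
    return sum(nll_fn(suffix, z) for z in images)


def toy_nll(suffix, z):
    """ASSUMPTION: binary toy model, Pr(target) = sigmoid(suffix . z)."""
    logit = sum(s * v for s, v in zip(suffix, z))
    return math.log1p(math.exp(-logit))


images = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]        # r = 3 toy image embeddings
candidates = {"A": (4.0, -1.0), "B": (2.0, 2.0)}     # two candidate suffixes

for name, s in candidates.items():
    single = toy_nll(s, images[0])
    total = persistence_loss(toy_nll, s, images)
    print(f"{name}: single-image loss = {single:.3f}, aggregate loss = {total:.3f}")

best_single = min(candidates, key=lambda n: toy_nll(candidates[n], images[0]))
best_persist = min(candidates, key=lambda n: persistence_loss(toy_nll, candidates[n], images))
print("best on first image:", best_single, "| best across images:", best_persist)
```

In the GCG loop, the only change is that the gradient and the candidate evaluation use this summed loss in place of the single-image loss.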
Deploy
- Export the final suffix $S^*$ as plain text.
- Append to any instruction: "Pick up the cup " + decode(S^*).
- Feed to VLA in simulation or real robot → robot executes attacker's predefined action, regardless of actual scene content.
Results Snapshot
- Single-step attack success: 77–97% across 4 OpenVLA fine-tunes (LIBERO benchmark).
- Persistence: Attack-induced actions last up to 28× longer with multi-image optimization.
- Transferability: Suffixes optimized on OpenVLA often work on unseen architectures (TraceVLA, CogACT, OpenPi0).