RoboGCG - TechBlog

Adversarial Attacks on Robotic Vision-Language-Action Models

Paper: arXiv:2506.03350
Code: GitHub

Motivation

Large language models (LLMs) are known to be vulnerable to adversarial attacks (such as jailbreaking).

Do VLAs built on the LLM architecture inherit these vulnerabilities?

Can attackers manipulate robots into performing arbitrary target actions by means of text prompts?

For a VLA model: [Text Instruction] + [Current Image] → 7-DoF Action Vector

If you can edit the text prompt, you can hijack the robot's behavior.

Attacker selects a desired 7-dimensional action $a^* \in [-1,1]^7$ (e.g., [0.9, -0.3, 0, 0, 0, 0, 1.0] for "move forward + open gripper").
Discretize each dimension into 256 bins → map to 7 fixed tokens via the model's Symbol Tuning vocabulary.
These 7 tokens $T^* = [\tau_1, ..., \tau_7]$ become the frozen optimization target.
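
The target construction above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes uniform bins over $[-1, 1]$ and a hypothetical layout in which the 256 action tokens occupy the last IDs of a 32,000-token vocabulary. The real bin edges and token IDs depend on the model's tokenizer.

```python
N_BINS = 256
VOCAB_SIZE = 32000  # assumption: illustrative vocabulary size, not read from a real tokenizer


def action_to_bins(action, n_bins=N_BINS):
    """Map each action dimension in [-1, 1] to a uniform bin index in [0, n_bins - 1]."""
    bins = []
    for a in action:
        a = max(-1.0, min(1.0, a))            # clip to the valid action range
        idx = int((a + 1.0) / 2.0 * n_bins)   # scale [-1, 1] to [0, n_bins]
        bins.append(min(idx, n_bins - 1))     # a == 1.0 falls into the last bin
    return bins


def bins_to_token_ids(bins, vocab_size=VOCAB_SIZE):
    """Hypothetical mapping: bin b uses token ID vocab_size - 1 - b."""
    return [vocab_size - 1 - b for b in bins]


target_action = [0.9, -0.3, 0.0, 0.0, 0.0, 0.0, 1.0]
target_bins = action_to_bins(target_action)
target_tokens = bins_to_token_ids(target_bins)  # plays the role of T*
```

Once computed, `target_tokens` stays fixed for the whole optimization; only the suffix changes.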

x_{1:n} = [Natural Language Task] ⊕ [Adversarial Suffix S]
          ↑_______ fixed _______↑    ↑___ optimizable ___↑

Loss function (for single-step attack):

$$ \mathcal{L} = -\log \Pr\big(T^* \mid x_{1:n}, z\big) $$

where $z$ = image embedding (frozen, read-only context).
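
Because the model decodes the action one token at a time, the joint probability factorizes over the seven target tokens by the chain rule. The shorthand $\tau_{<i}$ for the target tokens before position $i$ is introduced here for illustration:

$$ \mathcal{L} = -\sum_{i=1}^{7} \log \Pr\big(\tau_i \mid x_{1:n}, z, \tau_{<i}\big) $$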

Greedy Coordinate Gradient (GCG) loop:

Forward pass: compute $\mathcal{L}$ with current suffix $S^{(t)}$ and image $z$.
Continuous relaxation: replace suffix token IDs with their embeddings $E_S$, compute $\nabla_{E_S} \mathcal{L}$ via backprop.
Greedy token swap: for each suffix position, find the vocabulary token whose embedding aligns best with the negative gradient $-\nabla_{E_S} \mathcal{L}$ at that position; keep the swap only if it reduces the loss.
Early stop: if the model autoregressively outputs exactly $T^*$, done.
Repeat (~30–110 iterations on average).
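
The loop above can be illustrated with a toy sketch. Everything here is a stand-in chosen so the example runs without a model: the random embedding table `E`, the per-position weights `W`, and the quadratic `loss` are assumptions that replace the VLA and its negative log-likelihood. The sketch tries one gradient-selected candidate per position and keeps the single best swap, and it stops when no swap helps, whereas the real attack stops once the model decodes $T^*$.

```python
import random

random.seed(0)
VOCAB, DIM, SUFFIX_LEN = 50, 8, 5

# Toy stand-ins (assumptions): random token embeddings and a quadratic "loss".
E = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W = [random.uniform(0.5, 1.5) for _ in range(SUFFIX_LEN)]  # per-position weight
TARGET = [random.gauss(0, 1) for _ in range(DIM)]


def residual(suffix):
    pred = [sum(W[i] * E[t][d] for i, t in enumerate(suffix)) for d in range(DIM)]
    return [p - y for p, y in zip(pred, TARGET)]


def loss(suffix):
    return sum(r * r for r in residual(suffix))


def grad_wrt_embeddings(suffix):
    """Gradient of the toy loss with respect to each suffix token's embedding."""
    r = residual(suffix)
    return [[2 * W[i] * rd for rd in r] for i in range(len(suffix))]


def gcg_step(suffix):
    grads = grad_wrt_embeddings(suffix)
    best, best_loss = suffix, loss(suffix)
    for i, g in enumerate(grads):
        # Candidate: the token whose embedding aligns most with -gradient.
        cand = min(range(VOCAB), key=lambda v: sum(gd * ed for gd, ed in zip(g, E[v])))
        trial = suffix[:i] + [cand] + suffix[i + 1:]
        trial_loss = loss(trial)  # re-evaluate the true loss, not the linear estimate
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss


suffix = [random.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
initial_loss = loss(suffix)
history = [initial_loss]
for _ in range(30):
    suffix, current = gcg_step(suffix)
    if current >= history[-1]:  # no swap helped; toy substitute for the early stop
        break
    history.append(current)
final_loss = history[-1]
```

The gradient only proposes candidates; each proposal is accepted on the basis of a fresh loss evaluation, which is why the loss in `history` never increases.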

To make the suffix work across changing visual inputs:

$$ \mathcal{L}_{\text{persist}} = \sum_{j=1}^{r} -\log \Pr\big(T^* \mid x_{1:n}, z_j\big) $$

Sample $r$ image embeddings $\{z_1, ..., z_r\}$ from different robot states.
Jointly optimize suffix to minimize aggregate loss → suffix becomes robust to visual perturbations.
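
As a small numerical illustration of the aggregate objective, the sketch below sums the single-image loss over $r = 3$ images. The probabilities are made-up values standing in for what a model might assign to the seven target tokens under each image; the function names are hypothetical.

```python
import math


def nll(target_token_probs):
    """Single-image loss: negative log-probability of the full target sequence."""
    return -sum(math.log(p) for p in target_token_probs)


def persistence_loss(per_image_probs):
    """Aggregate loss: sum of the single-image losses over all sampled images."""
    return sum(nll(probs) for probs in per_image_probs)


# Hypothetical per-token probabilities of T* under three different images.
per_image_probs = [
    [0.9] * 7,  # suffix already works well for this image
    [0.6] * 7,
    [0.2] * 7,  # suffix works poorly here
]
losses = [nll(p) for p in per_image_probs]
total = persistence_loss(per_image_probs)
```

The image on which the suffix performs worst contributes the largest term, so minimizing the sum pushes the optimizer toward suffixes that work across all sampled states.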

Export the final suffix $S^*$ as plain text.
Append to any instruction: "Pick up the cup " + decode(S^*).
Feed to VLA in simulation or real robot → robot executes attacker's predefined action, regardless of actual scene content.

Single-step attack success: 77–97% across 4 OpenVLA fine-tunes (LIBERO benchmark).
Persistence: Attack-induced actions last up to 28× longer with multi-image optimization.
Transferability: Suffixes optimized on OpenVLA often work on unseen architectures (TraceVLA, CogACT, OpenPi0).
