RoboGCG

Attackers can fully control a VLA-driven robot by appending just ~20 optimized text tokens to a normal instruction—no image manipulation, no model access at deployment.

Adversarial Attacks on Robotic Vision-Language-Action Models

Paper: arXiv:2506.03350
Code: GitHub

Motivation

Large language models (LLMs) are known to be vulnerable to adversarial attacks (such as jailbreaking).

Do VLAs built on the LLM architecture inherit these vulnerabilities?

Can attackers manipulate robots into performing arbitrary target actions by means of text prompts?

For a VLA model: [Text Instruction] + [Current Image] → 7-DoF Action Vector

If you can edit the text prompt, you can hijack the robot's behavior.

Method

Predefine a Target Action

  • Attacker selects a desired 7-dimensional action $a^* \in [-1,1]^7$ (e.g., [0.9, -0.3, 0, 0, 0, 0, 1.0] for "move forward + open gripper").
  • Discretize each dimension into 256 bins → map to 7 fixed tokens via the model's Symbol Tuning vocabulary.
  • These 7 tokens $T^* = [\tau_1, ..., \tau_7]$ become the frozen optimization target.
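
The discretization step can be sketched as follows. This is a minimal illustration, assuming uniform bins over $[-1, 1]$ and a hypothetical layout in which the 256 action bins occupy the last 256 IDs of a 32,000-token vocabulary; the actual bin-to-token mapping depends on the model's tokenizer and is not specified in this post.

```python
import numpy as np

N_BINS = 256  # from the post: each action dimension is discretized into 256 bins

def action_to_bins(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action value to a bin index in [0, n_bins - 1]."""
    a = np.clip(np.asarray(action, dtype=np.float64), low, high)
    idx = np.floor((a - low) / (high - low) * n_bins).astype(int)
    # The upper edge (a == high) would land in bin n_bins; fold it into the last bin.
    return np.clip(idx, 0, n_bins - 1)

def bins_to_token_ids(bins, vocab_size=32000, n_bins=N_BINS):
    """ASSUMPTION (hypothetical layout): bin b maps to token id vocab_size - n_bins + b."""
    return (vocab_size - n_bins) + np.asarray(bins)

def bins_to_action(bins, low=-1.0, high=1.0, n_bins=N_BINS):
    """Recover the bin-center action value for each bin index."""
    width = (high - low) / n_bins
    return low + (np.asarray(bins) + 0.5) * width

target_action = [0.9, -0.3, 0, 0, 0, 0, 1.0]   # example target from the post
target_bins = action_to_bins(target_action)
target_tokens = bins_to_token_ids(target_bins)  # the 7 frozen target tokens T*
```

Rounding to bin centers means the recovered action differs from the requested one by at most half a bin width (about 0.004 with 256 bins over $[-1, 1]$).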

Construct the Input Prompt

x_{1:n} = [Natural Language Task] ⊕ [Adversarial Suffix S]
          ↑_______ fixed _______↑    ↑___ optimizable ___↑
  • Base instruction: e.g., "Pick up the red cup."
  • Suffix $S$: 20 randomly initialized tokens (the only attack variables).

Optimize the Suffix via GCG

Loss function (for single-step attack):

$$ \mathcal{L} = -\log \Pr\big(T^* \mid x_{1:n}, z\big) $$

where $z$ = image embedding (frozen, read-only context).
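
Because the action tokens are generated autoregressively, this loss factorizes by the chain rule into $-\sum_{i=1}^{7} \log \Pr(\tau_i \mid x_{1:n}, z, \tau_{<i})$. A minimal sketch of that computation, assuming the per-position logits have already been obtained by teacher forcing (the `logits` array and the 256-entry toy vocabulary below are illustrative stand-ins, not model outputs):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def target_nll(logits, target_ids):
    """Negative log-likelihood of the target token sequence.

    logits: array of shape (num_target_tokens, vocab_size), one row per
            target position, computed with the earlier target tokens fed in.
    target_ids: the token ids T* = [tau_1, ..., tau_7].
    """
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    picked = logp[np.arange(len(target_ids)), np.asarray(target_ids)]
    return float(-picked.sum())

# Toy check: with uniform logits every token has probability 1/256,
# so the loss equals 7 * log(256).
uniform_logits = np.zeros((7, 256))
toy_targets = [243, 89, 128, 128, 128, 128, 255]
uniform_loss = target_nll(uniform_logits, toy_targets)
```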

Greedy Coordinate Gradient (GCG) loop:

  1. Forward pass: compute $\mathcal{L}$ with current suffix $S^{(t)}$ and image $z$.
  2. Continuous relaxation: replace suffix token IDs with their embeddings $E_S$, compute $\nabla_{E_S} \mathcal{L}$ via backprop.
  3. Greedy token swap: for each suffix position, find the vocabulary token whose embedding best aligns with $-\nabla_{E_S} \mathcal{L}$; keep the swap only if it reduces the loss.
  4. Early stop: if the model autoregressively outputs exactly $T^*$, done.
  5. Repeat (~30–110 iterations on average).
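
The loop above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a tiny hand-built "model" (a weighted sum of suffix embeddings fed to a linear head, with a fixed bias standing in for the frozen instruction and image), a single target class instead of seven tokens, and a deterministic variant that tries the top-k gradient-ranked tokens at every position and keeps the best single swap. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K, SUFFIX_LEN, TOP_K = 64, 16, 10, 6, 8   # toy sizes, chosen arbitrarily

E = rng.normal(size=(V, D))       # toy token embedding table
W = rng.normal(size=(K, D))       # toy output head
ctx = rng.normal(size=K)          # frozen context: stands in for instruction + image
pos_w = np.linspace(0.5, 1.5, SUFFIX_LEN) / SUFFIX_LEN  # fixed per-position weights
TARGET = 3                        # toy target class (plays the role of T*)

def logits_of(suffix):
    h = (pos_w[:, None] * E[suffix]).sum(axis=0)
    return W @ h + ctx

def loss_of(suffix):
    """Cross-entropy: -log softmax(logits)[TARGET]."""
    z = logits_of(suffix)
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[TARGET])

def grad_wrt_embeddings(suffix):
    """Gradient of the loss w.r.t. each suffix embedding, shape (SUFFIX_LEN, D)."""
    z = logits_of(suffix)
    p = np.exp(z - z.max())
    p /= p.sum()
    p[TARGET] -= 1.0              # dL/dlogits = softmax - onehot(TARGET)
    dh = W.T @ p                  # dL/dh
    return pos_w[:, None] * dh[None, :]

def gcg(max_iters=200):
    suffix = rng.integers(0, V, size=SUFFIX_LEN)   # random initialization
    history = [loss_of(suffix)]
    for _ in range(max_iters):
        if int(np.argmax(logits_of(suffix))) == TARGET:
            break                 # early stop: the toy model already outputs the target
        grad = grad_wrt_embeddings(suffix)
        scores = -(grad @ E.T)    # (SUFFIX_LEN, V): alignment of each token with -grad
        best_loss, best_swap = history[-1], None
        for pos in range(SUFFIX_LEN):
            for tok in np.argsort(scores[pos])[-TOP_K:]:
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss_of(cand)      # exact re-evaluation of the candidate
                if cand_loss < best_loss:
                    best_loss, best_swap = cand_loss, (pos, tok)
        if best_swap is None:
            break                 # no candidate swap reduces the loss
        suffix[best_swap[0]] = best_swap[1]
        history.append(best_loss)
    return suffix, history

final_suffix, loss_history = gcg()
```

The gradient only ranks candidates; each candidate is re-scored with an exact forward pass before being accepted, which is why the recorded loss never increases from one accepted swap to the next.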

Persistence Attack (Multi-Image Robustness)

To make the suffix work across changing visual inputs:

$$ \mathcal{L}_{\text{persist}} = \sum_{j=1}^{r} -\log \Pr\big(T^* \mid x_{1:n}, z_j\big) $$

  • Sample $r$ image embeddings $\{z_1, ..., z_r\}$ from different robot states.
  • Jointly optimize suffix to minimize aggregate loss → suffix becomes robust to visual perturbations.
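
A minimal sketch of the aggregate objective, assuming some single-image loss function is available. The `toy_single_image_loss` below is a hypothetical placeholder for $-\log \Pr(T^* \mid x_{1:n}, z_j)$, used only so the example runs on its own.

```python
import numpy as np

def persistence_loss(suffix, image_embeddings, single_image_loss):
    """Sum the single-image loss over every sampled image embedding z_j."""
    return sum(single_image_loss(suffix, z) for z in image_embeddings)

def toy_single_image_loss(suffix, z):
    """Hypothetical stand-in for the per-image loss; not the model's real loss."""
    return float(np.sum((np.asarray(suffix, dtype=float) - z) ** 2))

rng = np.random.default_rng(0)
images = [rng.normal(size=4) for _ in range(3)]   # r = 3 toy "image embeddings"
suffix = np.zeros(4)
total = persistence_loss(suffix, images, toy_single_image_loss)
```

Since the gradient of a sum is the sum of the gradients, the same GCG step applies unchanged: the per-image gradients are added before candidate tokens are ranked.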

Deploy

  • Export the final suffix $S^*$ as plain text.
  • Append to any instruction: "Pick up the cup " + decode(S^*).
  • Feed to VLA in simulation or real robot → robot executes attacker's predefined action, regardless of actual scene content.

Results Snapshot

  • Single-step attack success: 77–97% across 4 OpenVLA fine-tunes (LIBERO benchmark).
  • Persistence: Attack-induced actions last up to 28× longer with multi-image optimization.
  • Transferability: Suffixes optimized on OpenVLA often work on unseen architectures (TraceVLA, CogACT, OpenPi0).
