Motivation

A foundational mechanism in many Vision-Language-Action (VLA) models is the alignment of visual and textual representations within a shared latent space $\mathbb{R}^D$ during pre-training. Architectures such as CLIP, OpenCLIP, and SigLIP employ a visual encoder $f_\theta$ and a text encoder $g_\phi$ that project their respective inputs into this common embedding space:

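$$ \begin{align*} \text{Visual encoder} \quad & f_{\theta}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{N \times D} \\ \text{Text encoder} \quad & g_{\phi}: \mathcal{T}^{M} \rightarrow \mathbb{R}^{M \times D} \end{align*} $$

Here an $H \times W$ image with $C$ channels is mapped to $N$ visual token embeddings, and a sequence of $M$ tokens from the vocabulary $\mathcal{T}$ is mapped to $M$ text token embeddings, each of dimension $D$.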

This shared latent space enables effective cross-modal alignment but also introduces a critical vulnerability: adversarial perturbations targeting the visual encoder can disrupt the semantic correspondence with textual instructions, thereby degrading downstream action generation.

Method

Embedding Disruption Patch Attack (EDPA)

EDPA formulates the attack as a joint optimization problem that simultaneously maximizes visual-text misalignment and embedding displacement. The adversarial perturbation $\delta$ is optimized by minimizing the following composite loss:

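$$ \min_{\delta} \mathcal{L}_{\text{EDPA}} = \alpha \cdot \mathcal{L}_{\text{align}} - (1-\alpha) \cdot \mathcal{L}_{\text{patch}} + \lambda_{\text{TV}} \cdot \text{TV}(\delta) $$

where $\alpha \in [0,1]$ balances the two objectives and $\lambda_{\text{TV}}$ weights the total variation regularizer, which keeps the perturbation smooth enough to be physically realizable as a localized patch. The displacement term $\mathcal{L}_{\text{patch}}$ enters with a negative sign because it is a distance the attack seeks to increase, so minimizing $\mathcal{L}_{\text{EDPA}}$ lowers the alignment while enlarging the displacement.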
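The post does not spell out the form of $\text{TV}(\delta)$. A minimal sketch, assuming the common anisotropic definition (the sum of absolute differences between neighboring pixels) and a single-channel patch stored as a list of rows, shows why the regularizer favors smooth patches. The function name `total_variation` is illustrative, not taken from the source.

```python
def total_variation(patch):
    """Anisotropic total variation of a 2D patch given as a list of rows.

    Assumption: TV is the sum of absolute differences between vertically
    and horizontally adjacent pixels; the source does not fix a definition.
    """
    height, width = len(patch), len(patch[0])
    # Differences between each pixel and the one directly below it.
    vertical = sum(
        abs(patch[i + 1][j] - patch[i][j])
        for i in range(height - 1)
        for j in range(width)
    )
    # Differences between each pixel and the one directly to its right.
    horizontal = sum(
        abs(patch[i][j + 1] - patch[i][j])
        for i in range(height)
        for j in range(width - 1)
    )
    return vertical + horizontal


smooth = [[0.5, 0.5], [0.5, 0.5]]  # constant patch, no variation
noisy = [[0.0, 1.0], [1.0, 0.0]]   # checkerboard patch, maximal variation
print(total_variation(smooth), total_variation(noisy))
```

The constant patch scores zero while the checkerboard scores highest, so a larger $\lambda_{\text{TV}}$ steers the optimization toward low-frequency patterns.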

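The alignment loss $\mathcal{L}_{\text{align}}$ measures the cosine similarity between the pooled visual and textual embeddings, where $v_{\text{adv}}$ denotes the clean image $v_{\text{clean}}$ with the patch $\delta$ applied and $t$ is the text instruction: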

$$ \mathcal{L}_{\text{align}} = \text{CosSim}\big( \text{Pool}(f_\theta(v_{\text{adv}})), \, \text{Pool}(g_\phi(t)) \big) $$

By minimizing $\mathcal{L}_{\text{align}}$, the attack actively pushes the adversarial visual representation away from the target text embedding in the shared latent space.
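The alignment term can be sketched in a few lines. This is an illustration only: it assumes $\text{Pool}$ is mean pooling over tokens (the post does not say which pooling is used) and uses toy 2-dimensional embeddings in place of real encoder outputs. The helper names are hypothetical.

```python
import math


def mean_pool(tokens):
    """Average a list of D-dimensional token embeddings into one vector.

    Assumption: Pool is mean pooling; the source leaves it unspecified.
    """
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(visual_tokens, text_tokens):
    """L_align: cosine similarity between pooled visual and text embeddings."""
    return cosine_similarity(mean_pool(visual_tokens), mean_pool(text_tokens))


# Toy stand-ins for g_phi(t), f_theta(v_clean), and f_theta(v_adv).
text_tokens = [[1.0, 0.0], [1.0, 0.0]]
clean_tokens = [[1.0, 0.0], [0.8, 0.2]]
adv_tokens = [[0.0, 1.0], [-0.2, 0.8]]

print(alignment_loss(clean_tokens, text_tokens))  # close to 1: well aligned
print(alignment_loss(adv_tokens, text_tokens))    # below 0: pushed away
```

A successful attack moves the value from near 1 toward 0 or below, which is exactly what minimizing $\mathcal{L}_{\text{align}}$ rewards.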

The patch displacement loss $\mathcal{L}_{\text{patch}}$ measures how far the adversarial embedding has moved from the original embedding, as a squared Euclidean distance:

$$ \mathcal{L}_{\text{patch}} = \left\| \text{Pool}(f_\theta(v_{\text{clean}})) - \text{Pool}(f_\theta(v_{\text{adv}})) \right\|_2^2 $$

Maximizing this term drives the perturbed visual features far from the clean baseline, further destabilizing the model's cross-modal reasoning pipeline.
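To make the interaction of the three terms concrete, the following sketch computes the displacement and combines it with an alignment value and a total variation value into a single scalar. The sign convention is an assumption chosen to match the stated intent: the displacement is subtracted, so that minimizing the objective increases the separation from the clean embedding. The default values of `alpha` and `lambda_tv` are placeholders, not values from the source.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two pooled embeddings (L_patch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def edpa_objective(align, displacement, tv, alpha=0.5, lambda_tv=1e-3):
    """Scalar the attacker minimizes with respect to the patch.

    Assumption: the displacement is subtracted, because the attack wants it
    to grow while the cosine alignment and the TV penalty shrink.
    alpha and lambda_tv defaults are hypothetical.
    """
    return alpha * align - (1.0 - alpha) * displacement + lambda_tv * tv


clean_pooled = [0.9, 0.1]   # toy Pool(f_theta(v_clean))
adv_pooled = [-0.1, 0.9]    # toy Pool(f_theta(v_adv))

displacement = squared_l2(clean_pooled, adv_pooled)
print(displacement)
print(edpa_objective(align=-0.11, displacement=displacement, tv=4.0))
```

With everything else held fixed, a patch that displaces the embedding further yields a lower objective, while a rougher patch (larger TV) yields a higher one.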

Adversarial Fine-Tuning Defense

Conversely, the adversarial fine-tuning defense enhances robustness by encouraging representation invariance under perturbation. It fine-tunes the visual encoder parameters $\theta$ to minimize the discrepancy between clean and adversarial embeddings:

$$ \min_{\theta} \mathcal{L}_{\text{def}} = \mathbb{E}_{v, \delta} \left[ \left\| \text{Pool}(f_\theta(v_{\text{clean}})) - \text{Pool}(f_\theta(v_{\text{adv}})) \right\|_2^2 \right] $$

By minimizing this distance during training, the encoder learns to produce stable, perturbation-invariant features. This directly counteracts the embedding disruption objective while preserving the model's ability to maintain cross-modal alignment under adversarial conditions.
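In practice the expectation in $\mathcal{L}_{\text{def}}$ would be estimated over a minibatch. The sketch below assumes the pooled clean and adversarial embeddings have already been computed and simply averages their squared distances; the gradient step on $\theta$ and the generation of adversarial images are omitted. The function name `defense_loss` is illustrative.

```python
def defense_loss(clean_embeddings, adv_embeddings):
    """Minibatch estimate of L_def.

    Each argument is a list of pooled D-dimensional embeddings, paired by
    index: clean_embeddings[i] and adv_embeddings[i] come from the same image.
    """
    distances = [
        sum((c - a) ** 2 for c, a in zip(clean, adv))
        for clean, adv in zip(clean_embeddings, adv_embeddings)
    ]
    return sum(distances) / len(distances)


clean_batch = [[1.0, 0.0], [0.0, 1.0]]
adv_before = [[0.0, 1.0], [1.0, 0.0]]   # embeddings flipped by the patch
adv_after = [[1.0, 0.0], [0.0, 1.0]]    # idealized invariant encoder

print(defense_loss(clean_batch, adv_before))  # 2.0
print(defense_loss(clean_batch, adv_after))   # 0.0
```

The loss reaches zero only when the encoder maps each patched image to the same pooled embedding as its clean counterpart, which is the invariance the defense aims for.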
