Poster 197

Visual Prompting Reimagined: The Power of Activation Prompts

Yihua Zhang · Hongkang Li · Yuguang Yao · Aochuan Chen · Shuai Zhang · Pin-Yu Chen · Meng Wang · Sijia Liu

Abstract

Visual prompting (VP) has emerged as a popular method to repurpose large pretrained models for downstream vision tasks. Unlike many parameter-efficient finetuning (PEFT) techniques that modify model parameters, VP introduces a universal perturbation directly into the input data to facilitate task-specific finetuning while keeping the pretrained model intact. However, there is a noticeable performance gap between VP and conventional finetuning methods, which points to open questions, in both theory and practice, about how to understand and advance VP so as to close that gap. To this end, we introduce a novel concept, termed activation prompt (AP), which extends the scope of input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. With the aid of AP, we show that VP, by its input perturbation design, has intrinsic limitations in both performance and efficiency. By contrast, AP shares a natural connection to normalization tuning, e.g., batch normalization for convolutional neural networks (CNNs) and layer normalization for vision transformers (ViTs). This explains why normalization tuning has been observed in the literature to achieve better accuracy than VP. Furthermore, we show that the choice of prompting exhibits a distinct preference for layer depth, with conclusions varying significantly between CNNs and ViTs. We theoretically elucidate the rationale behind this preference by analyzing global features across layers. By conducting extensive experiments across 29 datasets and various model architectures, we provide a thorough performance analysis of AP, comparing it with VP and PEFT baselines. Our experimental results demonstrate that AP significantly surpasses input-level VP in terms of both accuracy and efficiency, considering factors like time, parameters, memory usage, and throughput. These results further support our new insights into the incapabilities of VP and the capabilities of AP.
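The difference between input-level VP and AP described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer network, its weights, and the names `linear_relu`, `forward_vp`, `forward_ap`, `delta_in`, and `delta_act` are all hypothetical, chosen only to show where the universal perturbation enters a frozen model in each case.

```python
# Toy illustration only (assumed setup, not the paper's code):
# a frozen two-layer network on plain Python lists.

def linear_relu(weights, x):
    """Frozen layer: computes ReLU(W x) for a weight matrix given as rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def add(x, delta):
    """Element-wise addition of a perturbation (the 'prompt')."""
    return [a + b for a, b in zip(x, delta)]

# Hypothetical frozen "pretrained" weights; these are never updated.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 1.0]]

def forward_vp(x, delta_in):
    """Input-level visual prompt: perturb the input, then run the frozen model."""
    h = linear_relu(W1, add(x, delta_in))
    return linear_relu(W2, h)

def forward_ap(x, delta_act):
    """Activation prompt: perturb the activation map after the first layer."""
    h = add(linear_relu(W1, x), delta_act)
    return linear_relu(W2, h)

x = [2.0, 1.0]
zero = [0.0, 0.0]

# With a zero prompt both variants reduce to the unmodified frozen model.
print(forward_vp(x, zero), forward_ap(x, zero))

# The same perturbation vector has a different effect depending on where it is applied.
print(forward_vp(x, [1.0, 0.0]))
print(forward_ap(x, [1.0, 0.0]))
```

In both variants only the perturbation vector would be trained while `W1` and `W2` stay fixed. In this sketch the activation prompt acts as an additive shift on an intermediate feature, which loosely resembles the learnable shift term of a normalization layer; that reading is offered only as an intuition for the connection to normalization tuning stated in the abstract, not as the paper's formal argument.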