DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
Abstract
Speculative decoding is an effective technique for accelerating large language model (LLM) inference by drafting multiple tokens in parallel. However, its practical speedup is often limited by a rigid verification step, which strictly enforces that the accepted token distribution exactly matches that of the target model. This constraint leads to the rejection of many plausible tokens, reducing the acceptance rate and hindering overall efficiency. To overcome this limitation, we propose DIVERSED (DynamIc VErification RElaxed SpEculative Decoding), a relaxed verification framework that improves efficiency while preserving generation quality. DIVERSED introduces a learned ensemble-based verifier that blends the draft and target model distributions using dynamic mixing weights. This mixture distribution serves as a more flexible verification target, increasing token acceptance without compromising overall correctness. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves significantly higher inference efficiency than traditional speculative decoding methods.
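The mixture-based verification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the rejection-resampling scheme (accept with probability min(1, m(x)/q(x)), resample from the normalized residual), and the assumption that the mixing weight `lam` comes from a learned verifier are all illustrative choices layered on the standard speculative-decoding acceptance rule.

```python
import numpy as np

def relaxed_verify(p, q, token, lam, rng):
    """One relaxed verification step (illustrative sketch).

    p:     target-model distribution over the vocabulary
    q:     draft-model distribution over the vocabulary
    token: the drafted token id to verify
    lam:   dynamic mixing weight in [0, 1] (assumed to be produced
           by the learned ensemble verifier; hypothetical here)
    """
    # Ensemble verification target: a blend of target and draft distributions.
    m = lam * p + (1.0 - lam) * q
    # Accept the drafted token with probability min(1, m(token) / q(token)),
    # mirroring the standard speculative-decoding acceptance rule but with
    # the mixture m in place of the target p.
    accept_prob = min(1.0, m[token] / q[token])
    if rng.random() < accept_prob:
        return token
    # On rejection, resample from the normalized residual max(m - q, 0),
    # so that accepted tokens are distributed according to m overall.
    residual = np.maximum(m - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

Note the two limiting cases: with `lam = 1.0` the mixture collapses to the target distribution and the step reduces to standard (strict) speculative verification, while with `lam = 0.0` every drafted token is accepted, since the verification target equals the draft distribution.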