# Leaderboard

We present evaluation results from our own devices for reference. All models were evaluated on Spec-Bench using the same device and testing environment. We report the mean speedup over 3 runs and the mean number of accepted tokens per decoding step (#Mean Accepted Tokens, which is 1.00 for vanilla autoregressive decoding).
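For clarity, here is a minimal sketch of how these two metrics can be computed. The variable names (`baseline_times`, `sd_times`, `new_tokens`, `decoding_steps`) are illustrative placeholders, not Spec-Bench internals:

```python
# Sketch of the two reported metrics; names are hypothetical placeholders.

def speedup(baseline_times, sd_times):
    """Wall-clock speedup of a speculative method over vanilla decoding."""
    return sum(baseline_times) / sum(sd_times)

def mean_accepted_tokens(new_tokens, decoding_steps):
    """Tokens produced per decoding step; equals 1.00 for vanilla
    autoregressive decoding, which emits exactly one token per step."""
    return sum(new_tokens) / sum(decoding_steps)

# Example: speedups from 3 runs are averaged, as in the tables below.
run_speedups = [speedup([12.0], [5.1]), speedup([11.8], [5.0]), speedup([12.1], [5.2])]
print(f"mean speedup over 3 runs: {sum(run_speedups) / len(run_speedups):.2f}x")
```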

❗️It is important to note that speedup rates may differ across devices. For more precise numbers, we recommend evaluating the specific models on your intended devices.

🤔 This is a gentle reminder that while speedup is the primary metric for assessing Speculative Decoding methods, other benefits are worth considering. For example, PLD and Lookahead require no extra parameters, which makes them easier to integrate with a wider range of models.

## Leaderboard on 3090

- Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
- Testing environment: PyTorch 2.0.1 under CUDA 11.8
- Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1 (a minimal sketch of this setup follows the list)
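
As a point of reference, the following sketch reproduces these settings for the vanilla autoregressive baseline (not any speculative method), assuming Hugging Face `transformers` and the public `lmsys/vicuna-7b-v1.3` checkpoint:

```python
# Minimal sketch of the evaluation settings above: greedy decoding,
# FP16 precision, batch size = 1. Vanilla autoregressive baseline only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.3"  # public checkpoint on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 precision
    device_map="auto",
)

# batch size = 1: a single prompt per generate() call
inputs = tokenizer("What is speculative decoding?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
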
| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |

## Leaderboard on A100

- Device: a single NVIDIA A100 GPU (80GB) with 64 CPU cores
- Testing environment: PyTorch 2.0.1 under CUDA 11.4
- Experimental Settings: greedy decoding, FP16 precision, batch size = 1

### Vicuna-7B-v1.3

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.67x | 1.99x | 2.23x | 2.12x | 2.67x | 2.04x | 3.61 | 2.29x |
| Hydra🥈 | 2.45x | 1.94x | 1.79x | 2.03x | 2.49x | 1.77x | 3.24 | 2.09x |
| Medusa🥉 | 2.05x | 1.73x | 1.57x | 1.75x | 2.05x | 1.51x | 2.32 | 1.78x |
| PLD | 1.64x | 1.04x | 2.43x | 1.14x | 1.61x | 1.71x | 1.73 | 1.59x |
| SpS | 1.66x | 1.13x | 1.62x | 1.49x | 1.47x | 1.55x | 2.28 | 1.49x |
| REST | 1.63x | 1.31x | 1.36x | 1.66x | 1.21x | 1.73x | 1.82 | 1.48x |
| Lookahead | 1.40x | 1.14x | 1.19x | 1.24x | 1.55x | 1.09x | 1.66 | 1.27x |

### Vicuna-13B-v1.3

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.68x | 1.96x | 2.44x | 2.04x | 2.70x | 2.23x | 3.64 | 2.34x |
| Hydra🥈 | 2.46x | 1.90x | 1.93x | 1.96x | 2.48x | 1.92x | 3.35 | 2.12x |
| Medusa🥉 | 1.96x | 1.66x | 1.63x | 1.63x | 2.00x | 1.58x | 2.39 | 1.75x |
| SpS | 1.60x | 1.13x | 1.68x | 1.39x | 1.53x | 1.67x | 2.18 | 1.49x |
| PLD | 1.47x | 1.02x | 2.19x | 1.03x | 1.57x | 1.71x | 1.68 | 1.48x |
| REST | 1.52x | 1.17x | 1.37x | 1.53x | 1.19x | 1.55x | 1.82 | 1.38x |
| Lookahead | 1.30x | 1.06x | 1.20x | 1.12x | 1.48x | 1.12x | 1.63 | 1.22x |

### Vicuna-33B-v1.3

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.79x | 2.05x | 2.51x | 2.17x | 2.99x | 2.27x | 3.39 | 2.47x |
| Hydra🥈 | 2.59x | 2.01x | 2.04x | 2.11x | 2.71x | 2.06x | 3.24 | 2.26x |
| Medusa🥉 | 1.98x | 1.73x | 1.64x | 1.66x | 2.07x | 1.62x | 2.33 | 1.79x |
| SpS | 1.75x | 1.28x | 1.76x | 1.53x | 1.69x | 1.68x | 2.01 | 1.61x |
| REST | 1.63x | 1.27x | 1.45x | 1.61x | 1.30x | 1.61x | 1.80 | 1.48x |
| PLD | 1.44x | 1.06x | 2.00x | 1.07x | 1.55x | 1.45x | 1.55 | 1.42x |
| Lookahead | 1.32x | 1.08x | 1.20x | 1.16x | 1.54x | 1.15x | 1.61 | 1.24x |