Cheap (and Good) Rewards from a VLM
Jialin Lu · February 27, 2026
Code available at luxxxlucy/vla-reward. TL;DR: Attempts and experiments to get cheap and good dense rewards from a VLM for robotic manipulation tasks. After some experiments, I believe that for this task (getting rewards from a VLM), and perhaps for VLA tasks more broadly, current VLMs over-invest in advancing the language part and under-invest in the vision part. Since the trend seems to be continuing in that direction, we may instead need VLMs with a larger vision encoder and only a slightly larger language model.
I recently read the TOPReward paper (out just a few days ago)TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics (Chen, Harrison, Lee et al., February 2026). It finds a remarkably simple way to get dense rewards for robotic manipulation: just read one token probability from a VLM, with no training needed. It is surprising how well this works with such a simple setup.
Today’s robotic manipulation systems—from neural-network-based controllers to the newer Vision-Language-Action models (VLAs)—are often adapted to new tasks using reinforcement learning (RL). But RL needs a dense, well-shaped reward signal to work well. Manual annotation, under the simpler collection schemes, typically gives only sparse rewards (a single success/fail at the end, or at a few key moments), which makes learning slow and sample-inefficient.Sparse rewards leave the learner with no gradient signal for most of the trajectory. Reward shaping and engineering are the workarounds. What we need are dense rewards: a signal at every timestep saying whether the robot is getting closer to or further from the goal. The alternative is to engineer reward heuristics by hand from the sparse signal (say, at the end of the episode), but that is brittle and task-specific.
What if we could get dense rewards from a
pretrained model—one that some lab already spent enormous resources
building—for free?
Previous approaches use VLMs by prompting them to generate a
numeric progress score as text—for instance, GVLGVL: Vision Language Models are In-Context Value Learners (Mees et al., 2024) asks GPT-4o to rate task progress as text.
asks GPT-4o to rate progress from 1 to 10.
Note that this means generating, as text, the tokens of a numeric value.
Models are poorly suited to this; they produce vague or inconsistent numbers.
TOPReward uses a much simpler idea: instead of asking the model to
write the reward as (possibly multiple) tokens representing a numeric value, it reads the VLM's output layer at a particular token (" True"), using
the probability of generating this token as a better-calibrated reward signal:
Input 1: the video of the robot trajectory
Input 2: the prompt: "The above videos show a robot manipulation trajectory that completes the following task: {task_description}. Decide whether the above statement is True or not. The answer is:"
Reward: log P(" True")
The higher this probability, the more the model judges the video to be making progress toward the goal in the task description. Do this for several video prefixes sampled along the trajectory, and you get a dense reward curve that should rise as the task progresses. This seems too good to be true: an open-source model, no training, a dead-simple setup. So I tried it.
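The scoring step reduces to a log-softmax lookup. Here is a minimal sketch in plain Python, assuming we already have the next-token logit vector that the VLM produces after the prompt (the model call itself is out of scope here, and the toy logits are made up):

```python
import math

def top_reward(logits: list[float], true_token_id: int) -> float:
    """TOPReward score: log P(' True') under a numerically stable
    log-softmax of the next-token logits."""
    m = max(logits)  # subtract the max to stabilize the exponentials
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[true_token_id] - log_z

# Toy 5-token vocabulary; pretend index 3 is the " True" token.
logits = [0.1, -1.2, 0.3, 4.0, 0.2]
reward = top_reward(logits, true_token_id=3)  # close to 0 => confident "True"
```

Run this once per sampled video prefix and each call yields one point of the dense reward curve.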
How well does it work
I collected several robot manipulation videos and ran TOPReward with Qwen3-VL models.All inference via MLX. Some implementation caveats: the token is " True" with a leading space (ID 3007 in Qwen), not "True" (ID 2514); also use_chat_template=False is critical.
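The leading-space caveat is easy to trip over. A toy illustration (the dict below is a made-up mini-vocabulary; only the two Qwen IDs mentioned above are real):

```python
# In BPE-style vocabularies, "True" and " True" (space-prefixed) are
# distinct tokens. After "...The answer is:" the model continues with
# the space-prefixed one.
vocab = {"True": 2514, " True": 3007}  # real Qwen IDs; the dict itself is a mock

def answer_token_id(vocab: dict, word: str) -> int:
    # Always look up the space-prefixed form of the answer word.
    return vocab[" " + word]

tid = answer_token_id(vocab, "True")  # 3007, not 2514
```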
Below is what TOPReward produces in practice.
For a video of a successful demonstration, the ideal reward should increase monotonically as the robot makes progress. The metric that captures this is VOC (Value-Order Correlation): the Spearman rank correlation between reward values and their position in the trajectory.VOC = 1.0 means the reward increases at every step; 0.0 is random noise. The original paper reports VOC = 0.857 on Open X-Embodiment (780 episodes) with Qwen3-VL-8B. For comparison, GVL, which asks GPT-4o to generate progress text, gets only 0.194. I tested on 8 videos covering tasks of varying difficulty: fold towel, remove cap, put block/marker/pen into cup, stack cubes (two sources), and ride bike. The 8B model performs consistently across these tasks (mean VOC 0.916, std 0.060). The 2B model is far more variable (mean 0.766, std 0.244): fine on easy tasks like marker→cup (0.970), but near-random on hard ones like pen→cup (0.215). The gap is driven almost entirely by the difficult cases.
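VOC is cheap to compute: rank the rewards, then correlate the ranks with the timestep index. A dependency-free sketch (no tie handling, which is fine for continuous log-probabilities):

```python
def voc(rewards: list[float]) -> float:
    """Value-Order Correlation: Spearman rank correlation between the
    reward values and their position in the trajectory."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r                    # rank of the reward at timestep i
    t = list(range(n))                 # timestep indices 0..n-1
    mean_r, mean_t = sum(rank) / n, sum(t) / n
    cov = sum((rank[i] - mean_r) * (t[i] - mean_t) for i in range(n))
    var_r = sum((x - mean_r) ** 2 for x in rank)
    var_t = sum((x - mean_t) ** 2 for x in t)
    return cov / (var_r * var_t) ** 0.5

# A monotonically rising reward curve scores VOC = 1.0;
# a monotonically falling one scores -1.0.
score = voc([-3.1, -2.0, -1.4, -0.2])
```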
Using P("True") − P("False") instead of just P("True")
When I first read the paper, a natural idea came to mind: a forward pass gives us not only log P("True") but also log P("False"). Why not subtract one from the other? The intuition is that if the model is uncertain, both probabilities are low and the difference cancels out; if the model is confident about completion, P("True") dominates and the difference is large. It costs nothing extra. Unfortunately, this idea does not work well. Of the 7 videos evaluated, five get worse:
| Video | Baseline | Contrastive | Δ |
|---|---|---|---|
| Fold towel | 0.680 | 0.394 | −0.29 |
| Remove cap | 0.771 | −0.489 | −1.26 |
| Block → cup | 0.920 | 0.604 | −0.32 |
| Marker → cup | 0.970 | 0.970 | 0.00 |
| Stack cubes (LR) | 0.884 | 0.825 | −0.06 |
| Pen → cup | 0.215 | 0.467 | +0.25 |
| Stack cubes (RS) | 0.924 | 0.867 | −0.06 |
| Mean | 0.766 | 0.520 | −0.25 |
The remove_cap reward curve even goes negative (−0.489). The hope was that subtracting P("False") would calibrate the signal, but it does not. My guess is that in a 150k-token vocabulary, "True" and "False" are just two tokens among thousands of plausible continuations ("Yes", "No", "Correct", "It", …). P("False") is not the complement of P("True"); it is another noisy signal, and subtracting noise from noise gives more noise. This is such a natural idea that I suspect the TOPReward authors tried it too and quietly dropped it. I would have expected it to help, but the numbers say otherwise.
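For completeness, the contrastive variant can be sketched as follows, assuming the difference is taken in log space (one plausible reading). Note that the log-probability difference conveniently reduces to a raw logit difference, since the log-partition term cancels:

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_reward(logits: list[float], true_id: int, false_id: int) -> float:
    """log P(' True') - log P(' False').

    Algebraically equal to logits[true_id] - logits[false_id],
    because the shared log-partition term cancels."""
    lp = log_softmax(logits)
    return lp[true_id] - lp[false_id]
```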
Oh ensemble, it always works (and it did)
The key design choice now is how we phrase the prompt. The space of possible prompts is large, and somewhere in it there is surely a distribution of good prompts. Since nobody knows which phrasing works best for a given task, this becomes a manual, human-driven prompt-improvement process. Aside from hunting for the single best prompt, a natural move is to try several and average the results, i.e., an ensemble. If individual prompts are noisy but we trust that the human-sampled ones are roughly unbiased, averaging them reduces variance.The standard ensemble variance-reduction argument: if N estimators are noisy but unbiased with uncorrelated errors, averaging reduces variance by ~1/N.
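Concretely, the ensemble is nothing more than a per-timestep mean over the per-prompt reward curves. A sketch with made-up numbers (three prompt phrasings, five video prefixes):

```python
# Hypothetical per-prompt log P(" True") curves for one trajectory.
curves = [
    [-2.1, -1.8, -1.2, -0.9, -0.3],   # prompt A
    [-2.5, -2.4, -1.1, -1.0, -0.2],   # prompt B
    [-1.9, -2.0, -1.5, -0.7, -0.4],   # prompt C
]

# Average across prompts at each timestep; if the per-prompt errors are
# roughly uncorrelated, the variance of the mean drops by about 1/N.
ensemble = [sum(col) / len(col) for col in zip(*curves)]
```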
Across all 7 videos, even though there is a huge gap between the small model (Qwen3-VL 2B) and the larger one (Qwen3-VL 8B), an N-prompt ensemble (N = 3) closes 62% of the 2B→8B gap (mean VOC: 2B = 0.766, ensemble = 0.858, 8B = 0.916). Per task: on fold towel, the ensemble (0.985) even exceeds 8B (0.937). On block→cup, 2B alone is already good enough. On pen→cup, the 2B baseline is near-random (0.215); the ensemble helps (0.455) but does not approach 8B (0.957). Some tasks simply remain hard for a small model.
The 3-prompt ensemble brings a sizable improvement, but naively it also means N forward passes, i.e., N times the cost. A quick optimization: the vision encoder only needs to run once; the cached embeddings can be reused, and only the language decoder reruns for each prompt. This collapses the 3-prompt overhead to roughly 2×. Pushing to 10 prompts is where it gets interesting: the cached N=10 ensemble costs about 11× baseline, nearly the same as a single 8B run (11.7×). You can run one large model or ten diverse prompts on a small model for roughly the same price, and the ensemble gives better calibration.There should be more optimization opportunities: batching prompts, quantizing the language decoder more aggressively, or distilling the ensemble into learned soft tokens. The actual numbers (fold towel, 2B, 10 prefixes) are below.
| Configuration | Vision | Language | Total | vs baseline |
|---|---|---|---|---|
| 2B single prompt (cached) | 30.5s | 27.7s | 58.2s | 1.0× |
| 2B ensemble 3 (cached) | 32.1s | 93.1s | 125.2s | 2.1× |
| 2B ensemble 10 (cached) | 65.3s | 574.7s | 640.1s | 11.0× |
| 8B single prompt (cached) | 70.2s | 613.2s | 683.4s | 11.7× |
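The caching trick behind these numbers can be sketched structurally. The encoder and decoder below are stand-in stubs (the real calls depend on the inference stack, MLX in my case); the point is only where the expensive call sits:

```python
def encode_frames(frames):
    """Stub for the expensive vision encoder; returns reusable embeddings."""
    return ("vision-embeddings", len(frames))

def decode_log_p_true(vision_cache, prompt):
    """Stub for one language-decoder pass; placeholder score for illustration."""
    return -len(prompt) / 100.0

def ensemble_reward(frames, prompts):
    cache = encode_frames(frames)  # runs ONCE for the whole ensemble
    # One (comparatively cheap) decoder pass per prompt, reusing the cache.
    scores = [decode_log_p_true(cache, p) for p in prompts]
    return sum(scores) / len(scores)
```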
More experiments on prompts
To explore the prompt space further, I wrote 10 variants and tested each individually. All prompts end with "The answer is:" so that the next token is True/False. {task} is replaced with the instruction (e.g., "Put the pen into the cup.").
| # | Prompt text |
|---|---|
| 0 | The above images show a robot manipulation trajectory that completes the following task: {task}. Decide whether the above statement is True or not. The answer is: |
| 1 | The above images show a robot attempting the task: {task}. Based on the images, has the robot successfully completed this task? True or False. The answer is: |
| 2 | Looking at this sequence of robot images, the task '{task}' has been completed. Is this statement True or False? The answer is: |
| 3 | The robot in these images is performing: {task}. The task is now finished and the goal has been achieved. True or False? The answer is: |
| 4 | These images depict a robot trajectory. The intended task was: {task}. The desired outcome has been reached. True or False. The answer is: |
| 5 | Robot task: {task}. The above images show this task being completed. True or False? The answer is: |
| 6 | The above robot trajectory shows progress on: {task}. The task has been fully completed by the end of the sequence. True or False? The answer is: |
| 7 | Please confirm: the robot manipulation shown above successfully completes the task '{task}'. True or False? The answer is: |
| 8 | The above images show a robot. The goal is: {task}. The goal state has been achieved in these images. Answer True or False. The answer is: |
| 9 | Examining the robot images above, there is no failure in completing the task: {task}. The task succeeded. True or False? The answer is: |
Unsurprisingly, the 3 prompts I came up with are not individually the best prompt for any task.How surprising is that :) The spread across prompts on a single video is wider than the entire 2B→8B gap. On pen→cup, the original prompt gives VOC = 0.215 (nearly random), while prompt #9 ("there is no failure in completing the task") gives 0.952, matching the 8B model. Prompt wording, not model size, is doing most of the work. No single prompt dominates: the best for pen→cup (#9) is mediocre on fold towel, and vice versa.
I also tried selecting the top-3 prompts by mean VOC on 3 videos and running them on all 7. The result: mean VOC of 0.738, worse than the default 3-prompt ensemble at 0.858. This makes sense: for an ensemble to work well, the prompts need to be diverse so their errors are uncorrelated. Picking the individually best prompts reduces that diversity, making the ensemble more correlated and thus less effective. Random diversity beats careful selection.Beyond manual prompt search, automated approaches like DSPy's MIPROv2 could search over prompt text using VOC as the objective. Prefix tuning (Li & Liang, ACL 2021), which learns continuous embedding vectors on a frozen VLM, could sidestep the discrete vocabulary entirely and might give the best results, but it incurs training cost. These are worth exploring though.
What I found, and where this could go
The finding that surprised me most is how little model size matters relative to prompt phrasing. Switching prompts on the 2B model spans VOC from 0.21 to 0.95 on the hardest task; with better prompts, the entire 2B→8B gap could be closed. Automated prompt search (e.g., DSPy's MIPROv2) or prefix tuning could push this further; I have not tried these yet. Even without that, a simple 3-prompt ensemble already closes 62% of the gap.
| Method | Mean VOC | Note |
|---|---|---|
| 8B baseline | 0.916 | Consistent. ~12× cost of 2B. |
| 2B ensemble (3 prompts) | 0.858 | 62% gap closure. ~2× cost. |
| 2B baseline | 0.766 | Too variable on hard tasks. |
| Contrastive 2B | 0.520 | Subtracting P("False") adds noise. |
But the more interesting observation is why this works. Consider what the model actually does: it looks at video frames and produces a single True/False logit. Just basic visual understanding and a simple language completion. Note how these models are trained: frontier labs mostly invest in the language part of the model so that high-end abilities emerge, such as reasoning, long context, and multi-round conversation.
But for rewards, we probably do not need that. The task at hand, extracting a reward, does not require such high-end emergent capabilities. My understanding is that for assessing task completion from video, the hard work should be in the vision part (the vision encoder), understanding what is happening; the language decoder is barely exercised beyond basic sequence completion (though vision-language alignment also matters).
This is perhaps not surprising from a biological perspective: in humans, roughly 30% of the cerebral cortex is devoted to visual processing, far more than any other sensory modality. Understanding a scene from video is fundamentally harder than completing a sentence. So naturally, I think the vision part of VLMs and VLAs should carry more of the weight.
However, modern VLMs are not designed with this balance in mind and it seems increasingly the focus has been on the language part. If we look at the parameter breakdown:
- Qwen3-VL: the 2B and 8B models share similar vision encoders (SigLIP2, ~300M vs ~400M) while the language decoder scales from 1.5B to 7.5B. At 72B, the vision encoder is still only ~0.7B—a 1:103 ratio.
- LLaVA-1.5: ~0.3B vision (CLIP ViT-L), 7–13B language.
- Idefics2: ~0.4B vision (SigLIP), 7B language.
The pattern is clear: vision stays small while language scales 10–100×. This makes sense for general-purpose chatbots that need reasoning, long context, and multi-round conversation. But for producing a single True/False logit from video, most of that language capacity is wasted. A few models buck this trend: InternVL pairs a 6B InternViT with a 20–70B LM (1:3 to 1:12 ratio), and PaLI-X pairs a 22B ViT with a 32B LM (nearly 1:1.5). I think more models should follow suit.
The following diagram shows this contrast conceptually: the model we have today, where frontier VLMs are heading, and what we might actually want for reward extraction.
For training-free VLM rewards, we do not need a powerful reasoning backbone—basic language understanding suffices. What matters is good visual scene understanding. A model that invests more compute in vision and less in language could push quality further at lower total cost. Seven videos is not a large evaluation, and I only tested one model family; but the pattern is clear enough to be worth following.
Epilogue
I plan to revisit this idea if I have time, and if I can get my hands on the resources. Stay tuned.
Separately, while doing this research, Running VLAs at Real-time Speed (Ma et al., 2025) caught my eye—showing that pi0-level VLAs can hit 30 Hz on a single consumer GPU. Making robotics affordable and accessible is a theme I want to keep exploring too.