A VLA on (MacBook) Air
Code:
- phase3
- phase3 experiment
- phase2
- phase2 experiments
- phase1
- phase0 experiments
- phase0 (pytorch-baseline)
April 24 2026
Here we want to run a Vision-Language-Action (VLA) model on my MacBook Air. The trigger was a paper from Dexmal (Ma et al., Running VLAs at Real-time Speed, arXiv 2510.26742) that manages to run π0 (Black et al., π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv 2410.24164), a VLA from Physical Intelligence, at real-time rates (30+ Hz) on an RTX 4090. A fanless laptop has far less compute, but we want to test its limit. Here we record our progress moving from 0.81 Hz to 3.57 Hz, a 4.42× speedup.
| what happened | rate | notes |
|---|---|---|
| Baseline | 0.81 Hz | |
| Phase 1: CoreML with ANE | 1.83 Hz | |
| Phase 2: token parallelization and async scheduling | 3.08 Hz | |
| Phase 3: quantization | 3.57 Hz | |
| Theoretical limit | 6.6 Hz | |
| RTX 4090 (Dexmal realtime VLA) | 36.6 Hz | 10.3× our rate with 16× our compute |
π0 is a 3.3-billion-parameter flow-matching VLA. It takes one or more RGB camera images and a short text instruction, and generates a chunk of future actions. It has three stages:
- Vision (V): SigLIP-So400m turns one (or 2-view / 3-view) image into vision tokens.
- Vision + Language prefix (L): Gemma-1-2B runs bidirectional attention over vision + language tokens to build the joint prefix representation that will be used for generating action.
- Action expert (AE): a Gemma-300M runs 10 Euler steps of flow matching, attending back to the prefix KVs.
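To make the AE stage concrete, here is a minimal sketch of its outer loop: flow matching integrated with Euler steps. The `velocity_fn` here is a stand-in for the Gemma-300M expert attending to the prefix KVs, and the chunk shape is illustrative, not π0's actual configuration.

```python
import numpy as np

# Illustrative shapes only; real chunk length / action dim vary per robot.
CHUNK_LEN, ACTION_DIM, NUM_STEPS = 50, 14, 10

def euler_flow_matching(velocity_fn, num_steps=NUM_STEPS):
    """Integrate a flow-matching velocity field from noise to an action chunk."""
    x = np.random.randn(CHUNK_LEN, ACTION_DIM)   # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)           # one Euler step
    return x                                     # x_1: the action chunk

# Toy velocity field standing in for the action expert.
actions = euler_flow_matching(lambda x, t: -x)
```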
Before optimizing, let us start from first principles with some simple math to ground what is and is not possible. What we can achieve is ultimately determined by the hardware. Compared to an RTX 4090 in a Linux box (roughly 16× our compute and 8× our memory bandwidth), the main differences on a MacBook are:
- It has three compute engines: the CPU, the GPU, and the Apple Neural Engine (ANE).
- All three share unified memory, and with it a single memory-bandwidth budget. Later we will find that even though the ANE is quite efficient (faster than the GPU via MLX for inference tasks), it is a memory monster: when fully active it consumes most of the available bandwidth.
What we can do from here maps directly onto those two points:
- To utilize all the hardware, we need to dispatch the computation to all three devices so that we have maximal combined usage.
- We need to do this under the constraint of the memory bandwidth.
Unlike platforms I am more familiar with, I found it hard to estimate bandwidth utilization for ML inference here, so I calculate from the compute side instead. The TFLOPS numbers below (3.23 GPU, 5.7 ANE, 1.8 CPU) are measured on dense matmuls, not architectural maxima or advertised figures: Apple advertises 38 TOPS for the ANE (INT8 ops), and halving once for fp16 gives ~19 TFLOPS, but we measure ~5.7 on dense matmul. An oversimplified estimate gives a theoretical limit:
$$ \begin{aligned} t_{\min} \;&=\; \frac{F_V + F_L + F_{AE}}{P_\text{GPU} + P_\text{ANE} + P_\text{CPU}} \\[4pt] &=\; \frac{(60 + 1224 + 338)\ \text{GFLOP}}{(3.23 + 5.7 + 1.8)\ \text{TFLOPS}} \;=\; \frac{1622}{10.73}\ \text{ms} \;\approx\; 151 \text{ ms} \;\;\Longrightarrow\;\; 6.6 \text{ Hz} \end{aligned} $$
This is of course naive, but it gives us a ceiling to keep in mind as we go.
Progress
Baseline: 0.81 Hz
We start with a simple PyTorch port of the reference implementation. The major difference is swapping the CUDA backend for Metal Performance Shaders (MPS). The result is about 1.24 seconds per cycle. At this rate the robot replans about once per second, and most of the time is spent building the vision-and-language prefix (the V+L prefix, which we denote L from here on). In particular, if we look inside L, 92% is consumed by the MLP matmuls.
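For reference, the port is conceptually this small. A sketch, not the actual code: the token count and layer size below are illustrative stand-ins sized like a Gemma-2B MLP, the hot spot identified above.

```python
import torch

# The only structural change from the reference CUDA code is the device.
# MPS routes every op to the GPU; the ANE and CPU stay idle (Phases 1-2
# address exactly this).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Stand-in for the hot spot: one Gemma-2B-sized MLP matmul in fp16.
layer = torch.nn.Linear(2048, 16384, bias=False).to(device, torch.float16)
x = torch.randn(970, 2048, dtype=torch.float16, device=device)
with torch.no_grad():
    y = layer(x)
if device.type == "mps":
    torch.mps.synchronize()   # MPS dispatch is async; sync before timing
```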
Two issues stand out:
- The GPU computation itself may not be efficient.
- MPS dispatches everything to the GPU, so the ANE and CPU sit idle the entire cycle.
We address these two issues in Phase 1 and Phase 2 respectively.
Phase 1: CoreML with ANE, 1.83 Hz
Starting from the baseline observation, we want to actually use the ANE instead of the GPU. We drop PyTorch and rewrite the inference in Objective-C, which gives direct access to CoreML and lets us dispatch each stage to the CPU, ANE, or GPU. We measure every stage on every engine:
| stage | ANE | GPU | CPU |
|---|---|---|---|
| V (vision) | 36 ms | 83 ms | 81 ms |
| L (V+L prefix) | 305 ms | 434 ms | 442 ms |
| AE (action expert) | 206 ms | 887 ms | 255 ms |
In isolation, every stage is fastest on the ANE. Understandable, since the ANE is purpose-built for inference while the GPU is more general. We can run all three stages serially on the ANE. This gives V (36 ms) + L (305 ms) + AE (206 ms) = 547 ms per cycle, 1.83 Hz, 2.26× faster.
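Our dispatch happens from Objective-C, but the same idea in Python coremltools looks roughly like this. A sketch only: the `.mlpackage` paths, input names, and variables are placeholders, and compute units are a preference that CoreML may override per-op.

```python
import coremltools as ct

# Ask CoreML to prefer the ANE (with CPU fallback) for each stage.
vision = ct.models.MLModel("V.mlpackage",  compute_units=ct.ComputeUnit.CPU_AND_NE)
prefix = ct.models.MLModel("L.mlpackage",  compute_units=ct.ComputeUnit.CPU_AND_NE)
expert = ct.models.MLModel("AE.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_NE)

# Placeholder inputs; standalone timings from the table above.
vis_out = vision.predict({"image": image})                # 36 ms
kv_out  = prefix.predict({**vis_out, "text": token_ids})  # 305 ms
actions = expert.predict({**kv_out, "noise": noise})      # 206 ms
```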
One tiny trick along the way: on the ANE we work in fp16, but RMSNorm's running sum Σx² overflows for real activations. We rewrite it to scale x by 1/rmax before squaring; the factor cancels in the final division, so the output is unchanged.
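A NumPy sketch of the rewrite, forcing an fp16 accumulator to mimic the ANE (here rmax is the per-row max absolute value; the exact scaling the shipped kernel uses may differ):

```python
import numpy as np

def rmsnorm_fp16(x, weight, eps=1e-6):
    """RMSNorm without fp16 overflow in the sum of squares.

    Pre-scaling by 1/rmax keeps every square <= 1, so the fp16 sum stays
    far below the fp16 max (~65504). The scale cancels exactly:
    (x/rmax) / rms(x/rmax) == x / rms(x).
    """
    rmax = np.maximum(np.abs(x).max(axis=-1, keepdims=True), np.float16(eps))
    xs = (x / rmax).astype(np.float16)      # |xs| <= 1: safe to square
    ms = np.mean(xs * xs, axis=-1, keepdims=True, dtype=np.float16)
    return (xs / np.sqrt(ms + eps)) * weight

x = np.random.randn(4, 2048).astype(np.float16) * 100   # naive x*x -> inf
w = np.ones(2048, dtype=np.float16)
y = rmsnorm_fp16(x, w)
```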
Phase 2: token parallelization and async scheduling, 3.08 Hz
Phase 1 runs all three stages on the ANE while the GPU and CPU sit idle. To fix that, we introduce two ideas: token parallelization and async scheduling.
Token parallelization: split L across ANE and GPU
The bidirectional attention forces a synchronization point (every attention layer needs all token embeddings to be present), but the MLP between attentions is pointwise across tokens. So the MLP rows for different tokens can run on different engines in parallel, as long as they rejoin before the next attention.
We split L's MLP at the token dimension so both engines work in parallel, with a semaphore syncing the two halves at each of L's 18 layers. This drops L standalone from 305 to 245 ms.
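A toy version of the split, with Python threads standing in for the two engines (the 0.6 split ratio below is arbitrary; in practice the share is tuned to the engines' relative speed):

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((256, 1024)).astype(np.float32)
W2 = rng.standard_normal((1024, 256)).astype(np.float32)

def mlp(x):                                  # stand-in for one layer's MLP
    return np.maximum(x @ W1, 0.0) @ W2

def split_mlp(x, frac=0.6):
    """Token-parallel MLP: no cross-token interaction, so each engine
    can take a slice of the tokens. `frac` is the faster engine's share."""
    n = int(len(x) * frac)
    out = np.empty((len(x), W2.shape[1]), dtype=x.dtype)
    done = threading.Semaphore(0)

    def run(sl):                             # one "engine" per thread
        out[sl] = mlp(x[sl])
        done.release()

    threading.Thread(target=run, args=(slice(0, n),)).start()     # "ANE"
    threading.Thread(target=run, args=(slice(n, None),)).start()  # "GPU"
    done.acquire(); done.acquire()           # rejoin before next attention
    return out

tokens = rng.standard_normal((970, 256)).astype(np.float32)
y = split_mlp(tokens)
```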
Async scheduling: overlap stages across frames
We decide to move AE off the ANE and onto the CPU (AE on CPU is 24% slower than AE on the ANE standalone, but it frees the ANE for V and L) so that it can run concurrently with L. We then overlap across frames: AE of frame n−1 runs concurrently with V and L of frame n, so the cycle time becomes the slowest single engine rather than the sum. Concurrency does jam the memory bandwidth: AE on the CPU inflates from 255 ms standalone to 330 ms.
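The scheduling is easiest to see as a two-stage pipeline. A toy sketch with the Phase 2 timings baked in as sleeps:

```python
import queue, threading, time

prefix_q = queue.Queue(maxsize=1)           # V+L(frame n) hands off to AE

def build_prefix(frame):                    # stand-in for V+L on ANE+GPU
    time.sleep(0.281)                       # 36 ms (V) + 245 ms (L)
    return frame

def run_action_expert(prefix):              # stand-in for AE on CPU
    time.sleep(0.330)                       # 330 ms under concurrency
    return prefix

def vl_loop(n):
    for frame in range(n):
        prefix_q.put(build_prefix(frame))

threading.Thread(target=vl_loop, args=(10,), daemon=True).start()
t0 = time.time()
for _ in range(10):
    run_action_expert(prefix_q.get())       # AE(n-1) overlaps V+L(n)
print(f"{10 / (time.time() - t0):.2f} Hz")  # ~ slowest engine, not the sum
```

As the frame count grows, the printed rate approaches the slowest stage (about 3 Hz here) rather than the serial sum (about 1.6 Hz).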
Phase 3: quantization, 3.57 Hz
Phase 2 made the memory-bandwidth ceiling visible: each stage takes longer under concurrency than standalone, which is the signature of a bus-bound workload. From here we have two options:
- Move to a higher-tier chip with more bandwidth.
- Reduce the weight-byte footprint through pruning, distillation, or quantization.
We pick quantization. We quantize each stage's weights in the format that suits its engine:
| stage | engine | quantization |
|---|---|---|
| V | ANE | 4-bit k-means palettization, group = 16 |
| L (ANE half) | ANE | 4-bit k-means palettization, group = 16 |
| L (GPU half) | GPU | linear per-block 4-bit, block = 32 |
| AE | CPU | linear per-channel 8-bit |
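For flavor, a from-scratch sketch of the palettization idea in pure NumPy, not the CoreML implementation (a real pipeline would use coremltools' optimization utilities, and the convention below that a group is 16 output channels sharing one lookup table is our assumption):

```python
import numpy as np

def palettize_4bit(w, group_size=16, iters=25):
    """4-bit k-means palettization: each group of `group_size` output
    channels shares one 16-entry lookup table (LUT) of centroids;
    storage would keep the LUT plus a 4-bit index per weight. Here we
    return the dequantized weights so the error is easy to inspect."""
    out = np.empty_like(w)
    for g0 in range(0, w.shape[0], group_size):
        blk = w[g0:g0 + group_size]
        g = blk.ravel()                                  # weights sharing a LUT
        lut = np.quantile(g, np.linspace(0, 1, 16))      # init 16 centroids
        for _ in range(iters):                           # Lloyd's algorithm
            idx = np.abs(g[:, None] - lut[None, :]).argmin(axis=1)
            for k in range(16):
                if (idx == k).any():
                    lut[k] = g[idx == k].mean()
        out[g0:g0 + group_size] = lut[idx].reshape(blk.shape)
    return out

w = np.random.randn(64, 256).astype(np.float32)
wq = palettize_4bit(w)
print("mean |error|:", np.abs(wq - w).mean())
```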
With fewer weight bytes to move, L and AE both get faster and the Phase 2 async slack tightens. We arrive at 280 ms per cycle, 3.57 Hz. That is 4.42× the baseline, and only 10.3× behind Dexmal on a 4090 with 16× the compute and 8× the memory bandwidth.
Discussion: the cost of bidirectional attention in the V+L prefix
Aside from the memory-bandwidth issue, which is by now a boring refrain, we do take away some lessons on how architecture design affects inference performance. The most important finding is that the V+L prefix (stage L) dominates the compute; every phase in this post is in some sense a workaround for this.
L runs from scratch every frame because π0 inherits PaliGemma's prefix-LM design: vision tokens and language tokens sit in a single bidirectional block inside the Gemma-2B backbone. Every language token attends to every vision token and vice versa, so the language tokens' KVs are not a function of the language tokens alone; they are entangled with the vision tokens. When the camera image changes on the next frame, the entire V+L prefix must be recomputed.
A causal prefix is a totally different story. If the text tokens sit in their own causal block (text first, then image tokens attending back to text), the language KVs are a pure function of the (stable) instruction and can be cached across frames indefinitely. Only the vision tokens need recomputation each frame, and even those become incrementally cacheable if the image encoder can be made to skip identical or stale regions. In our numbers L is 50–60% of the serial cycle, so amortising the language half across frames and paying only the vision half each frame would roughly halve per-frame latency, maybe a 2–3× speedup depending on how much of V is also reusable.
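A toy illustration of the dependency structure (`build_kv` is a stand-in for a real KV build; all that matters for caching is what each token can attend to):

```python
import numpy as np

D = 64
def build_kv(tokens, ctx=None):
    """Toy KV build: each token's KV mixes in whatever it attends to."""
    seen = tokens if ctx is None else np.concatenate([ctx, tokens])
    return tokens + seen.mean(axis=0)

text = np.random.randn(20, D)

# Bidirectional prefix (π0 / PaliGemma): language KVs depend on the
# image, so the whole V+L prefix rebuilds every frame.
for _ in range(3):
    image = np.random.randn(256, D)
    prefix_kv = build_kv(np.concatenate([image, text]))   # all 276 tokens

# Causal prefix (text first): language KVs never see the image, so they
# are computed once; only vision work recurs per frame.
lang_kv = build_kv(text)                     # once per instruction
for _ in range(3):
    image = np.random.randn(256, D)
    vis_kv = build_kv(image, ctx=text)       # the only per-frame work
    prefix_kv = np.concatenate([lang_kv, vis_kv])
```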
The idea is not new for causal vision models. Eventful Transformers (Dutson, Li, and Gupta, Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers, ICCV 2023, arXiv 2308.13494) identify tokens whose pixels changed significantly between frames and re-process only those, reusing cached activations for the rest.
We can broadly categorise published VLAs into two families:
- PaliGemma family (bidirectional V+L prefix): π0, π0.5, π0-FAST, SpatialVLA, and Dexmal's realtime-VLA. All inherit PaliGemma's bidirectional prefix-LM mask.
- Qwen / Llama family (causal prefix): ChatVLA, Dexbotic, Vlaser (Qwen2/2.5/3-VL backbones), GR00T N1.7 (Qwen3-VL VLM + DiT action head), CogACT (Llama-2 + DiT). All use a causal, autoregressive decoder inherited from a more standard autoregressive VLM backbone.
Now it is arguable that bidirectional attention does improve task performance, but I suspect whatever it buys could be recovered with a more powerful backbone and more parameters. Even with many more parameters, an autoregressive backbone would be much more efficient at inference time, because the language KVs can be cached across frames. The MacBook Air is not a powerful machine, and that is fair. But a large fraction of the remaining gap comes from an architecture-level choice, not a silicon-level one. Performance-wise, that is worth keeping in mind when designing VLAs.