A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge and common-sense reasoning of VLMs to guide a grounded, steerable driving policy toward safe and robust control. To this end, we propose SteerVLA, which uses reasoning from VLMs to produce fine-grained language instructions that steer a vision–language–action (VLA) driving policy. Key to our method is the language interface between the high-level VLM and low-level VLA, which allows the high-level policy to better ground the nuances of its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with dense language annotations in hindsight, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset.
We focus on the challenge of long-tail driving scenarios, where rare and unanticipated events require strong generalization and common-sense reasoning from the policy. VLAs are a strong backbone for driving because they combine semantic grounding from vision–language pretraining with domain-specific adaptation obtained via imitation learning on driving data. Building on this capability, we leverage the reasoning and semantic inference abilities of VLMs and ground these inferences in driving control through fine-grained meta-actions that steer a VLA policy. Concretely, a high-level policy first reasons about the driving scene, historical vehicle states, and routing command to produce a meta-action, accompanied by a short reasoning trace over the driving scene that helps the policy generate more appropriate meta-actions. A steerable low-level VLA policy then executes this meta-action by predicting a set of waypoints that determine the vehicle's target speed and position.
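To make the language interface concrete, below is a minimal Python sketch of this two-stage loop. It is illustrative only: `HighLevelVLM`, `LowLevelVLA`, `drive_step`, and the waypoint format are hypothetical placeholders, not the released SteerVLA API, and the paper's actual models, prompts, and action representation may differ.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MetaAction:
    """Fine-grained language instruction from the high-level policy."""
    instruction: str  # e.g., "slow down and yield to the crossing pedestrian"
    reasoning: str    # short reasoning trace over the driving scene


class HighLevelVLM:
    """Hypothetical wrapper around a VLM that reasons about the scene."""

    def plan(self, image: np.ndarray, state_history: list[np.ndarray],
             route_cmd: str) -> MetaAction:
        # Conditions on the camera image, historical vehicle states, and the
        # routing command; emits a reasoning trace followed by a meta-action
        # in natural language (e.g., via a prompted VLM call).
        raise NotImplementedError


class LowLevelVLA:
    """Hypothetical steerable VLA policy that follows language instructions."""

    def act(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Returns an (N, 2) array of future waypoints; their path encodes the
        # vehicle's target position and their spacing its target speed.
        raise NotImplementedError


def drive_step(vlm: HighLevelVLM, vla: LowLevelVLA, image: np.ndarray,
               state_history: list[np.ndarray], route_cmd: str) -> np.ndarray:
    # One step of the hierarchy: the VLM reasons and produces a meta-action,
    # and the VLA grounds that instruction in waypoint-level control.
    meta = vlm.plan(image, state_history, route_cmd)
    return vla.act(image, meta.instruction)
```

In a closed-loop setup such as Bench2Drive, a downstream controller would typically track the predicted waypoints and convert them into steering and throttle commands.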
On Bench2Drive, we report overall performance and per-ability scores for SteerVLA across five challenging urban driving abilities. SteerVLA significantly outperforms prior approaches, benefiting from its improved reasoning and instruction-following capabilities.
We compare SteerVLA with the state-of-the-art method SimLingo on Bench2Drive-LongTail. SteerVLA exhibits larger performance gains in long-tail scenarios, likely because these cases require more complex reasoning and more precise control.
@article{gao2026steervlasteeringvisionlanguageactionmodels,
  title={SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios},
  author={Tian Gao and Celine Tan and Catherine Glossop and Timothy Gao and Jiankai Sun and Kyle Stachowicz and Shirley Wu and Oier Mees and Dorsa Sadigh and Sergey Levine and Chelsea Finn},
  year={2026},
  eprint={2602.08440},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.08440},
}