Preprint Article, Version 1. This version is not peer-reviewed.

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

Version 1 : Received: 6 November 2024 / Approved: 6 November 2024 / Online: 7 November 2024 (11:14:41 CET)

How to cite: Guruprasad, P.; Sikka, H.; Song, J.; Wang, Y.; Liang, P. Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks. Preprints 2024, 2024110494. https://doi.org/10.20944/preprints202411.0494.v1

Abstract

Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation. However, systematic evaluation of these models across diverse robotic tasks remains limited. In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models. We profile three state-of-the-art models (GPT-4o, OpenVLA, and JAT) across 20 diverse datasets from the Open-X-Embodiment collection, evaluating their performance on various manipulation tasks. Our analysis reveals several key insights: (1) current VLA models show significant variation in performance across different tasks and robot platforms, with GPT-4o demonstrating the most consistent performance through sophisticated prompt engineering; (2) all models struggle with complex manipulation tasks requiring multi-step planning; and (3) model performance is notably sensitive to action space characteristics and environmental factors. We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and identify critical areas for improvement in the development of general-purpose robotic systems.
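The abstract describes profiling models on offline robotic datasets by comparing predicted actions against recorded ground truth. As a rough illustration only (the paper's actual metric and interfaces are not specified here), the following minimal sketch scores a hypothetical policy by mean squared error over a dataset of (observation, action) pairs; the 7-DoF action dimension, `predict_fn` interface, and zero-action baseline policy are all assumptions for the example, not details from the paper.

```python
import numpy as np

def evaluate_model(predict_fn, episodes):
    """Score one policy on one dataset: mean squared error between
    predicted and recorded ground-truth actions, averaged over samples.
    `episodes` is an iterable of (observation, true_action) pairs."""
    errors = []
    for obs, true_action in episodes:
        pred = np.asarray(predict_fn(obs))
        errors.append(np.mean((pred - np.asarray(true_action)) ** 2))
    return float(np.mean(errors))

# Hypothetical stand-in for a VLA policy: always outputs zeros.
# The 7-dimensional action space is an assumption for illustration.
zero_policy = lambda obs: np.zeros(7)

# Toy dataset of (observation, ground-truth action) pairs.
episodes = [
    (None, np.ones(7)),   # per-sample MSE = 1.0
    (None, np.zeros(7)),  # per-sample MSE = 0.0
]
print(evaluate_model(zero_policy, episodes))  # → 0.5
```

In practice such a loop would be repeated per model and per dataset (e.g., over the 20 Open-X-Embodiment datasets) and the per-dataset scores aggregated for cross-platform comparison.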

Keywords

benchmark; machine learning; vision language model; large language model; vision language action; vla; robotic learning; offline RL; robotics; control

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
