Preprint Article Version 1 This version is not peer-reviewed

UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

Version 1: Received: 29 August 2024 / Approved: 29 August 2024 / Online: 30 August 2024 (03:42:19 CEST)

How to cite: Zhang, J.; Yu, Y.; Liao, M.; Li, W.; Wu, J.; Wei, Z. UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents. Preprints 2024, 2024082137. https://doi.org/10.20944/preprints202408.2137.v1

Abstract

Graphical User Interface (GUI) agents are expected to operate precisely on the screens of digital devices. Existing GUI agents rely only on the current visual observation and a plain-text action history, ignoring the significance of historical screens. To mitigate this issue, we propose UI-Hawk, a visual GUI agent specifically designed to process the screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder and an efficient resampler to handle screen sequences. To acquire a better understanding of screen streams, we define four fundamental tasks: UI grounding, UI referring, screen question answering, and screen summarization. We develop an automated data curation method to generate the corresponding training data for UI-Hawk. In addition, we create a benchmark, FunUI, to quantitatively evaluate the fundamental screen understanding ability of multimodal large language models (MLLMs). Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is not only beneficial but also essential for GUI navigation.

Keywords

GUI Agents; Screen Stream Understanding

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
