Graphical User Interface (GUI) automation holds significant promise for
enhancing human productivity by assisting with computer tasks. Existing task
formulations primarily focus on simple tasks that can be specified by a single,
language-only instruction, such as "Insert a new slide." In this work, we
introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI
assistants on visual-centric GUI tasks. Sourced from high-quality web
instructional videos, our benchmark focuses on tasks involving professional and
novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex
activities (e.g., video editing). VideoGUI evaluates GUI assistants through a
hierarchical process, allowing for identification of the specific levels at
which they may fail: (i) high-level planning: reconstruct procedural subtasks
from visual conditions without language descriptions; (ii) middle-level
planning: generate sequences of precise action narrations based on visual state
(i.e., screenshot) and goals; (iii) atomic action execution: perform specific
actions such as accurately clicking designated elements. For each level, we
design evaluation metrics across individual dimensions to provide clear
signals, such as individual performance in clicking, dragging, typing, and
scrolling for atomic action execution. Our evaluation on VideoGUI reveals that
even the state-of-the-art large multimodal model GPT4o performs poorly on
visual-centric GUI tasks, especially for high-level planning.

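VideoGUI evaluates GUI assistants on visual-centric graphical user interface (GUI) tasks and finds that the current state-of-the-art large multimodal model GPT4o performs poorly at high-level planning.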

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Large vision-language models (LVLMs) have recently achieved rapid progress,
exhibiting great perception and reasoning abilities concerning visual
information. However, when faced with prompts whose solution spaces differ in
size, LVLMs do not always give consistent answers about the same knowledge
point. This inconsistency of answers across solution spaces is prevalent in
LVLMs and erodes trust. To this end, we provide a multi-modal benchmark,
ConBench, to intuitively analyze how LVLMs perform when the solution space of
a prompt revolves around a knowledge point. Based on the
ConBench tool, we are the first to reveal the tapestry and report the following
findings: (1) In the discriminative realm, the larger the solution space of the
prompt, the lower the accuracy of the answers. (2) We establish the
relationship between the discriminative and generative realms: the accuracy of
the discriminative question type exhibits a strong positive correlation with
its Consistency with the caption. (3) Compared to open-source models,
closed-source models exhibit a pronounced bias advantage in terms of
Consistency. Finally, we improve the consistency of LVLMs by trigger-based
diagnostic refinement, indirectly improving the performance of their captions.
We hope this paper will help the research community better evaluate their
models and encourage future advancements in the consistency domain.

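Using the multi-modal benchmark ConBench, this study is the first to reveal the answer-consistency problem of large vision-language models under prompts with different solution spaces and, through trigger-based diagnostic refinement, indirectly improves the models' performance, strengthening their captioning ability.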

Unveiling the Tapestry of Consistency in Large Vision-Language Models

Efficient molecular modeling and design are crucial for the discovery and
exploration of novel molecules, and the incorporation of deep learning methods
has revolutionized this field. In particular, large language models (LLMs)
offer a fresh approach to tackling scientific problems from a natural language
processing (NLP) perspective, introducing a research paradigm called scientific
language modeling (SLM). However, two key issues remain: how to quantify the
match between model and data modalities and how to identify the
knowledge-learning preferences of models. To address these challenges, we
propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263
experiments to assess the model's compatibility with data modalities and
knowledge acquisition. Through the modal transition probability matrix, we
provide insights into the most suitable modalities for tasks. Furthermore, we
introduce a statistically interpretable approach to discover context-specific
knowledge mapping by localized feature filtering. Our pioneering analysis
offers an exploration of the learning mechanism and paves the way for advancing
SLM in molecular science.

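Using the multi-modal benchmark ChEBI-20-MM, we assess models' compatibility with data modalities and their knowledge acquisition, identify the most suitable modalities for each task through a modal transition probability matrix, and introduce a statistically interpretable approach that discovers context-specific knowledge mappings by localized feature filtering, shedding light on the learning mechanism of scientific language modeling in molecular science and on ways to advance it.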