For decades, human-computer interaction has fundamentally been manual. Even
today, almost all productive work done on the computer necessitates human input
at every step. Autonomous virtual agents represent an exciting step in
automating many of these menial tasks. Virtual agents would empower users with
limited technical proficiency to harness the full possibilities of computer
systems. They could also enable the efficient streamlining of numerous computer
tasks, ranging from calendar management to complex travel bookings, with
minimal human intervention. In this paper, we introduce OmniACT, a
first-of-its-kind dataset and benchmark for assessing an agent's capability to
generate executable programs to accomplish computer tasks. Our scope extends
beyond traditional web automation, covering a diverse range of desktop
applications. The dataset consists of fundamental tasks such as "Play the next
song", as well as longer-horizon tasks such as "Send an email to John Doe
mentioning the time and place to meet". Specifically, given a screen image
paired with a visually grounded natural language task, the goal is to generate a
script capable of fully executing the task. We run several strong baseline
language model agents on our benchmark. The strongest baseline, GPT-4, performs
best on our benchmark. However, its performance still reaches only 15% of
human proficiency in generating executable scripts capable of completing the
task, demonstrating how challenging our task is for conventional web agents.
Our benchmark provides a platform to measure and evaluate the progress of
language model agents in automating computer tasks and motivates future work
towards building multimodal models that bridge large language models and the
visual grounding of computer screens.

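Using the OmniACT dataset and benchmark, the study introduces a new way to assess an agent's ability to generate executable programs for computer tasks, and shows that the strongest current baseline language model agent (GPT-4) performs best on the benchmark. However, it reaches only 15% of human proficiency, which highlights how difficult it is for conventional web agents to generate executable scripts that complete the task. The benchmark provides a platform for measuring and evaluating the progress of language model agents in automating computer tasks, and motivates future work on multimodal models that bridge large language models and the visual grounding of computer screens.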

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Semantic parsing aims at translating natural language (NL) utterances into
machine-interpretable programs, which can be executed against a real-world
environment. The expensive annotation of utterance-program pairs has long been
acknowledged as a major bottleneck for the deployment of contemporary neural
models to real-life applications. In this work, we focus on the task of
semi-supervised learning where a limited amount of annotated data is available
together with many unlabeled NL utterances. Based on the observation that
programs which correspond to NL utterances must always be executable, we
propose to encourage a parser to generate executable programs for unlabeled
utterances. Due to the large search space of executable programs, conventional
methods that use approximations based on beam search, such as self-training and
top-k marginal likelihood training, do not perform as well. Instead, we view
the problem of learning from executions from the perspective of posterior
regularization and propose a set of new training objectives. Experimental
results on Overnight and GeoQuery show that our new objectives outperform
conventional methods, bridging the gap between semi-supervised and supervised
learning.

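This paper addresses semi-supervised semantic parsing, the task of mapping NL utterances to programs, and proposes a new approach: encouraging the parser to generate executable programs for unlabeled utterances. Viewing the problem from the perspective of posterior regularization, it proposes a set of new training objectives. Experiments show that these objectives outperform conventional methods and narrow the gap between semi-supervised and supervised learning.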
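
To make the two conventional beam-based baselines named in the abstract concrete, the following is a minimal sketch of how self-training and top-k marginal likelihood losses are commonly computed for one unlabeled utterance. It does not reproduce the paper's new posterior-regularization objectives. The names `is_executable`, `self_training_loss`, and `topk_mml_loss`, the toy arithmetic "executor", and the example beam are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

# A beam candidate: (program text, log-probability under the parser).
Candidate = Tuple[str, float]


def is_executable(program: str) -> bool:
    """Toy stand-in for a real executor (an assumption): a program counts as
    executable if it evaluates as a Python arithmetic expression."""
    try:
        eval(program, {"__builtins__": {}}, {})
        return True
    except Exception:
        return False


def self_training_loss(beam: List[Candidate],
                       executable: Callable[[str], bool] = is_executable) -> float:
    """Negative log-probability of the most probable executable program in the beam."""
    log_ps = [lp for prog, lp in beam if executable(prog)]
    if not log_ps:
        return 0.0  # no executable candidate, so no training signal
    return -max(log_ps)


def topk_mml_loss(beam: List[Candidate],
                  executable: Callable[[str], bool] = is_executable) -> float:
    """Negative log of the total probability of all executable programs in the beam."""
    log_ps = [lp for prog, lp in beam if executable(prog)]
    if not log_ps:
        return 0.0
    m = max(log_ps)  # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(lp - m) for lp in log_ps)))


# Hypothetical beam of three candidates; "1 / 0" fails at execution time.
beam = [("1 + 2", math.log(0.5)), ("1 / 0", math.log(0.3)), ("3 * 4", math.log(0.2))]
print(self_training_loss(beam))  # -log(0.5)
print(topk_mml_loss(beam))       # -log(0.5 + 0.2)
```

Both losses only see programs that survive in the beam, which is the approximation the abstract identifies as a weakness when the space of executable programs is large.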