Overview
Recent chatbots have demonstrated an impressive ability to understand and communicate in raw text. However, there is more to the world than raw text. For example, humans spend long hours on web pages, where text is intertwined with other modalities and tasks are accomplished through various complex interactions. Can state-of-the-art, self-supervised multi-modal models generalize to such complex domains?
To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work that relies on artificially synthesized web pages, we use natural HTML pages originally designed for crowdsourcing workers performing various annotation tasks. The HTML instructions of each task are instantiated with different values (obtained from the crowdsourcing tasks) to form new instances of the task. The benchmark contains 32.2K instances distributed across 158 tasks.
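To make the instantiation step concrete, here is a minimal sketch of turning one task template into a task instance. This is not the benchmark's actual pipeline; the `${...}` placeholder syntax, the example instruction, and the field names are illustrative assumptions.

```python
# Hypothetical sketch: instantiate an HTML task template with values collected
# from the original crowdsourcing task. Placeholder syntax and field names are
# assumptions, not the benchmark's actual format.
from string import Template

task_template = Template(
    "<p>Read the review below and rate its sentiment.</p>"
    "<blockquote>${review_text}</blockquote>"
    '<input type="radio" name="sentiment" value="positive"> Positive '
    '<input type="radio" name="sentiment" value="negative"> Negative'
)

# One row of values gathered from the crowdsourcing task.
instance_values = {"review_text": "The battery lasts all day and charges fast."}

# Substituting the values yields one concrete task instance.
instance_html = task_template.substitute(instance_values)
print(instance_html)
```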
Additionally, to facilitate evaluation on TurkingBench, we develop an evaluation framework that connects chatbot responses to modifications on web pages (e.g., modifying a text box or checking a radio button). We evaluate state-of-the-art models on this benchmark, testing a range of self-supervised models (language-only, vision-only, layout-only, and their combinations). Our findings reveal that these models perform significantly better than random chance, yet considerable room for improvement remains. We hope this benchmark will help facilitate the evaluation and development of web-based agents.
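As an illustration of the kind of response-to-page mapping such a framework performs, the sketch below fills a text box and checks a radio button in an HTML form based on a model's predicted answers. It is not the benchmark's evaluation code; the answer format, field names, and use of BeautifulSoup are assumptions made for the example.

```python
# Hypothetical sketch: apply a model's predicted answers as modifications to
# HTML form elements. Element names and the answer format are illustrative.
from bs4 import BeautifulSoup

html = """
<form>
  <input type="text" name="summary">
  <input type="radio" name="sentiment" value="positive">
  <input type="radio" name="sentiment" value="negative">
</form>
"""

# A model response, parsed into (field name -> predicted value) pairs.
predicted = {"summary": "A short answer.", "sentiment": "positive"}

soup = BeautifulSoup(html, "html.parser")
for name, value in predicted.items():
    for field in soup.find_all("input", attrs={"name": name}):
        if field.get("type") == "text":
            field["value"] = value                    # fill the text box
        elif field.get("type") == "radio" and field.get("value") == value:
            field["checked"] = ""                     # check the matching radio button

print(soup.prettify())
```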
Citation
@article{turkingbench2024xu,
  title={Tur[k]ingBench: A Tournament Among Web-based Agents},
  author={Xu, Kevin and Kordi, Yeganeh and Sanders, Kate and Wang, Yizhong and Byerly, Adam and Zhang, Jack and Van Durme, Benjamin and Khashabi, Daniel},
  year={2024},
  eprint={2403.11905},
  url={https://arxiv.org/abs/2403.11905},
  archivePrefix={arXiv},
}