News and In-Depth Articles
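Editor's note: With the rapid development of multimodal and reasoning models, agents that can autonomously understand and operate computer interfaces (Computer Use Agents, CUA) are becoming a reality. GUI grounding is the core step in realizing this capability: it determines whether an agent can accurately carry out concrete operations such as clicking and typing. However, existing models still achieve low accuracy on key benchmarks and remain some distance from practical use. In response, Microsoft Research Asia recently released a technical report that systematically analyzes GUI Grou…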
| Yale Song
The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online videos often provide imperfectly…
| Yale Song
Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…