Computer Vision Group

Empowering technologies for real-world vision-based systems

News & features

Articles

Phi-Ground model: Teaching AI to “read the screen”

November 6, 2025

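Editor's note: With the rapid progress of multimodal and reasoning models, agents that can autonomously understand and operate computer interfaces (Computer Use Agents, CUAs) are becoming a reality. GUI grounding is the core component of this capability: it determines whether an agent can accurately carry out concrete actions such as clicking and typing. However, existing models still score poorly on key benchmarks and remain some distance from practical use. In response, Microsoft Research Asia recently released a technical report that systematically analyzes GUI Grou…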

Microsoft Research Blog

ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos

October 28, 2021 | Yale Song

The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online videos often provide imperfectly…

Microsoft Research Blog

Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning

May 17, 2021 | Yale Song

Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…