Computer Vision Group

Empowering technologies for real-world vision-based systems

News & features

Article

The Phi-Ground model: Teaching AI to "read the screen"

November 6, 2025

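Editor's note: With the rapid development of multimodal and reasoning models, agents that can autonomously understand and operate computer interfaces (C…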

ACAV100M text on top of a series of small images.

Microsoft Research Blog

ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos

October 28, 2021 | Yale Song

The natural association between visual o…

A graphic depicting audio and video content items passing through an audio transformer layer and a video transformer layer, respectively, before being combined while passing through a multimodal transformer layer

Microsoft Research Blog

Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning

May 17, 2021 | Yale Song

Understanding video is one of the most c…