Interactive World Simulator
Learning to simulate the visual world from large-scale videos.
LLM2CLIP
LLM2CLIP is a novel approach that harnesses the power of LLMs to unlock CLIP’s potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings,…
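As a rough illustration of that caption-space contrastive step, the sketch below uses plain PyTorch with a toy stand-in encoder in place of the actual LLM; names such as `CaptionEncoder` and `caption_contrastive_loss` are illustrative and not from the LLM2CLIP release. Two captions of the same image are treated as a positive pair and trained with a symmetric InfoNCE loss, so the model's output embeddings become discriminative text features that a CLIP-style pipeline can consume.

```python
# Minimal sketch (not the official LLM2CLIP code) of caption-space contrastive
# fine-tuning: captions of the same image are positives; a symmetric InfoNCE
# loss pulls their embeddings together and pushes other captions apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionEncoder(nn.Module):
    """Stand-in for the LLM: embeds tokens and mean-pools into one vector."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)  # output head whose embeddings would feed CLIP

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)

def caption_contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: matched captions on the diagonal are the positives."""
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

encoder = CaptionEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Toy batch: two different caption token sequences describing the same 8 images.
captions_a = torch.randint(0, 1000, (8, 16))
captions_b = torch.randint(0, 1000, (8, 16))

loss = caption_contrastive_loss(encoder(captions_a), encoder(captions_b))
loss.backward()
optimizer.step()
```

In the actual method the encoder would be a pretrained LLM, and the fine-tuned output embeddings then replace CLIP's original text encoder; the sketch only shows the contrastive objective itself.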
Autoregressive Video Models
Driving large video models with next token prediction. In-context learning for vision data has been underexplored compared with that in natural language. Prior work studied image in-context learning, prompting models to generate a single…
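The sketch below, again illustrative PyTorch rather than the paper's code, shows the basic next-token-prediction objective this line of work builds on: frames are assumed to be tokenized into discrete visual tokens by a separate tokenizer (not shown), flattened into one sequence, and a causal transformer is trained to predict each token from the ones before it. The class name `VideoTokenLM` and all sizes are placeholders.

```python
# Minimal sketch (assumed interface, not the paper's code) of next-token
# prediction over discrete video tokens with a causal transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTokenLM(nn.Module):
    def __init__(self, vocab_size=8192, dim=256, n_layers=4, n_heads=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of discrete visual token ids
        seq_len = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # causal mask
        return self.head(self.blocks(x, mask=mask))

model = VideoTokenLM()
# Toy batch: 2 clips, each flattened to 4 frames x 64 tokens = 256 visual tokens.
video_tokens = torch.randint(0, 8192, (2, 256))
logits = model(video_tokens[:, :-1])               # predict token t from tokens < t
loss = F.cross_entropy(logits.reshape(-1, 8192), video_tokens[:, 1:].reshape(-1))
loss.backward()
```

At inference, the same model can be rolled out autoregressively, and demonstration clips can simply be prepended to the token sequence, which is what makes video in-context learning possible without any task-specific fine-tuning.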