EgoMemory: Memory-Augmented Personalized Retrieval for Long-Context Egocentric Video
- Yuanmin Tang,
- Jue Zhang,
- Xiaoting Qin,
- Jing Yu,
- Meikang Qiu,
- Gaopeng Gou,
- Gang Xiong,
- Qingwei Lin,
- Saravan Rajmohan,
- Dongmei Zhang,
- Qi Wu
ACL Findings
Recent advances in AI and wearable devices, such as augmented-reality glasses, have made it possible to augment human memory by retrieving personal experiences in response to natural language queries. However, existing egocentric video datasets fall short of supporting the personalization and long-context reasoning that episodic memory retrieval requires. To address these limitations, we introduce EgoMemory, a benchmark derived from Ego4D and enriched with 165,795 user-specific object annotations across 245 videos from 45 participants, yielding 639 distinct, human-curated and evaluated queries for rich, individualized episodic memory retrieval. Leveraging this resource, we present EgoRetriever, a novel training-free retrieval framework that combines Multimodal Large Language Models with reflective Chain-of-Thought prompting. Our approach interprets user intent and, by leveraging contextualized personal memory, generates detailed target-video descriptions that drive retrieval. Extensive experiments on three benchmarks, EgoMemory, EgoCVR, and EgoLife, demonstrate that EgoRetriever consistently and substantially outperforms state-of-the-art baselines, highlighting its strong generalizability and practical potential for personalized, long-context egocentric video retrieval.
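To make the described pipeline concrete, below is a minimal sketch of a reflective Chain-of-Thought retrieval loop in the spirit of EgoRetriever: infer intent from the query and personal memory, reflectively refine a target-scene description, then rank clips by text-video embedding similarity. This is an illustrative sketch under assumptions, not the paper's implementation; all names (`query_mllm`, `embed_text`, `Clip`) and prompts are hypothetical placeholders.

```python
# Hypothetical sketch of memory-conditioned, reflective CoT retrieval.
# query_mllm / embed_text are placeholders for an MLLM chat call and a
# text encoder aligned with a precomputed video embedding space.
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    embedding: list[float]  # precomputed video embedding

def query_mllm(prompt: str) -> str:
    """Placeholder for a multimodal LLM call (assumed, not the paper's API)."""
    raise NotImplementedError

def embed_text(text: str) -> list[float]:
    """Placeholder for a text encoder in the shared text-video space."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity; guards against zero-length vectors.
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(query: str, memory: str, clips: list[Clip], rounds: int = 2) -> Clip:
    # Step 1: infer user intent and draft a target-scene description,
    # conditioned on the user's contextualized personal memory.
    description = query_mllm(
        f"Personal memory:\n{memory}\n\nQuery: {query}\n"
        "Reason step by step about what the user means, then describe "
        "the target video moment in concrete visual detail."
    )
    # Step 2: reflective refinement, asking the model to critique and
    # revise its own draft before retrieval.
    for _ in range(rounds):
        description = query_mllm(
            f"Query: {query}\nDraft description: {description}\n"
            "Reflect: is anything ambiguous or missing? Output a revised, "
            "more specific description."
        )
    # Step 3: rank candidate clips by text-video embedding similarity.
    q = embed_text(description)
    return max(clips, key=lambda c: cosine(q, c.embedding))
```

In this sketch the description, not the raw query, is what gets embedded, which is one plausible way a training-free method can inject personal context into an off-the-shelf retrieval backbone without fine-tuning.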