CVPR 2025 Photorealistic Avatar Challenge

Photorealistic avatars are human avatars that look, move, and talk like real people. They can be used in applications such as telecommunication, health care, education, retail and e-commerce, and entertainment. Research and publications on photorealistic avatars have grown significantly: CVPR 2024 alone included 36 papers on photorealistic avatars using a variety of methods. While a few common test sets exist, there is no single test set that all papers use. In addition, only quantitative metrics (PSNR, SSIM, and LPIPS) are used, and these have well-known limitations such as weak correlation to subjective realism. None of the CVPR 2024 photorealistic avatar papers included subjective tests of avatar performance, in part because this is a challenging task, but also because there is no standardized or readily available method to do so.

In [1] we define and implement the first multidimensional measurement of photorealistic avatar quality of experience. We provide an open-source implementation of the subjective test framework based on our extension to ITU-T P.910. We include subjective measurements of avatar realism, affinity, trust, comfortableness using, comfortableness interacting, appropriateness for work, creepiness, formality, resemblance to the person, emotion accuracy, and gesture accuracy. We show that, except for resemblance, the correlation of these subjective metrics to PSNR, SSIM, and LPIPS is weak; the correlation for emotion accuracy is moderate. For example, the avatar with the best PSNR, SSIM, and LPIPS in [1] (MS3_0 in Table 4) is only average in terms of the subjective metrics. In other words, the objective metrics PSNR, SSIM, and LPIPS cannot be used to accurately rank the subjective performance of photorealistic avatars. The crowdsourced subjective test framework we developed has been shown to be highly reproducible and accurate compared to a panel of experts. We also found that for avatars above a certain level of realism (mean opinion score > 2.5 on a 1-5 scale) these measured dimensions are highly correlated; in particular, for photorealistic avatars there is a linear relationship between avatar affinity and realism.
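To illustrate how such a correlation analysis can be run against one's own results, the sketch below computes Pearson and Spearman correlations between an objective metric and a subjective mean opinion score. The arrays are hypothetical placeholders, not data from [1].

```python
# Minimal sketch: correlating an objective metric (e.g., PSNR) with
# subjective MOS across avatars. The arrays below are hypothetical
# placeholders, not data from [1].
import numpy as np
from scipy.stats import pearsonr, spearmanr

psnr = np.array([28.1, 30.4, 25.7, 27.9, 31.2])    # per-avatar PSNR (dB)
mos_realism = np.array([3.1, 2.8, 3.5, 2.9, 3.0])  # per-avatar realism MOS (1-5)

r, r_p = pearsonr(psnr, mos_realism)
rho, rho_p = spearmanr(psnr, mos_realism)
print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```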

This challenge will provide a test set and methodology to subjectively evaluate photorealistic avatars for news anchor and telecommunication scenarios. Test subjects will be sitting or standing, but only the upper half of the body is rendered. The sequences include speech, facial emotions, head turns, and hand gestures.

Tracks 1 and 2

The challenge includes two half-body tracks: (1) real-time and (2) non-real-time. Tracks 1 and 2 are evaluated on the same test set and target a face and upper-torso avatar, including hands. The real-time track must be evaluated on an NVIDIA RTX 4090 or equivalent GPU and must render the avatar at 1080p 30 FPS (each frame must be captured, processed, and rendered in less than 33 ms total). Teams can submit entries to both tracks 1 and 2, but for the non-real-time track the total processing time per frame must be greater than 33 ms. The avatars must be causal: they may not use any future frames when rendering the current frame.
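One way to sanity-check the real-time constraint is to time the full capture-process-render path per frame. The sketch below is only illustrative; `render_frame` is a hypothetical stand-in for a team's own per-frame pipeline.

```python
# Minimal sketch: verifying the 30 FPS (<= 33 ms/frame) budget for the
# real-time track. render_frame is a hypothetical stand-in for a team's
# capture -> process -> render pipeline for a single frame.
import time

FRAME_BUDGET_S = 1.0 / 30.0  # ~33.3 ms per frame at 30 FPS

def render_frame(frame_index: int) -> None:
    """Placeholder for one frame of avatar processing and rendering."""
    time.sleep(0.005)  # simulate 5 ms of work

over_budget = 0
n_frames = 300
for i in range(n_frames):
    t0 = time.perf_counter()
    render_frame(i)
    elapsed = time.perf_counter() - t0
    if elapsed > FRAME_BUDGET_S:
        over_budget += 1

print(f"{over_budget}/{n_frames} frames exceeded the 33 ms budget")
```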

The challenge metric is the mean of the subjective dimensions defined in [1] with the addition of gesture accuracy. Specifically:

Challenge metric = mean(realism, resemblance to the person, emotion accuracy, gesture accuracy)

At least N=30 ratings per clip will be used for ranking the entries, and statistical tests will be done to determine ties. Additional metrics defined in [1] will be measured but will not be used as the challenge metric.
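As a rough illustration, the sketch below computes the challenge metric from per-clip MOS arrays and runs a paired t-test between two entries. The arrays are hypothetical, and the t-test is only one plausible choice; the organizers' exact tie-determination procedure is not specified here.

```python
# Minimal sketch: computing the challenge metric from per-clip MOS and
# testing whether two entries are statistically tied. Arrays are
# hypothetical; the paired t-test is illustrative only.
import numpy as np
from scipy.stats import ttest_rel

def challenge_metric(realism, resemblance, emotion_acc, gesture_acc):
    """Mean of the four subjective dimensions, each a per-clip MOS array."""
    return np.mean([np.mean(realism), np.mean(resemblance),
                    np.mean(emotion_acc), np.mean(gesture_acc)])

# Per-clip overall scores for two entries over the same test clips.
entry_a = np.array([3.4, 3.1, 3.6, 2.9, 3.2])
entry_b = np.array([3.2, 3.0, 3.5, 3.0, 3.1])

t, p = ttest_rel(entry_a, entry_b)
verdict = "statistical tie" if p >= 0.05 else "significant difference"
print(f"t = {t:.2f}, p = {p:.3f} -> {verdict}")
```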

The test set will consist of data from 10 people: 5 males and 5 females, with a mix of Caucasian, Asian, and Black subjects for diversity. Each person has a 60-second enrollment clip with head motions, expressions, and speaking. The enrollment clip may be captured on different days than the test set. The test set consists of a 10-second speaking clip, a 15-second non-speaking clip that includes 6 emotions (happy, sad, anger, fear, surprised, and disgust), and a 20-second clip that includes hand gestures and head turns up to 90 degrees. An initial test set of 5 subjects will be provided at the beginning of the challenge, and the final test set will be provided 1 week before challenge submissions are due. The enrollment and test clips will be captured with a white background. The clips will be captured on recent iPhone cameras and will be provided in 4K 30 FPS. Challenge submissions must be provided in 1080p mp4 format with CRF=17. The baseline avatars will be [2] (which does well with gestures but is not as realistic as live video) and [3] (which is very realistic but does not handle gestures).
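A team could produce a compliant submission encode with an ffmpeg call like the one sketched below (ffmpeg must be installed separately). The file names are hypothetical, and any settings beyond 1080p and CRF=17 are illustrative choices, not challenge requirements.

```python
# Minimal sketch: encoding a rendered clip to the required 1080p mp4
# with CRF=17 using ffmpeg. File names are hypothetical; settings
# beyond 1080p/CRF=17 are illustrative.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "rendered_avatar_raw.mp4",   # team's rendered output
    "-vf", "scale=1920:1080",          # 1080p as required
    "-c:v", "libx264",
    "-crf", "17",                      # constant rate factor per the rules
    "-r", "30",                        # 30 FPS
    "submission.mp4",
], check=True)
```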

The test set is available at the CVPR 2025 Photorealistic Avatar Challenge test set.

Challenge input/output

The general process for the challenge is as follows (see the code sketch after the figure below):

  1. Create an avatar model
  2. For each enrollment clip
    • Enroll the avatar with the enrollment clip
    • For each test clip for that enrollment clip
      • Drive the enrolled avatar with the test clip
      • The rendered avatar should be shown from two viewpoints: 0 degrees (frontal) and 45 degrees. See the figure below.
      • The rendered avatar should look like the person in the enrollment clip. The enrollment clip shows the same person as the test clip but will likely be captured on a different day, with the person dressed differently.
Figure: render viewpoints at 0 degrees (frontal) and 45 degrees.
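In code form, the enrollment and driving loop above might look like the following sketch. The classes and functions are hypothetical stubs standing in for a team's own avatar system; only the control flow mirrors the challenge steps.

```python
# Minimal sketch of the challenge I/O loop. AvatarModel, EnrolledAvatar,
# and save_mp4 are hypothetical stubs; only the control flow mirrors
# the challenge process.
from dataclasses import dataclass

VIEWPOINTS_DEG = (0, 45)  # frontal and 45-degree renders required

@dataclass
class Clip:
    clip_id: str
    path: str

class EnrolledAvatar:
    def drive(self, test_clip: Clip, viewpoint_deg: int) -> bytes:
        """Hypothetical: animate the enrolled avatar with the test clip."""
        return b""  # placeholder for rendered 1080p 30 FPS video

class AvatarModel:
    def enroll(self, enrollment_clip: Clip) -> EnrolledAvatar:
        """Hypothetical: adapt the avatar model to one person."""
        return EnrolledAvatar()

def save_mp4(video: bytes, filename: str) -> None:
    with open(filename, "wb") as f:
        f.write(video)

model = AvatarModel()                       # 1. create an avatar model
enrollment = Clip("subject01_enroll", "enroll.mp4")
test_clips = [Clip("subject01_speaking", "speaking.mp4")]

enrolled = model.enroll(enrollment)         # 2. enroll with the enrollment clip
for test_clip in test_clips:                # 3. drive with each test clip
    for viewpoint in VIEWPOINTS_DEG:        #    render both viewpoints
        video = enrolled.drive(test_clip, viewpoint)
        save_mp4(video, f"{test_clip.clip_id}_{viewpoint}deg.mp4")
```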

Figure: illustration of the output from the evaluations [1].

Track 3

Track 3 is identical to Track 2 except the video input includes only the head and upper torso and does not include hand gestures. The challenge metric is:

Challenge metric = mean(realism, resemblance to the person, emotion accuracy)

The test set for track 3 will be very similar to tracks 1 and 2 and will include similar enrollment data. Track 3 must render the avatar at 1080p 30 FPS and at the two viewpoints described in tracks 1 and 2.

Teams can enter all tracks.

References

  1. R. Cutler, B. Naderi, V. Gopal, and D. Palle, “A Multidimensional Measurement of Photorealistic Avatar Quality of Experience,” arXiv preprint arXiv:2411.09066, 2024
  2. Z. Huang, F. Tang, Y. Zhang, X. Cun, J. Cao, J. Li, and T.-Y. Lee, “Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework,” in CVPR, 2024
  3. J. Guo, D. Zhang, X. Liu, Z. Zhong, Y. Zhang, P. Wan, and D. Zhang, “LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control,” July 2024, arXiv:2407.03168.