In contrast to traditional GAN-based image-to-image translation approaches for full-head reenactment (a) and recent NeRF-based video portrait rendering methods (b), we propose Dynamic Neural Portraits, a novel paradigm for controllable video portrait synthesis composed of an MLP and a CNN-based decoder (c).
Our method draws inspiration from implicit neural representations (INR) and conditionally independent pixel synthesis (CIPS). An intuitive solution for modelling the video portrait with a neural network would be to estimate the RGB value of each pixel of the i-th video frame with an MLP, using the pixel's coordinates together with the tracked pose and expression parameters as input.
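A minimal PyTorch sketch of this naive per-pixel formulation is given below; the pose and expression dimensions are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class NaivePixelMLP(nn.Module):
    """Sketch: maps (pixel coordinates, pose, expression) directly to an RGB value."""
    def __init__(self, n_pose=6, n_exp=50, hidden=256):  # dimensions are assumptions
        super().__init__()
        in_dim = 2 + n_pose + n_exp  # (x, y) + tracked pose + expression parameters
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, coords, pose, exp):
        # coords: (N, 2), pose: (N, n_pose), exp: (N, n_exp)
        return torch.sigmoid(self.net(torch.cat([coords, pose, exp], dim=-1)))
```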
However, we found that we can obtain superior results in terms of visual quality and rendering speed by combining the MLP with a convolutional decoder network. Following the paradigm of recent 3D-aware GANs, instead of predicting RGB colour values we propose an MLP that maps its input to a visual feature vector. We first evaluate the MLP network at each spatial position and accumulate the resulting features in a visual feature map. Then, we employ a CNN-based decoder, which receives the feature map and performs up-sampling to synthesise the output frame. Primarily, we condition our generative model on expression blendshapes; nonetheless, we show that our system can also be successfully driven by audio features.
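The hybrid design can be sketched as follows. This is a minimal illustration, not the exact architecture: the feature dimension, hidden sizes, number of upsampling stages and the `render_frame` helper are assumptions made for the example.

```python
import torch
import torch.nn as nn

class FeatureMLP(nn.Module):
    """Sketch: maps (pixel coordinates, pose, expression) to a visual feature vector."""
    def __init__(self, n_pose=6, n_exp=50, feat_dim=64, hidden=256):  # assumed sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + n_pose + n_exp, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords, cond):
        # coords: (H*W, 2) pixel grid, cond: (H*W, n_pose + n_exp) broadcast per pixel
        return self.net(torch.cat([coords, cond], dim=-1))

class ConvDecoder(nn.Module):
    """Sketch: upsamples the low-resolution feature map into the output frame."""
    def __init__(self, feat_dim=64, n_up=2):
        super().__init__()
        layers, ch = [], feat_dim
        for _ in range(n_up):
            layers += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(ch, ch // 2, 3, padding=1), nn.LeakyReLU(0.2)]
            ch //= 2
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, feat_map):
        return torch.sigmoid(self.net(feat_map))

def render_frame(mlp, decoder, pose, exp, h=64, w=64):
    """Evaluate the MLP on a pixel grid, form a feature map, then decode the frame."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)               # (H*W, 2)
    cond = torch.cat([pose, exp], dim=-1).expand(coords.shape[0], -1)   # same conditioning per pixel
    feats = mlp(coords, cond)                                           # (H*W, feat_dim)
    feat_map = feats.reshape(h, w, -1).permute(2, 0, 1).unsqueeze(0)    # (1, feat_dim, H, W)
    return decoder(feat_map)                                            # (1, 3, H*2**n_up, W*2**n_up)

# Example usage (shapes only):
# mlp, dec = FeatureMLP(), ConvDecoder()
# frame = render_frame(mlp, dec, torch.zeros(1, 6), torch.zeros(1, 50))  # (1, 3, 256, 256)
```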
If you find this work useful in your research, please cite our paper:
```bibtex
@inproceedings{doukas2023dynamic,
  title     = {Dynamic Neural Portraits},
  author    = {Doukas, Michail Christos and Ploumpis, Stylianos and Zafeiriou, Stefanos},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages     = {4073--4083},
  year      = {2023}
}
```