Publications


Dynamic Neural Portraits

Michail Christos Doukas, Stylianos Ploumpis, Stefanos Zafeiriou
In IEEE/CVF Winter Conference on Applications of Computer Vision 2023 (WACV 2023)

Abstract: We present Dynamic Neural Portraits, a novel approach to the problem of full-head reenactment. Our method generates photo-realistic video portraits by explicitly controlling head pose, facial expressions and eye gaze. Our proposed architecture is different from existing methods that rely on GAN-based image-to-image translation networks for transforming renderings of 3D faces into photo-realistic images. Instead, we build our system upon a 2D coordinate-based MLP with controllable dynamics. Our intuition to adopt a 2D-based representation, as opposed to recent 3D NeRF-like systems, stems from the fact that video portraits are captured by monocular stationary cameras; therefore, only a single viewpoint of the scene is available. We primarily condition our generative model on expression blendshapes; nonetheless, we show that our system can be successfully driven by audio features as well. Our experiments demonstrate that the proposed method is 270 times faster than recent NeRF-based reenactment methods, with our networks achieving speeds of 24 fps for resolutions up to 1024 x 1024, while outperforming prior works in terms of visual quality.
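
For intuition, here is a minimal sketch of a conditional 2D coordinate-based MLP of this kind, written in PyTorch. It is not the paper's implementation; the layer widths, the positional-encoding setup and the 56-dimensional control vector (blendshapes, pose, gaze) are illustrative assumptions.

```python
# Minimal sketch of a conditional 2D coordinate-based MLP: each pixel's RGB value
# is predicted from its positionally encoded (x, y) coordinates concatenated with
# a per-frame control vector (blendshapes, head pose, gaze). All sizes are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

def positional_encoding(xy: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Fourier features of 2D pixel coordinates normalised to [-1, 1]."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=xy.dtype, device=xy.device)) * torch.pi
    angles = xy[..., None] * freqs                       # (..., 2, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class CoordinatePortraitMLP(nn.Module):
    def __init__(self, cond_dim: int = 56, hidden: int = 256, num_freqs: int = 10):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 4 * num_freqs + cond_dim                # encoded (x, y) + frame controls
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),          # RGB in [0, 1]
        )

    def forward(self, xy: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # xy: (N, 2) pixel coordinates, cond: (N, cond_dim) per-frame controls
        return self.net(torch.cat([positional_encoding(xy, self.num_freqs), cond], dim=-1))

# Render one frame by querying the MLP at every pixel of a 256 x 256 grid.
model = CoordinatePortraitMLP()
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 256), torch.linspace(-1, 1, 256), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
controls = torch.zeros(coords.shape[0], 56)              # dummy blendshape/pose/gaze vector
frame = model(coords, controls).reshape(256, 256, 3)
```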

[Project Page] [arXiv]

imgs/DynamicNeuralPortraits.png

Video Demo:


Free-HeadGAN: Neural Talking Head Synthesis with Explicit Gaze Control

Michail Christos Doukas, Evangelos Ververas, Viktoriia Sharmanska, Stefanos Zafeiriou
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 7 March 2023
DOI: 10.1109/TPAMI.2023.3253243

Abstract: We present Free-HeadGAN, a person-generic neural talking head synthesis system. We show that modeling faces with sparse 3D facial landmarks is sufficient for achieving state-of-the-art generative performance, without relying on strong statistical priors of the face, such as 3D Morphable Models. Apart from 3D pose and facial expressions, our method is capable of fully transferring the eye gaze from a driving actor to a source identity. Our complete pipeline consists of three components: a canonical 3D key-point estimator that regresses 3D pose and expression-related deformations, a gaze estimation network and a generator that is built upon the architecture of HeadGAN. We further experiment with an extension of our generator that accommodates few-shot learning through an attention mechanism, when multiple source images are available. Compared to recent methods for reenactment and motion transfer, our system achieves higher photo-realism combined with superior identity preservation, while offering explicit gaze control.
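
The abstract describes a three-stage pipeline; the sketch below illustrates only the data flow between such components in PyTorch. The module bodies are tiny placeholders (assumptions), not the released Free-HeadGAN networks.

```python
# Illustrative data-flow sketch (not the released implementation) of a three-stage
# pipeline like the one described above: a canonical 3D keypoint estimator, a gaze
# estimation network and a HeadGAN-style generator. Only the interfaces are meant
# to mirror the description; the internals are toy placeholders.
import torch
import torch.nn as nn

class KeypointEstimator(nn.Module):
    """Regresses sparse 3D landmarks (pose- and expression-dependent) from an image."""
    def __init__(self, num_kp: int = 68):
        super().__init__()
        self.num_kp = num_kp
        self.net = nn.Sequential(nn.Conv2d(3, 16, 7, stride=4, padding=3), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(16, num_kp * 3))

    def forward(self, img):
        return self.net(img).view(-1, self.num_kp, 3)           # (B, 68, 3)

class GazeEstimator(nn.Module):
    """Predicts a gaze code (e.g. yaw/pitch per eye) from an image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 7, stride=4, padding=3), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))

    def forward(self, img):
        return self.net(img)                                     # (B, 4)

class Generator(nn.Module):
    """Renders the source identity under the driving keypoints and gaze (placeholder)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(3 + 2, 3, 3, padding=1)

    def forward(self, source_img, driving_kp, driving_gaze):
        b, _, h, w = source_img.shape
        # A real generator would rasterise keypoints/gaze into spatial maps and
        # warp/refine the source image; here we just broadcast scalar codes.
        kp_code = driving_kp.mean(dim=(1, 2)).view(b, 1, 1, 1).expand(b, 1, h, w)
        gaze_code = driving_gaze.mean(dim=1).view(b, 1, 1, 1).expand(b, 1, h, w)
        return torch.tanh(self.fuse(torch.cat([source_img, kp_code, gaze_code], dim=1)))

source, driving = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
kp_net, gaze_net, generator = KeypointEstimator(), GazeEstimator(), Generator()
reenacted = generator(source, kp_net(driving), gaze_net(driving))    # (1, 3, 256, 256)
```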

[Paper] [arXiv (pre-print)]

imgs/free_headGAN-system.png

imgs/free_headGAN-teaser.png


HeadGAN: One-shot Neural Head Synthesis and Editing

Michail Christos Doukas, Stefanos Zafeiriou, Viktoriia Sharmanska
In IEEE/CVF International Conference on Computer Vision 2021 (ICCV 2021)

Abstract: Recent attempts to solve the problem of head reenactment using a single reference image have shown promising results. However, most of them either perform poorly in terms of photo-realism, fail to preserve the identity of the reference subject, or do not fully transfer the driving pose and expression. We propose HeadGAN, a novel system that conditions synthesis on 3D face representations, which can be extracted from any driving video and adapted to the facial geometry of any reference image, disentangling identity from expression. We further improve mouth movements by utilising audio features as a complementary input. The 3D face representation enables HeadGAN to be further used as an efficient method for compression and reconstruction, as well as a tool for expression and pose editing.
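
As a rough illustration of the compression-and-reconstruction use case mentioned above, the back-of-the-envelope script below compares a per-frame 3D face parameter payload with an uncompressed frame. The parameter counts and resolution are assumptions for illustration, not HeadGAN's actual representation.

```python
# Back-of-the-envelope sketch of why a 3D face representation enables compression:
# instead of transmitting frames, one can transmit only per-frame expression and
# pose parameters and reconstruct frames with the generator on the receiver side.
expression_params = 64          # assumed number of expression coefficients per frame
pose_params = 6                 # rotation + translation
bytes_per_param = 2             # float16

payload = (expression_params + pose_params) * bytes_per_param        # bytes per frame
raw_frame = 256 * 256 * 3                                             # uncompressed RGB bytes

print(f"parameters per frame : {payload} bytes")
print(f"raw 256x256 RGB frame: {raw_frame} bytes")
print(f"compression factor   : ~{raw_frame / payload:.0f}x (before any video codec)")
```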

[Project Page] [Code] [arXiv]

imgs/4.png

Video Demo:


Head2HeadFS: Video-based Head Reenactment with Few-shot Learning

Michail Christos Doukas, Mohammad Rami Koujan, Viktoriia Sharmanska, Stefanos Zafeiriou

Abstract: Over the past years, a substantial amount of work has been done on the problem of facial reenactment, with the solutions coming mainly from the graphics community. Head reenactment is an even more challenging task, which aims at transferring not only the facial expression, but also the entire head pose from a source person to a target. Current approaches either train person-specific systems, or use facial landmarks to model human heads, a representation that might transfer unwanted identity attributes from the source to the target. We propose head2headFS, a novel, easily adaptable pipeline for head reenactment. We condition synthesis of the target person on dense 3D face shape information from the source, which enables high-quality expression and pose transfer. Our video-based rendering network is fine-tuned under a few-shot learning strategy, using only a few samples. This allows for fast adaptation of a generic generator trained on a multi-person dataset into a person-specific one.
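
The sketch below illustrates the few-shot adaptation idea in PyTorch: a generator pretrained on a multi-person dataset is fine-tuned on a handful of frames of the target person. The generator, data and loss are generic placeholders, not the head2headFS networks.

```python
# Minimal sketch of few-shot adaptation: start from a generator trained on a
# multi-person dataset and fine-tune it on only a few frames of the target person.
# Everything below is a simplified placeholder, not the head2headFS system.
import torch
import torch.nn as nn

generic_generator = nn.Sequential(             # stands in for a pretrained video renderer
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
)

# A few (conditioning, target-frame) pairs of the new person, e.g. 8 samples.
few_shot_inputs = torch.rand(8, 3, 128, 128)        # rendered dense 3D face shape maps
few_shot_targets = torch.rand(8, 3, 128, 128) * 2 - 1   # corresponding real frames in [-1, 1]

optimiser = torch.optim.Adam(generic_generator.parameters(), lr=1e-4)
l1 = nn.L1Loss()

for step in range(100):                        # short fine-tuning schedule
    pred = generic_generator(few_shot_inputs)
    loss = l1(pred, few_shot_targets)          # a full system would add adversarial terms
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```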

[arXiv]

imgs/5.png

Video Demo:


Head2Head++: Deep Facial Attributes Re-Targeting

Michail Christos Doukas*, Mohammad Rami Koujan*, Viktoriia Sharmanska, Anastasios Roussos, Stefanos Zafeiriou
Published in IEEE Transactions on Biometrics, Behavior, and Identity Science (Volume: 3, Issue: 1, Jan. 2021)
DOI: 10.1109/TBIOM.2021.3049576
(* denotes equal contribution)

Abstract: Facial video re-targeting is a challenging problem, aiming to seamlessly modify the facial attributes of a target subject according to a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method differs from purely 3D model-based approaches, as well as from recent image-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames. We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos, with the aid of a sequential Generator and an ad-hoc Dynamics Discriminator network. We conduct a comprehensive set of quantitative and qualitative tests and demonstrate experimentally that our proposed method can successfully transfer facial expressions, head pose and eye gaze from a source video to a target subject, in a photo-realistic and faithful fashion, better than other state-of-the-art methods. Most importantly, our system performs end-to-end reenactment at near real-time speed (18 fps).
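
The snippet below sketches the temporal-adversarial component described above: a dynamics discriminator that scores short clips of consecutive frames, so the generator is penalised for temporally inconsistent output. The architecture and loss are simplified assumptions, not the paper's networks.

```python
# Sketch of a "dynamics" discriminator: besides any per-frame critic, a clip-level
# discriminator classifies stacks of K consecutive frames as real or synthesised,
# pushing the generator towards temporally consistent videos.
import torch
import torch.nn as nn

class DynamicsDiscriminator(nn.Module):
    """Classifies a clip of K consecutive frames as real or synthesised."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(k, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, clip):                    # clip: (B, 3, K, H, W)
        return self.net(clip)

disc = DynamicsDiscriminator(k=3)
real_clip = torch.rand(2, 3, 3, 128, 128)       # 2 clips of 3 consecutive real frames
fake_clip = torch.rand(2, 3, 3, 128, 128)       # 2 clips of 3 consecutive generated frames

bce = nn.BCEWithLogitsLoss()
d_loss = bce(disc(real_clip), torch.ones(2, 1)) + bce(disc(fake_clip), torch.zeros(2, 1))
g_dyn_loss = bce(disc(fake_clip), torch.ones(2, 1))   # generator tries to fool the clip critic
```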

[Paper] [Code] [arXiv]

imgs/3.png

Video Demo:


ReenactNet: Real-time Full Head Reenactment

Mohammad Rami Koujan*, Michail Christos Doukas*, Anastasios Roussos, Stefanos Zafeiriou
In IEEE International Conference on Automatic Face and Gesture Recognition 2020 (FG 2020).
DOI: 10.1109/FG47880.2020.00049
(* denotes equal contribution)

Abstract: Video-to-video synthesis is a challenging problem aiming at learning a translation function between a sequence of semantic maps and a photo-realistic video depicting the characteristics of a driving video. We propose a head-to-head system capable of fully transferring the 3D head pose, facial expressions and eye gaze from a source to a target actor, while preserving the identity of the target actor. Our system produces high-fidelity, temporally smooth and photo-realistic synthetic videos, faithfully transferring the time-varying head attributes from the source to the target actor. Our implementation: 1) works in real time (~20 fps), 2) runs on a commodity laptop with a webcam as the only input, 3) is interactive, allowing the participant to drive a target person, e.g. a celebrity or politician, instantly by varying their expressions, head pose and eye gaze, while visualising the synthesised video concurrently.
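
A rough sketch of the interactive loop such a system implies is shown below; extract_attributes and reenact are hypothetical placeholders standing in for the actual pose/expression/gaze tracker and the neural renderer.

```python
# Rough sketch of the interactive loop: capture a webcam frame, extract the driving
# attributes, synthesise the target actor and display the result in real time.
import cv2  # OpenCV, used here only for webcam capture and display

def extract_attributes(frame):
    # Placeholder: would return head pose, expression and eye-gaze parameters.
    return frame

def reenact(attributes):
    # Placeholder: would render the target actor under the driving attributes.
    return attributes

cap = cv2.VideoCapture(0)                     # commodity laptop webcam as the only input
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    synthesised = reenact(extract_attributes(frame))
    cv2.imshow("reenacted target", synthesised)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press q to stop driving the target
        break
cap.release()
cv2.destroyAllWindows()
```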

[Paper] [arXiv]

Video Demo:


Head2Head: Video-based Neural Head Synthesis

Mohammad Rami Koujan*, Michail Christos Doukas*, Anastasios Roussos, Stefanos Zafeiriou
In IEEE International Conference on Automatic Face and Gesture Recognition 2020 (FG 2020).
DOI: 10.1109/FG47880.2020.00048
(* denotes equal contribution)

Abstract: In this paper, we propose a novel machine learning architecture for facial reenactment. In particular, in contrast to model-based approaches or recent frame-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames, we propose a novel method that (a) exploits the special structure of facial motion (paying particular attention to mouth motion) and (b) enforces temporal consistency. We demonstrate that the proposed method can transfer the facial expressions, pose and gaze of a source actor to a target video in a photo-realistic fashion, more accurately than state-of-the-art methods.

[Paper] [Code] [arXiv]

imgs/2.png

Video Demo:


Video-to-Video Translation for Visual Speech Synthesis

Michail Christos Doukas, Viktoriia Sharmanska, Stefanos Zafeiriou

Abstract: Despite the remarkable success of generative adversarial networks (GANs) in image-to-image translation, very few attempts have been made at video domain translation. We study the task of video-to-video translation in the context of visual speech generation, where the goal is to transform an input video of any spoken word into an output video of a different word. This is a multi-domain translation problem, where each word forms a domain of videos uttering that word. Adapting the state-of-the-art image-to-image translation model (StarGAN) to this setting falls short when the vocabulary is large. Instead, we propose to use character encodings of the words and design a novel character-based GAN architecture for video-to-video translation, called Visual Speech GAN (ViSpGAN). We are the first to demonstrate video-to-video translation with a vocabulary of 500 words.
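
The snippet below sketches the character-encoding idea: each target word becomes a fixed-length sequence of character indices (rather than a per-word one-hot domain label), so the conditioning no longer grows with vocabulary size. The alphabet, padding scheme and maximum length are illustrative assumptions.

```python
# Sketch of character-based word conditioning: a word is mapped to a fixed-length
# sequence of character indices whose one-hot encoding conditions the generator.
import string
import torch

ALPHABET = string.ascii_lowercase          # 26 letters; index 0 reserved for padding
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}
MAX_LEN = 12                               # longest word we expect to encode

def encode_word(word: str) -> torch.Tensor:
    """Map a word to a (MAX_LEN,) tensor of character indices, zero-padded."""
    idx = [CHAR_TO_IDX[c] for c in word.lower()[:MAX_LEN]]
    idx += [0] * (MAX_LEN - len(idx))
    return torch.tensor(idx)

# One-hot matrix (MAX_LEN, 27) that can be flattened and fed to the generator,
# e.g. tiled spatially or injected through conditional normalisation layers.
code = torch.nn.functional.one_hot(encode_word("tomorrow"), num_classes=len(ALPHABET) + 1)
print(code.shape)                          # torch.Size([12, 27])
```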

[arXiv]

imgs/1.png

Results:

School -> Tomorrow: imgs/11.gif
Changes -> Significant: imgs/12.gif