1 Introduction
The goal of this project is to replace the face in a target video with the face from a source video while retaining the target person's expressions. To achieve this, we used feature detection and optical flow to identify facial features and follow how they move across frames, and we used warping and gradient domain blending to retain the target's expressions on the source face and smoothly integrate the source face onto the target. Using open source algorithms (since we chose track 2), we were able to achieve face swapping in a video.
Fig 1. Visual overview of project aims
2 Methodology
First, we collected the first frame of the target video and the source frame that we want to warp
onto the target, and we normalized both using CLAHE (contrast limited adaptive histogram
equalization) to increase the accuracy of feature detection in variable lighting conditions [1].
Fig 2. Results of various histogram equalizations
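The normalization step can be sketched as follows. One common arrangement is to apply CLAHE to the lightness channel of a LAB conversion so that only contrast, not color, is altered; the clip limit and tile size below are illustrative assumptions rather than our tuned values.

```python
import cv2

def normalize_contrast(frame, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Equalize contrast with CLAHE on the lightness channel of a BGR frame."""
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    # Contrast limited adaptive histogram equalization on lightness only.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```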
Next, we detected 68 facial landmarks in both the source and target frames using the Dlib Python API [2]. Since the relative position of each facial landmark is consistent across rounds of Dlib facial landmark detection [6], we were able to use these points to warp the source landmark locations onto the target landmark locations. With this, we retained the target's expressions on the source's face. Using the outer edge of the facial landmarks, we created the mask for gradient domain blending with cv2.convexHull. We used Delaunay triangulation [9] and cv2.warpAffine to warp the face regions [5, 8]. From there, we used cv2.seamlessClone to blend the source face onto the target [3].
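A minimal sketch of the landmark detection and mask creation, assuming the standard pretrained shape_predictor_68_face_landmarks.dat model; the helper names here are illustrative, not our exact code.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard pretrained 68-landmark model; the file path is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame):
    """Return the 68 Dlib landmarks of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.int32)

def hull_mask(frame_shape, landmarks):
    """Binary mask over the convex hull of the landmarks, used for blending."""
    mask = np.zeros(frame_shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(landmarks), 255)
    return mask
```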
Fig 3. Facial landmarks used to warp source image and create mask for blending
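The warp and blend steps can be sketched as follows: Delaunay triangulation comes from SciPy [9], each triangle is mapped with cv2.warpAffine [5, 8], and cv2.seamlessClone [3] pastes the result. This is an illustrative reconstruction of the pipeline, not a listing of our exact implementation.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def warp_face(source, src_pts, dst_pts, out_shape):
    """Warp the source face so its landmarks move from src_pts to dst_pts."""
    warped = np.zeros(out_shape, dtype=source.dtype)
    # Triangulate on the destination points so triangles tile the target face.
    for tri in Delaunay(dst_pts).simplices:
        s_tri = np.float32(src_pts[tri])
        d_tri = np.float32(dst_pts[tri])
        sx, sy, sw, sh = cv2.boundingRect(s_tri)
        dx, dy, dw, dh = cv2.boundingRect(d_tri)
        # Affine map between the two triangles in local (cropped) coordinates.
        M = cv2.getAffineTransform(np.float32(s_tri - [sx, sy]),
                                   np.float32(d_tri - [dx, dy]))
        patch = cv2.warpAffine(source[sy:sy + sh, sx:sx + sw], M, (dw, dh),
                               flags=cv2.INTER_LINEAR,
                               borderMode=cv2.BORDER_REFLECT_101)
        # Copy only the pixels inside the destination triangle.
        tri_mask = np.zeros((dh, dw), dtype=np.uint8)
        cv2.fillConvexPoly(tri_mask, np.int32(d_tri - [dx, dy]), 255)
        roi = warped[dy:dy + dh, dx:dx + dw]
        roi[tri_mask > 0] = patch[tri_mask > 0]
    return warped

def blend(warped_source, target, landmarks):
    """Seamlessly clone the warped face onto the target frame."""
    hull = cv2.convexHull(landmarks)
    mask = np.zeros(target.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)
    x, y, w, h = cv2.boundingRect(hull)
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(warped_source, target, mask, center, cv2.NORMAL_CLONE)
```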
For each following frame, we normalized contrast and detected facial landmarks on the new target frame. Simultaneously, using KLT tracking, we determined the displacements of the landmarks between the previous frame and the current frame with cv2.calcOpticalFlowPyrLK [4]. With the positions of each of the 68 facial landmarks according to both Dlib and KLT, we calculated a weighted average that produces a more accurate location for each feature point for warping (Fig 6). With the new positions of the facial landmarks, we warped the source points to the target points and blended as described above.
Fig 4. Pipeline for processing each frame for face swapping
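The tracking step can be sketched as follows; the window size, pyramid depth, and termination criteria shown are illustrative assumptions rather than our tuned values.

```python
import cv2
import numpy as np

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_landmarks(prev_gray, curr_gray, prev_pts):
    """Propagate the previous frame's landmarks with pyramidal Lucas-Kanade."""
    p0 = np.float32(prev_pts).reshape(-1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0,
                                                None, **lk_params)
    # status[i] == 1 where the flow for landmark i was found.
    return p1.reshape(-1, 2), status.ravel().astype(bool)
```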
2.1 Identified and Addressed Challenges
Three challenges we expected to face were: (1) fast-moving features; (2) brightness changes; and (3) side profiles. To keep track of fast-moving features, we incorporated both Dlib and KLT tracking into our weighted-average motion compensation to smooth the positioning of features over time. This weighted average also addressed the challenge of side profiles. As seen in Fig 5, Dlib alone on a side-profile image placed landmarks to the right of where the face actually is. This is because Dlib always returns all 68 landmarks, even when some are occluded. The weighted average allots greater weight to KLT, which correctly determines that the feature point has shifted relative to the rest of the face. For brightness changes, we ran contrast equalization with CLAHE on each frame to eliminate some shadows and dim regions in an image before running facial landmark detection.
Fig 5. Dlib feature detection vs. our algorithm performance in side profile
2.2 Motion Compensation
For motion compensation, we used a weighted average of two estimates of each landmark's position: the position calculated by KLT from the landmark's position in the prior frame, and the position identified by Dlib in the current frame. For each feature, in the case that both algorithms successfully detected the feature in the target frame, we computed a weighted average in which the KLT result had more influence (80%). We chose this weighting because Dlib reports all 68 landmarks even for a side profile, yet a side profile often hides portions of the face, so Dlib would incorrectly place peripheral landmarks that do not in fact appear in the image.
In the case that only Dlib identified a given feature, we gave 80% of the weight to the position of the feature in the previous frame. In the case that only KLT detected a feature, we gave 20% to the previous frame. If neither KLT nor Dlib detected a given landmark, the position of the feature remained the same as in the previous frame. However, if 45 or more of the 68 features fell under this last criterion for a given frame, the algorithm did not blend a face onto the target for that frame.
Fig 6. Pseudocode for motion compensation weighted average
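The rule in Fig 6 can be sketched as follows; klt_ok and dlib_ok are per-landmark validity flags (our naming), and we state the 45-of-68 condition as a threshold of 45 or more lost landmarks.

```python
import numpy as np

KLT_WEIGHT = 0.8  # KLT dominates when both detectors report the point

def fuse_landmarks(prev_pts, klt_pts, klt_ok, dlib_pts, dlib_ok):
    """Combine KLT and Dlib estimates into one position per landmark."""
    fused = prev_pts.astype(np.float32).copy()
    missed = 0
    for i in range(68):
        if klt_ok[i] and dlib_ok[i]:
            # Both succeeded: weighted average, KLT gets 80%.
            fused[i] = KLT_WEIGHT * klt_pts[i] + (1 - KLT_WEIGHT) * dlib_pts[i]
        elif dlib_ok[i]:
            # Only Dlib: the previous frame's position gets 80% of the weight.
            fused[i] = 0.8 * prev_pts[i] + 0.2 * dlib_pts[i]
        elif klt_ok[i]:
            # Only KLT: the previous frame's position gets 20% of the weight.
            fused[i] = 0.2 * prev_pts[i] + 0.8 * klt_pts[i]
        else:
            missed += 1  # neither detected: keep the previous position
    # If 45 or more of the 68 landmarks were lost, skip blending this frame.
    skip_blend = missed >= 45
    return fused, skip_blend
```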
3 Results
Please follow this link to view videos produced with our algorithm.
https://drive.google.com/drive/folders/1AHj37F7oJQyf7A9dlYWHfrNql7Tdoj6-?usp=sharing
4 Future Work
To improve results, future work should use head pose estimation to identify the side profiles of all frames in the source and target videos. From there, a target frame with a particular head pose could be matched to a source frame with a similar head pose, which would improve side-profile warps and blends. Additionally, feature detection in shadows and dim lighting conditions could be improved. One way to do this is to convert the image to a different color space and remove one, or all but one, of the color channels before running feature detection. It is possible that a large portion of a shadow would be eliminated by removing channels, helping feature detection locate landmarks. Lastly, our algorithm performed poorly when warping a closed mouth with no visible teeth into an open mouth, and when compressing an open mouth with visible teeth into a closed one. To improve this, we could use a convolutional neural network trained for teeth detection to identify a source frame that has the same degree of teeth exposure as the target frame [7]; that frame could then be used for warping. Further, we could find the target frame that most closely matches the source using a model-fitting approach: compute the Euclidean distances between all 68 feature points, fit a line to these distances, pick an inlier threshold, and choose the frame with the maximum number of inliers before applying the methodology above. A sketch of the distance-based matching step follows.
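As an illustration of the proposed matching step, a minimal sketch of scoring candidate frames by total Euclidean landmark distance, assuming the landmark sets have already been aligned (e.g., translated to a common centroid); the function name is hypothetical.

```python
import numpy as np

def best_matching_frame(target_pts, candidate_pts_per_frame):
    """Index of the frame whose 68 landmarks are closest to the target's."""
    dists = [np.linalg.norm(target_pts - pts, axis=1).sum()
             for pts in candidate_pts_per_frame]
    return int(np.argmin(dists))
```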
5 References
1. https://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html
2. https://pypi.python.org/pypi/dlib
3. https://docs.opencv.org/3.0-beta/modules/photo/doc/cloning.html
4. https://docs.opencv.org/3.3.1/d7/d8b/tutorial_py_lucas_kanade.html
5. https://www.learnopencv.com/
6. https://www.pyimagesearch.com/
7. https://juanzdev.github.io/TeethClassifier/
8. https://www.learnopencv.com/warp-one-triangle-to-another-using-opencv-c-python/
9. https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.Delaunay.html