3-D Tracking of MPEG-4 Facial Features in Stereo Videos
Given
1. A model of the head with MPEG-4 facial features identified on it.
2. The pose of the speaker's face in the first video frame.
3. A synchronized stereo video of the speaker.
Required
Location of the facial features in the video frames, and
tracking of the features in 3-D.
Approach
Deform the head model to match the face recovered in each
video frame. Then, map features of the model to those of the face and
locate the positions of the features in 3-D.
Subtasks
1. Construct a generic head model with facial features marked on it.
2. Construct a model of the speaker's face.
3. Deform the generic head model to match the speaker's face.
4. Calibrate the stereo camera system.
5. Track the face of the speaker in 3-D.
A Four-Head Laser Range Scanner
A bust of Alexander the Great is used as the generic model of the head.
The bust is scanned with a four-head laser range scanner.

Fig. 1. Different views of the bust of Alexander the Great used as our
generic head model.
Fig. 2. The structure of our four-head laser range
scanner (left). The structure of each head of the scanner (right).
Scanning Method
- The scan is made from four sides of the head simultaneously, generating
cross-sections of the head with equispaced parallel laser planes.
- The object is reconstructed by stacking the cross-sections.
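The stacking step can be sketched as follows. This is a minimal illustration, not the scanner's actual software: the input format (one set of integer (x, y) contour points per laser plane, ordered bottom to top) and the function name are assumptions made for the example.

```python
def stack_cross_sections(slices):
    """Stack equally spaced 2-D cross-sections into a 3-D voxel volume.

    slices: list of sets of integer (x, y) contour points, one set per
            laser plane, ordered bottom to top (hypothetical input format).
    Returns volume[z][y][x], with 1 on the surface and 0 elsewhere.
    """
    width = 1 + max(x for contour in slices for x, _ in contour)
    height = 1 + max(y for contour in slices for _, y in contour)
    volume = [[[0] * width for _ in range(height)] for _ in slices]
    for z, contour in enumerate(slices):
        for x, y in contour:
            volume[z][y][x] = 1  # mark a surface voxel on plane z
    return volume
```

Because the laser planes are equispaced, the slice index can serve directly as the z coordinate, up to a constant scale factor.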

Fig. 3. One set of images obtained simultaneously by the four
cameras, recording the head from all sides.

Fig. 4. Reconstructed model of the bust of Alexander the Great.
A Generic Model of the Head
- The model obtained by scanning the bust of Alexander the Great is used
as the generic head model.
- This model is represented in volumetric form. Voxels on the surface
of the head are set to 1, and all other voxels are set to 0.

Fig. 5. The generic head model shown in volumetric
form (left) and surface form (right).
Selecting the MPEG-4 Feature Points
MPEG-4 feature points are marked on the bust of Alexander the Great
with color stickers. Identifying the stickers in the captured texture
image gives the corresponding 3-D coordinates on the head model. Voxels
of the generic model that correspond to MPEG-4 features are set to 2.
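The labeling scheme (0 for empty, 1 for surface, 2 for feature) can be illustrated with a small sketch. It assumes that each sticker's recovered 3-D position is snapped to the nearest surface voxel, which the text does not state, and it stores only nonzero voxels in a dictionary; both choices and the function name are assumptions for the example.

```python
def mark_features(surface_voxels, sticker_points):
    """Label surface voxels 1 and feature voxels 2 (absent voxels count as 0).

    surface_voxels: list of integer (x, y, z) voxels on the head surface.
    sticker_points: 3-D sticker positions (hypothetical input); each one is
                    snapped to its nearest surface voxel, an assumed rule.
    """
    labels = {voxel: 1 for voxel in surface_voxels}
    for point in sticker_points:
        nearest = min(
            surface_voxels,
            key=lambda v: sum((a - b) ** 2 for a, b in zip(v, point)),
        )
        labels[nearest] = 2  # this voxel carries an MPEG-4 feature point
    return labels
```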

Fig. 6. MPEG-4 facial features marked on the bust of Alexander the Great.
Constructing a Model of the Speaker's Face
- A model of the speaker's face is needed to locate and track facial
features in 3-D.
- The model is constructed by sweeping an eye-safe laser over the
speaker's face and processing the captured stereo images.
- The model is represented in volumetric form: voxels on the surface of
the face are set to 1, and all other voxels in the volume are set to 0.

Fig. 7. An example face (left) and a volumetric model
of the face (right).
Identifying MPEG-4 Features on the Speaker's Face
- This is achieved by elastically deforming the generic head model to
match the speaker's face model and then locating the feature points of
the generic model in the speaker model.
- The elastic-matching process consists of two steps. 1) The generic and
speaker models are treated as rigid bodies and aligned by chamfer
matching. 2) The generic model is deformed in an energy-minimizing
process to match the speaker model locally. From the resulting
correspondences and the known positions of the feature points in the
generic head, the feature points in the speaker model are located.
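The rigid step can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a city-block distance transform computed by breadth-first search rather than a weighted chamfer mask, and it searches only over integer translations, whereas a full rigid match would also search over rotations. All function names are hypothetical.

```python
from collections import deque
from itertools import product

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def distance_transform(shape, surface):
    """City-block distance from every voxel to the nearest surface voxel."""
    dist = {v: 0 for v in surface}
    queue = deque(dist)
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in NEIGHBORS:
            n = (x + dx, y + dy, z + dz)
            inside = all(0 <= c < s for c, s in zip(n, shape))
            if inside and n not in dist:
                dist[n] = dist[(x, y, z)] + 1
                queue.append(n)
    return dist

def chamfer_score(dist, model, shift, penalty):
    """Mean distance from the shifted model voxels to the target surface."""
    total = 0
    for x, y, z in model:
        moved = (x + shift[0], y + shift[1], z + shift[2])
        total += dist.get(moved, penalty)  # penalty if shifted out of the volume
    return total / len(model)

def best_translation(shape, target, model, max_shift):
    """Exhaustively search integer shifts for the lowest chamfer score."""
    dist = distance_transform(shape, target)
    penalty = sum(shape)
    shifts = product(range(-max_shift, max_shift + 1), repeat=3)
    return min(shifts, key=lambda s: chamfer_score(dist, model, s, penalty))
```

The distance transform is computed once for the speaker model, so scoring each candidate pose costs only one lookup per generic-model voxel.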
Matching of Deformable Objects
- Starting from the pose determined by chamfer matching, define the
internal energy of each voxel as the maximum distance between that voxel
and the voxels in its neighborhood on the same surface. Define the
external energy as the distance from a voxel on one surface to the
closest voxel on the other surface.
- Iteratively revise the voxel positions of the generic model to
minimize the matching energy. The generic model at minimum energy
represents the deformation of the generic model that matches the
speaker model.
- For each voxel belonging to the speaker model, if there is a coinciding
voxel in the deformed generic model, a unique correspondence is
obtained. For the other voxels in the speaker model, the closest voxels
of the deformed generic model are taken as correspondences. (As the
internal energy is decreased, the number of such voxels decreases.)
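A minimal sketch of the energy terms and a greedy minimization is given below. Several choices here are assumptions and are not taken from the text: the weight alpha between the two energies, the unit-step greedy update, the fixed neighbor lists, and the rule that two model voxels may not occupy the same cell (added so the model does not collapse onto itself).

```python
import math

STEPS = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def internal_energy(i, points, neighbors):
    """Maximum distance from voxel i to its neighbors on the same surface."""
    return max(math.dist(points[i], points[j]) for j in neighbors[i])

def external_energy(p, target):
    """Distance from p to the closest voxel of the other surface."""
    return min(math.dist(p, q) for q in target)

def deform(points, neighbors, target, alpha=0.5, max_iters=50):
    """Greedily move each model voxel by unit steps while its energy drops."""
    points = [tuple(p) for p in points]
    for _ in range(max_iters):
        moved = False
        for i in range(len(points)):
            others = set(points[:i] + points[i + 1:])
            start = points[i]
            best_pos, best_e = start, None
            for step in STEPS:
                cand = tuple(a + b for a, b in zip(start, step))
                if cand in others:
                    continue  # assumed rule: one model voxel per cell
                points[i] = cand
                e = alpha * internal_energy(i, points, neighbors)
                e += external_energy(cand, target)
                if best_e is None or e < best_e:
                    best_pos, best_e = cand, e
            points[i] = best_pos
            moved = moved or best_pos != start
        if not moved:
            break  # no voxel could lower its energy
    return points

def correspondences(speaker, deformed):
    """Map each speaker voxel to the index of the closest deformed voxel."""
    return {
        q: min(range(len(deformed)), key=lambda i: math.dist(q, deformed[i]))
        for q in speaker
    }
```

A coinciding voxel has distance zero, so it is selected automatically as the closest one. The returned indices refer back to the generic model, which is where the feature labels live.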
Stereo Camera Setup
- Each camera captures 480x640 one-byte-per-pixel images at 60 frames per
second. The RGB colors are interleaved so that only one of R, G, or B is
recorded at each pixel. The full RGB values at each pixel are recovered
by interpolation.
- The cameras are synchronized so that stereo images are captured
simultaneously.
- Images are first stored in main memory and then saved to optical disks.
They are read back from the optical disks and processed later.
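The color-recovery step can be sketched as below. The text does not give the sensor's color layout or the interpolation used, so this example assumes an RGGB mosaic and fills each missing channel with the average of that channel's samples in the 3x3 neighborhood. It expects an image of at least 2x2 pixels.

```python
def bayer_channel(row, col):
    """Channel recorded at (row, col) under an assumed RGGB layout."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def demosaic(raw):
    """Recover an (R, G, B) tuple per pixel from a one-value-per-pixel image."""
    rows, cols = len(raw), len(raw[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sums = {"R": 0, "G": 0, "B": 0}
            counts = {"R": 0, "G": 0, "B": 0}
            # Accumulate each channel's samples in the clipped 3x3 window.
            for rr in range(max(0, r - 1), min(rows, r + 2)):
                for cc in range(max(0, c - 1), min(cols, c + 2)):
                    ch = bayer_channel(rr, cc)
                    sums[ch] += raw[rr][cc]
                    counts[ch] += 1
            own = bayer_channel(r, c)
            sums[own], counts[own] = raw[r][c], 1  # keep the recorded sample
            out[r][c] = tuple(sums[ch] // counts[ch] for ch in "RGB")
    return out
```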

Fig. 10. Organization of the stereo system.

Fig. 11. A raw image obtained by one of the cameras (left). Image after
recovering color (right).
Stereo Camera Calibration
- To relate the coordinates of points in the stereo images to points in
3-D, the cameras must be calibrated.
- For calibration, the cameras are rotated inward so that their optical
axes intersect at about the distance where the speaker will appear in
front of the cameras.
- The coordinates of four non-coplanar points in the scene and their
correspondences in the images are used to compute the calibration
parameters.
- Lens distortion is ignored. The other potential source of error in the
3-D measurements is inaccuracy in the coordinates of the 3-D points used
to determine the calibration parameters.
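Four point correspondences are exactly enough to determine an affine camera, in which each image coordinate is a linear function of (X, Y, Z) plus an offset. The text does not name the camera model, so the sketch below assumes an affine one; the function names are hypothetical. Under this assumption, non-coplanarity of the four points is what makes the 4x4 linear system solvable.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: are the points coplanar?")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def calibrate_affine(points_3d, points_2d):
    """Fit u = a.(X, Y, Z, 1) and v = b.(X, Y, Z, 1) from four correspondences."""
    A = [[x, y, z, 1.0] for x, y, z in points_3d]
    row_u = solve(A, [u for u, _ in points_2d])
    row_v = solve(A, [v for _, v in points_2d])
    return row_u, row_v

def project(camera, point):
    """Project a 3-D point with the fitted affine camera."""
    x, y, z = point
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in camera)
```

The same fit is run once per camera. Since the fit passes exactly through the four reference points, any error in their measured 3-D coordinates goes directly into the camera parameters.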

Fig. 12. The setup used to calibrate the stereo camera system.
Determining the Pose of the Speaker's Face
- Right before the speaker begins to speak, an eye-safe laser line is
swept over the speaker's face. The laser line is kept vertical during
the sweep to maximize correspondence accuracy.
- The 3-D facial points obtained by matching laser points in the stereo
images are entered into a volumetric image, giving a volumetric
representation of the face right before speaking. This volumetric image
contains information about the pose of the speaker's face.
- This volumetric image is matched with the head model of the speaker,
first via chamfer matching and then via elastic matching, to locate the
head of the speaker. This matching makes it possible to identify the
MPEG-4 features on the speaker's head.
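Matching laser points between the two images can be sketched as follows. The example assumes that the cameras are displaced horizontally so that matching points fall on roughly the same image row, and that the laser is the brightest pixel in each row it crosses; neither assumption is stated in the text, and the threshold value is arbitrary. Under these assumptions a vertical line crosses each row once, so each row yields at most one unambiguous match.

```python
def laser_column(row, threshold):
    """Column of the brightest pixel in a row, or None if no laser is visible."""
    col = max(range(len(row)), key=lambda c: row[c])
    return col if row[col] >= threshold else None

def match_laser_points(left, right, threshold=200):
    """Pair laser points row by row in a synchronized stereo image pair.

    left, right: grayscale images as lists of rows (hypothetical input).
    Returns a list of ((row, col_left), (row, col_right)) matches.
    """
    matches = []
    for r, (left_row, right_row) in enumerate(zip(left, right)):
        cl = laser_column(left_row, threshold)
        cr = laser_column(right_row, threshold)
        if cl is not None and cr is not None:
            matches.append(((r, cl), (r, cr)))
    return matches
```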

Fig. 13. A laser line is swept over the
speaker's face while capturing the face in stereo.

Fig. 14. Corresponding points on a laser line in a
stereo pair.

Fig. 15. From stereo correspondence, 3-D coordinates
of points on the laser lines are determined.
Determining the Pose and Facial Expression of the Speaker at Each Video
Frame
- Using facial points from three consecutive frames, the positions of the
points in the next frame are estimated by quadratic estimators.
- In the neighborhood of each estimated point, we search for the
corresponding points in the images. The estimated positions are refined
using the correspondences.
- The search uses information about the face model. If a correspondence
is inconsistent with the local shape of the model face, it is discarded
and the search is repeated until the result is consistent with the local
shape of the face.
- The correspondence between the face in a video frame and the face in the
preceding frame is used to locate the MPEG-4 features on the face.
- Both shape and color information are used when finding correspondences.
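The prediction step can be illustrated with a short sketch. The text does not give the exact form of the quadratic estimators, so this example assumes the simplest one: fit a quadratic through three equally spaced samples of each coordinate and evaluate it one frame ahead, which reduces to p3 = p0 - 3 p1 + 3 p2.

```python
def predict_next(p0, p1, p2):
    """Extrapolate a point one frame ahead from three consecutive frames.

    p0, p1, p2: positions in the three preceding frames, oldest first.
    Assumes equally spaced frames and a per-coordinate quadratic motion model.
    """
    return tuple(a - 3 * b + 3 * c for a, b, c in zip(p0, p1, p2))

# Usage: x follows t^2 (0, 1, 4), y moves linearly (0, 2, 4), z is fixed.
predicted = predict_next((0, 0, 0), (1, 2, 0), (4, 4, 0))
```

The prediction only centers the search window. The refined position still comes from the correspondences found in the images.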