could have access to a large amount of speech from the target
(for the speech synthesis module), to a reasonable amount
of aligned speech data for source and target (for the voice
mapping module), and to some test speech+video data from
the source (in order to test the complete system). We chose to
use the CMU ARCTIC databases as target speech [26]. These
databases were designed specifically for building unit-selection-based
TTS systems; they comprise eight voices, each speaking the same
1150 sentences, chosen to provide a phonetically balanced corpus.
We therefore decided to record a complementary audiovisual
database, for use as the source data.
The eNTERFACE06 ARCTIC database we have created is
composed of 199 sentences, spoken by one male speaker and
uniformly sampled from the CMU ARCTIC database [26]. For
each sentence, a .txt, an .avi, and a .wav file are available. The
.avi file contains 320x240-pixel images, at 30 frames per
second, of the speaker pronouncing the sentence, together with
the audio track (Fs = 44100 Hz). The .wav file contains the same
sound recording as the .avi file, but resampled to 16 kHz.
The database was recorded using a standard mini-DV digital
video camera. The speech signal was captured with a high-quality
microphone designed specifically for speech recording, positioned
roughly 30 cm below the subject's mouth, outside the camera view.
The background consisted of a monochromatic dark green
panel covering the entire area behind the subject, to allow
easier face detection and tracking. Natural lighting was used,
so slight illumination variations can be observed across the
files (Fig. 2).
The recordings were made using the NannyRecord tool
provided by UPC Barcelona, which allows the speaker to hear
the sentence to be pronounced twice before recording it.
The source speaker used for the recordings was the "awb"
speaker of CMU ARCTIC. The eNTERFACE06 ARCTIC speaker
was asked to keep the prosody (timing, pitch movements) of the
source, while using his own acoustic realization of the phonemes
and, of course, his own voice (i.e., not trying to imitate the
target voice). This particular setting made it possible for the
eNTERFACE06 ARCTIC recordings to be closely aligned with the
corresponding CMU ARCTIC recordings.
Following the standard approach, the parallel database was
further divided into three subsets:
• development set, consisting of 188 sentences (of the total 198), used during the training phase of the different algorithms (alignment, voice mapping, etc.),
• validation set, used to avoid overfitting during the refinement of the model parameters (number of clusters, GMMs, etc.),
• evaluation set, used to obtain the results of the multimodal conversion.
It is worth mentioning that this last subset (evaluation) was not
used at any stage of training. The results we have obtained can
therefore be expected to generalize to new data.
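As an illustration, a deterministic split along these lines could be set up as in the sketch below; the directory layout and the sizes of the validation and evaluation sets (only the 188-sentence development set is specified above) are assumptions made for the sake of the example.

```python
import random
from pathlib import Path

# Hypothetical location of the eNTERFACE06 ARCTIC wav files (assumption).
ids = sorted(p.stem for p in Path("enterface06_arctic/wav").glob("*.wav"))

random.seed(0)           # fixed seed so the split is reproducible
random.shuffle(ids)

dev_set  = ids[:188]     # development: training of alignment, mapping, etc.
val_set  = ids[188:193]  # validation: tuning of model sizes (size assumed)
eval_set = ids[193:]     # evaluation: never used during training
```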
[Figure 10: block diagram of the conversion chain: the source MFCCs X and the target MFCCs Y are aligned by DTW; the mapping block (a. GMM, b. CVQ) and the frame selection block (a. DP, b. by example) produce the selected frames Y' and their {file, t} indices, which WSOLA turns into the output signal y'[n].]
Fig. 10. Block diagram showing the different alternatives for the mapping
estimation and for the frame selection.
The features computed from the signals were either 13 or 20
MFCCs and their first and second derivatives; the frame shift
was chosen to be 8 ms. For the computation of the derivatives
we relied on an implementation by Dan Ellis², and the other
signal processing steps were taken from the MA toolbox
by Elias Pampalk³. We also computed estimates of the
fundamental frequency using the Praat software⁴. For the
fundamental frequency and the signal energy we also provide
first and second derivatives, so that the full set of features for
each frame was: 20 MFCCs, 20 ∆MFCCs, 20 ∆∆MFCCs, f0, ∆f0,
∆∆f0, energy, ∆energy and ∆∆energy, resulting in a vector of
66 dimensions.
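The sketch below reproduces this front end with off-the-shelf tools; note that the paper used Dan Ellis's rastamat deltas, the MA toolbox and Praat, so the librosa calls and the pyin-based pitch track are substitutions chosen to keep the example self-contained, not the original implementation.

```python
import numpy as np
import librosa

SR = 16000                 # the .wav files are sampled at 16 kHz
HOP = int(0.008 * SR)      # 8 ms frame shift, as in the text
N_MFCC = 20

def frame_features(wav_path):
    """Per-frame 66-dim vectors: 20 MFCCs + f0 + energy, each with delta and delta-delta."""
    y, sr = librosa.load(wav_path, sr=SR)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC, hop_length=HOP)
    n = mfcc.shape[1]

    # f0 track (Praat in the paper; pyin is a stand-in here).
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=HOP)
    f0 = np.nan_to_num(f0)[np.newaxis, :n]

    # Log frame energy.
    energy = np.log(librosa.feature.rms(y=y, hop_length=HOP)[:, :n] + 1e-10)

    # Stack each stream with its first and second derivatives: 60 + 3 + 3 = 66 dims.
    streams = []
    for s in (mfcc, f0, energy):
        streams += [s, librosa.feature.delta(s), librosa.feature.delta(s, order=2)]
    return np.vstack(streams).T        # shape (n_frames, 66)
```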
Figure 10 shows the different alternatives we have implemented
for each of the blocks.
A. Alignment and voice mapping
After the first alignment, the Euclidean distance (L2 norm)
between the source and the target MFCCs was 1210.23. The
GMM-based mapping was then applied, and the same norm
was measured between the transformed data and the target data:
604.35 (an improvement of 50.08%). The source data were then
transformed and a new alignment was performed. Using the
newly aligned data, a new mapping function was estimated,
the source data were transformed again, and the L2 norm
between the transformed data and the target data was measured
again (397.63). The process was repeated, and a new measurement
of the performance of the mapping function gave an L2 norm
of 378.18.
Without iteration, we already achieve a 50% reduction of the
distance between the source and the target data. One could
expect this percentage to be higher; the residual distance also
gives an indication of the difference between the two speakers
and/or of the differences in the recording conditions. After a few
iterations we arrive at a stable mapping function, with an
improvement of 68.76% over the initial distance between the
source and the target data. Figure 11 shows the reduction of the
distortion obtained by iterating the algorithm.
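A compact sketch of this alternating procedure is given below. It uses a plain DTW and, instead of the full joint-density GMM regression, a simplified posterior-weighted mapping (a GMM is fitted on the source frames and each component is associated with the mean of its aligned target frames); the exact update used by the authors, in particular whether the mapping is re-estimated from the original or the transformed source frames, is not spelled out above, so those choices are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.mixture import GaussianMixture

def dtw_path(X, Y):
    """Plain DTW over Euclidean frame distances; returns aligned (i, j) pairs."""
    D = cdist(X, Y)
    C = np.full((len(X) + 1, len(Y) + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, len(X) + 1):
        for j in range(1, len(Y) + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    path, i, j = [], len(X), len(Y)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin([C[i - 1, j - 1], C[i - 1, j], C[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return path[::-1]

def fit_soft_mapping(X_aligned, Y_aligned, n_components=128):
    """Simplified GMM mapping: posteriors of a 128-component source-space GMM
    weight the mean aligned target frame of each component (no cross-covariance)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          reg_covar=1e-3).fit(X_aligned)
    post = gmm.predict_proba(X_aligned)                       # (T, K)
    y_means = (post.T @ Y_aligned) / (post.sum(axis=0)[:, None] + 1e-10)
    return lambda X: gmm.predict_proba(X) @ y_means

def iterate_alignment_and_mapping(X, Y, n_iter=3):
    """Alternate DTW alignment and mapping estimation, tracking the L2 distance."""
    X_mapped, transform = X, None
    for it in range(n_iter):
        pairs = dtw_path(X_mapped, Y)
        xi = [i for i, _ in pairs]
        yi = [j for _, j in pairs]
        dist = np.linalg.norm(X_mapped[xi] - Y[yi])
        print(f"iteration {it}: L2 distance = {dist:.2f}")
        transform = fit_soft_mapping(X[xi], Y[yi])
        X_mapped = transform(X)
    return transform
```

Here X and Y stand for the source and target MFCC matrices; pooled over the whole development set, the 128-component source-space GMM mentioned in the next paragraph becomes a reasonable choice.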
B. On clustering parameters
The incremental alignment procedure used a Gaussian Mixture
Model of the source space with 128 components. This
parameter remained fixed throughout the experiments. It
influenced the construction of the aligned data as well as
the mapping from source to target space, as this mapping is
² http://labrosa.ee.columbia.edu/matlab/rastamat/deltas.m
³ http://www.ofai.at/~elias.pampalk/ma/
⁴ http://www.fon.hum.uva.nl/praat/