The proposed diffusion procedure seeks the elimination of video information that is not relevant for joint audio-visual processing. The criterion used to determine if a certain part is relevant is
the synchrony between video motion and audio energy. Video parts whose motion is not coherent with the activity in the audio channel undergo homogeneous diffusion. As a result, spatio-temporal edges in these regions are progressively smoothed. Within a single frame, the intensity of the edges approaches that of their surroundings, and the same happens across frames. Thus, the temporal edges in non-relevant regions are iteratively smoothed and the motion that is not related to the soundtrack is reduced. In fact, by observing the video motion that remains after some iterations we can discover where our algorithm focuses its attention, that is, the likely location of the sound source in the image.
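The smoothing of temporal edges can be illustrated with a deliberately simplified sketch. It is not the proposed audio-visual procedure: it applies one explicit step of plain homogeneous (heat-equation) diffusion along the time axis of a single pixel, with no dependence on the audio channel. The function names, the step size of 0.25 and the replicated borders are assumptions made only for this illustration.

```python
def diffuse_in_time(signal, step=0.25):
    """One explicit heat-equation step along time (hypothetical helper).

    signal: intensities of one pixel over consecutive frames.
    step:   assumed step size; the explicit 1D scheme is stable for step <= 0.5.
    Borders are replicated, so the first and last samples see themselves
    as their missing neighbour.
    """
    n = len(signal)
    out = []
    for k in range(n):
        prev = signal[max(k - 1, 0)]
        nxt = signal[min(k + 1, n - 1)]
        out.append(signal[k] + step * (prev - 2 * signal[k] + nxt))
    return out


def temporal_variation(signal):
    """Sum of absolute frame-to-frame differences, a simple motion proxy."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


# A pixel that flashes once: a strong temporal edge.
pixel = [0, 0, 8, 0, 0]
smoothed = diffuse_in_time(pixel)

print(smoothed)                     # [0.0, 2.0, 4.0, 2.0, 0.0]
print(temporal_variation(pixel))    # 16
print(temporal_variation(smoothed)) # 8.0
```

In this toy case a single step halves the temporal variation of the pixel, which is the sense in which repeated diffusion reduces motion in regions that are left unprotected.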
In this section we first define the diffusion ratio α as a measure to quantify the efficiency of the proposed method in removing the video motion that is not related to the sounds in the audio channel. Next, we discuss the value of the parameter K that best suits our objective of keeping only the video information that is needed for joint audio-visual processing.
Let $L$ be a subset of the video domain Ω: $L \subset \Omega$. Then, the amount of motion $M$ in the video subset $L$ at iteration $n$ is defined as
$$M_L^n := \sum_{\{i,j,k\} \in L} \left| \delta^*_t v^n_{i,j,k} \right| , \tag{4.22}$$
where $|\delta^*_t v^n_{i,j,k}|$ is the absolute value of the temporal derivative approximation $\delta^*_t v$ defined in equation (4.14) at pixel coordinates $\{i, j, k\}$.
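The amount of motion in a subset can be sketched in a few lines of plain Python. The temporal derivative approximation of equation (4.14) is not reproduced in this section, so the sketch substitutes a simple forward difference in time as a stand-in. The indexing convention `video[k][i][j]` (frame `k`, row `i`, column `j`) and both function names are likewise assumptions.

```python
def temporal_difference(video, i, j, k):
    """Forward difference in time at voxel (i, j, k).

    Stand-in for the approximation of equation (4.14), which may differ.
    The last frame has no successor, so its difference is taken as 0.
    """
    if k + 1 >= len(video):
        return 0
    return video[k + 1][i][j] - video[k][i][j]


def amount_of_motion(video, subset):
    """Sum of |temporal difference| over the voxels (i, j, k) in subset."""
    return sum(abs(temporal_difference(video, i, j, k)) for (i, j, k) in subset)


# Three frames of 2 x 2 pixels, indexed as video[k][i][j].
video = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 0], [0, 2]],
]

# The whole video domain and a subset holding only the top-left pixel.
omega = [(i, j, k) for k in range(3) for i in range(2) for j in range(2)]
corner = [(0, 0, k) for k in range(3)]

print(amount_of_motion(video, omega))   # 3
print(amount_of_motion(video, corner))  # 1
```

Passing the subset as an explicit list of voxel coordinates keeps the code close to the summation over $\{i,j,k\} \in L$. A boolean mask would be the more usual choice for real video volumes.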
We define an audio-visual region of interest (ROI) as the subset of pixels in the video domain that are related to the soundtrack, and the complementary region ($\overline{\mathrm{ROI}}$) as the rest of the pixels in the video domain: $\mathrm{ROI} \cup \overline{\mathrm{ROI}} = \Omega$. Then, we can define the audio-visual diffusion ratio α at iteration $n$ as
$$\alpha^n = \left[ \frac{M^0_{\overline{\mathrm{ROI}}} \,/\, M^n_{\overline{\mathrm{ROI}}}}{M^0_{\mathrm{ROI}} \,/\, M^n_{\mathrm{ROI}}} \right]_{a_{ON}} , \tag{4.23}$$
where the value $M^0_{\mathrm{ROI}} / M^n_{\mathrm{ROI}}$ is the ratio between the amount of motion inside the region of interest at iterations 0 (original motion) and $n$, and $M^0_{\overline{\mathrm{ROI}}} / M^n_{\overline{\mathrm{ROI}}}$ is the same ratio computed outside this region of interest. Finally, $[\cdot]_{a_{ON}}$ indicates that only the frames where the audio channel is active ($a_{ON}$) are used in the computation of this ratio. The audio channel is considered active when
sounds are captured by the microphone and thus the normalized audio feature is large enough: a(t) > 0.1 with a(t) ∈ [0, 1]. To summarize, the audio-visual diffusion ratio α is a relative measure: it assesses the ability to attenuate the motion in parts of the video signal that are not related to the soundtrack by comparing it to the diffusion experienced in the audio-visual region of interest when sounds are present in the audio channel. Thus, the ratio α quantifies our efficiency only in time slots where sounds are present. We obtain α > 1 when our method favors regions associated with the soundtrack, α = 1 when the video motion is equally eliminated inside and outside the ROI, and α < 1 when the diffusion affects the ROI more than the rest of the video signal in non-silent periods. Note that obtaining α > 1 is an extremely challenging task, especially in sequences where the audio-related motion is less intense than the distracting motion.
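A minimal sketch of how the ratio α could be evaluated is given below. It follows the interpretation stated above: the attenuation outside the ROI is divided by the attenuation inside it, so that α > 1 when motion is removed more strongly outside the ROI. Several choices are assumptions of the sketch and not of the method: the ROI is a fixed set of pixel positions `(i, j)` shared by all frames, the temporal derivative is replaced by a forward difference, and the function names are hypothetical.

```python
def motion_in_region(video, pixels, frames):
    """Sum of |forward temporal difference| over the given pixels and frames.

    video is indexed as video[k][i][j]; the last frame has no successor
    and therefore contributes nothing.
    """
    last = len(video) - 1
    total = 0.0
    for k in frames:
        if k >= last:
            continue
        for (i, j) in pixels:
            total += abs(video[k + 1][i][j] - video[k][i][j])
    return total


def diffusion_ratio(video_0, video_n, roi, audio, threshold=0.1):
    """Audio-visual diffusion ratio between iteration 0 and iteration n.

    roi:   set of (i, j) positions assumed to be related to the soundtrack.
    audio: normalized audio feature a(t) in [0, 1], one value per frame.
    Only frames with a(t) > threshold (audio active) are used.
    Raises ZeroDivisionError if a region has no motion left; a real
    implementation would need to handle that case explicitly.
    """
    height, width = len(video_0[0]), len(video_0[0][0])
    outside = [(i, j) for i in range(height) for j in range(width)
               if (i, j) not in roi]
    active = [k for k, a in enumerate(audio) if a > threshold]

    ratio_in = (motion_in_region(video_0, roi, active)
                / motion_in_region(video_n, roi, active))
    ratio_out = (motion_in_region(video_0, outside, active)
                 / motion_in_region(video_n, outside, active))
    return ratio_out / ratio_in


# Toy sequence: 3 frames of 1 x 2 pixels; pixel (0, 0) is the ROI.
original = [[[0, 0]], [[4, 8]], [[0, 0]]]
diffused = [[[0, 0]], [[2, 2]], [[0, 0]]]
audio = [0.5, 0.5, 0.0]  # the third frame is silent and is ignored

print(diffusion_ratio(original, diffused, {(0, 0)}, audio))  # 2.0
```

In this toy sequence the motion inside the ROI is halved (from 8 to 4) while the motion outside it is divided by four (from 16 to 4), which gives α = 4 / 2 = 2. Swapping the roles of the two pixels gives α = 0.5.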
Let us now study the diffusion parameter K according to the quantitative efficiency measure α. A normalized audio-video synchrony sσ ∈ [0, 1] is used for this analysis. Figure 4.3 shows the
typical evolution through iterations of the diffusion ratio α when the audio-related video motion is similar in magnitude to the distracting motion [left] and its evolution when the distracting motion is much more intense and/or spread [right]. Each curve is obtained with a different value of the parameter K. The curves in this figure correspond to sequences in Figures 4.4 and 4.1 respectively,
[Figure 4.3: two plots of the diffusion ratio α (vertical axis) against the iteration number n (horizontal axis, 0 to 40). The α axis ranges from 1 to 1.8 in the left plot and from 1 to 1.3 in the right plot.]
Figure 4.3 – Evolution through iterations of the audio-visual diffusion ratio α when the motion related to the soundtrack has a similar [left] and a much smaller [right] magnitude than the distracting motion. The blue dash-dot, black solid and red dashed curves correspond to K = 0.05, 0.1 and 0.15 respectively.
(a) Original (b) Result K = 0.05 (c) Result K = 0.1 (d) Result K = 0.15
Figure 4.4 – Results after applying 30 iterations of the proposed audio-visual diffusion procedure to a video sequence, in terms of pixel intensity [top row] and variation or motion [bottom row], for different values of K. In this sequence a hand is playing a synthesizer while a rocking horse generates distracting motion.
which are taken from the state-of-the-art source localization work presented by Kidron et al. in [36]. In both cases a strong distracting motion is introduced by means of a rocking horse. In the first case, the magnitude of the audio-related video motion generated by a hand (ROI) playing a synthesizer is similar to that of the distracting motion, whereas in the second sequence the movements in the mouth region (ROI) are clearly less visible than those of the rocking horse. As expected, α is always above 1, and it reaches a larger value when the distracting motion and the audio-related video motion have similar intensity: α reaches a value of around 1.7 in the left plot (the less challenging case) but only 1.27 in the right one. When K = 0.05 (a small value), the diffusion process evolves slowly in moving regions because a lot of irrelevant motion is taken into account. Our method considers these very small 3D motion concentrations as possibly audio-related, so it takes time to eliminate them and, as a result, α increases slowly (see the blue dash-dot line in Figure 4.3 [left]).