Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve
performance in noise. Since videos are harder to obtain than audio, the video
training data of AVSR models is usually limited to a few thousand hours. In
contrast, speech models such as Whisper are trained with hundreds of thousands
of hours of data, and thus learn a better speech-to-text decoder. This large
gap in training data motivates us to adapt Whisper to handle video inputs.
Inspired by Flamingo, which injects visual features into language models, we
propose Whisper-Flamingo, which integrates visual features into the Whisper
speech recognition and translation model with gated cross attention. Our
audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech
recognition and En-X translation for 6 languages in noisy conditions. Moreover,
Whisper-Flamingo is a versatile model and performs all of these tasks using one
set of parameters, while prior methods are trained separately on each language.
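
The gated cross attention mentioned above can be illustrated with a minimal sketch. It assumes a Flamingo-style gate, where the attended visual features are scaled by tanh of a learnable scalar before being added to the decoder states. The function names, the single attention head, and the absence of learned projections are simplifications chosen for illustration; they are not taken from the Whisper-Flamingo implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    The same vectors serve as keys and values (a simplification).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return out


def gated_cross_attention(decoder_states, visual_features, gate):
    """Residual update: h + tanh(gate) * CrossAttention(h, visual).

    `gate` stands in for a learnable scalar (assumption: Flamingo-style gating).
    With gate = 0 the layer is an identity mapping.
    """
    attended = cross_attention(decoder_states, visual_features)
    g = math.tanh(gate)
    return [
        [h + g * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(decoder_states, attended)
    ]


# Toy inputs: two decoder states and three visual feature vectors.
states = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
closed = gated_cross_attention(states, visual, gate=0.0)
opened = gated_cross_attention(states, visual, gate=1.0)
```

With the gate at zero the visual branch contributes nothing, so a model extended this way initially behaves like the audio-only model; the visual features are mixed in only as the gate moves away from zero.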

Whisper-Flamingo is an Audio-Visual Speech Recognition (AVSR) model that integrates visual features into Whisper with gated cross attention, improving speech recognition and translation in noisy conditions across multiple languages.

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

State-of-the-art anomalous sound detection systems often utilize angular
margin losses to learn suitable representations of acoustic data using an
auxiliary task, which usually is a supervised or self-supervised classification
task. The underlying idea is that, in order to solve this auxiliary task,
specific information about normal data needs to be captured in the learned
representations and that this information is also sufficient to differentiate
between normal and anomalous samples. Especially in noisy conditions,
discriminative models based on angular margin losses tend to significantly
outperform systems based on generative or one-class models. The goal of this
work is to investigate why using angular margin losses with auxiliary tasks
works well for detecting anomalous sounds. To this end, it is shown, both
theoretically and experimentally, that minimizing angular margin losses also
minimizes compactness loss while inherently preventing learning trivial
solutions. Furthermore, multiple experiments are conducted to show that using a
related classification task as an auxiliary task teaches the model to learn
representations suitable for detecting anomalous sounds in noisy conditions.
These experiments include performance evaluations, visualizations of the
embedding space with t-SNE, and visualizations of the input representations
with respect to the anomaly score using randomized input sampling for
explanation.
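
As a rough illustration of the kind of loss discussed above, the following sketch computes an ArcFace-style additive angular margin loss for a single embedding. The function names, the `scale` and `margin` values, and the toy class centers are assumptions made for illustration, not the configuration used in the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_margin_loss(embedding, centers, label, scale=16.0, margin=0.2):
    """Softmax cross-entropy on scaled cosines, with an additive angular
    margin applied to the angle of the true class (ArcFace-style).

    Simplification: the case angle + margin > pi is not handled specially.
    """
    e = normalize(embedding)
    logits = []
    for k, center in enumerate(centers):
        cos = sum(a * b for a, b in zip(e, normalize(center)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding errors
        if k == label:
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    # Log-sum-exp computed stably, then cross-entropy for the true class.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


# Hypothetical class centers for a two-class auxiliary task.
centers = [[1.0, 0.0], [0.0, 1.0]]
loss_close = angular_margin_loss([0.9, 0.1], centers, label=0)
loss_far = angular_margin_loss([0.5, 0.5], centers, label=0)
```

In this toy example the loss is much smaller for the embedding that lies close to the center of its own class than for the one halfway between the two centers, which matches the intuition that minimizing such a loss pulls embeddings of the same class together.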

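It is shown that minimizing angular margin losses also minimizes compactness loss while preventing trivial solutions, and that a related auxiliary classification task teaches the model representations suitable for detecting anomalous sounds in noisy conditions.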

Why do Angular Margin Losses work well for Semi-Supervised Anomalous Sound Detection?

Audio-only wake word spotting (WWS) is challenging under noisy
conditions due to environmental interference in signal transmission. In this
paper, we investigate the design of a compact audio-visual WWS system that
uses visual information to alleviate this degradation. Specifically, we first
encode the detected lips into fixed-size vectors with MobileNet and
concatenate them with acoustic features; the result is then passed to a
fusion network for WWS. However, an audio-visual model based on neural
networks has a large footprint and high computational complexity. To meet
the application requirements, we introduce a neural network pruning strategy
via the lottery ticket hypothesis in an iterative fine-tuning manner
(LTH-IF), applied to the single-modal and multi-modal models, respectively.
Tested on our in-house corpus for audio-visual WWS in a home TV scene, the
proposed audio-visual system achieves significant performance improvements over
the single-modality (audio-only or video-only) system under different noisy
conditions. Moreover, LTH-IF pruning can substantially reduce the network
parameters and computations with no degradation of WWS performance, leading to a potential
product solution for the TV wake-up scenario.
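
The pruning idea can be sketched as iterative magnitude pruning with fine-tuning between rounds. This is a generic illustration under stated assumptions, not the LTH-IF procedure itself: the function names, the per-round pruning fraction, and the `fine_tune` callback (which stands in for a real training step) are all hypothetical.

```python
def prune_by_magnitude(weights, mask, fraction):
    """Disable the given fraction of still-active weights with the smallest magnitude."""
    active = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(active) * fraction)
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = False
    return new_mask


def iterative_pruning(weights, rounds, fraction, fine_tune):
    """Alternate pruning and fine-tuning for a fixed number of rounds.

    `fine_tune(weights, mask)` is a hypothetical callback that should return
    updated weights while leaving pruned positions at zero.
    """
    mask = [True] * len(weights)
    for _ in range(rounds):
        mask = prune_by_magnitude(weights, mask, fraction)
        weights = [w if keep else 0.0 for w, keep in zip(weights, mask)]
        weights = fine_tune(weights, mask)
    return weights, mask


# Toy run: the "fine-tuning" step returns the weights unchanged.
initial = [0.5, -0.1, 0.8, 0.05, -0.9, 0.3, 0.2, -0.6]
pruned, final_mask = iterative_pruning(
    initial, rounds=2, fraction=0.5, fine_tune=lambda w, m: w
)
```

Pruning half of the remaining weights in each of two rounds leaves a quarter of the original weights active; in practice the callback would retrain the surviving weights so that accuracy can recover between rounds.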

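This paper proposes a compact audio-visual wake word spotting system that uses a neural network pruning strategy. The system encodes lip information with MobileNet and fuses it with acoustic features, significantly improving wake word spotting performance under different noisy conditions and offering a potential product solution for the TV wake-up scenario.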