This research addresses the challenge of training an ASR model for
personalized voices with minimal data. Utilizing just 14 minutes of custom
audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to
create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual
Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this
dataset. The developed web-based GUI efficiently transcribes and translates
input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns
the translated text with the video timeline, delivering an accessible solution
for transcribing and translating multilingual video content spoken in a
personalized voice.

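In short: with minimal data, retrieval-based voice conversion and self-supervised representations are used to train a personalized speech recognition model, yielding an accessible solution for transcribing and translating multilingual video content.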

Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART
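
The abstract above says the translated text is aligned with the video timeline but does not say how. As a minimal sketch of one way this could work, assume the transcription step yields timed segments and that subtitles are written in SubRip (SRT) form. The `Segment` type, the `translate` callback (standing in for an mBART call), and the choice of SRT are all illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Segment:
    """Hypothetical timed transcript segment (not from the paper)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # transcribed source-language text


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp of the form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment], translate: Callable[[str], str]) -> str:
    """Translate each segment's text while keeping its original timing."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{translate(seg.text)}\n"
        )
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)
```

Because the timing comes from the transcription segments and only the text passes through `translate`, the translated subtitles stay attached to the same points in the video regardless of which translation model is plugged in.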

This paper proposes two innovative methodologies to construct customized
Common Voice datasets for low-resource languages like Hindi. The first
methodology leverages Bark, a transformer-based text-to-audio model developed
by Suno, and incorporates Meta's EnCodec and a pre-trained HuBERT model to
enhance Bark's performance. The second methodology employs Retrieval-Based
Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both
methodologies contribute to the advancement of ASR technology and offer
valuable insights into addressing the challenges of constructing customized
Common Voice datasets for under-resourced languages. Furthermore, they provide
a pathway to achieving high-quality, personalized voice generation for a range
of applications.
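
To make concrete what a customized Common Voice-style dataset might involve, the sketch below writes a tab-separated manifest pairing generated clips with their sentences. Common Voice releases ship TSV manifests with more columns than shown here. The three-column subset, the `write_manifest` helper, and the example file name are assumptions for illustration, not the layout used in the paper.

```python
import csv
import io
from typing import Iterable, Tuple


def write_manifest(rows: Iterable[Tuple[str, str]], locale: str = "hi") -> str:
    """Return TSV text with one line per (clip path, sentence) pair.

    The column subset (path, sentence, locale) is an assumed minimal layout.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["path", "sentence", "locale"])
    for path, sentence in rows:
        # Collapse runs of whitespace so each sentence stays on one line.
        writer.writerow([path, " ".join(sentence.split()), locale])
    return buffer.getvalue()
```

For example, `write_manifest([("clip_0001.mp3", "namaste duniya")])` produces a header line followed by one row for the clip, which could then be saved next to the synthesized audio files.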