Contrastive language-audio pretraining~(CLAP) has been developed to align the
representations of audio and language, achieving remarkable performance in
retrieval and classification tasks. However, current CLAP models struggle to
capture temporal information within audio and text features, which substantially
limits tasks such as audio retrieval and generation. To address this
gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language
Models~(LLMs) and mixed-up strategies to generate temporal-contrastive captions
for audio clips from extensive audio-text datasets. Subsequently, a new
temporal-focused contrastive loss is designed to fine-tune the CLAP model by
incorporating these synthetic data. We conduct comprehensive experiments and
analyses across multiple downstream tasks. T-CLAP shows improved capability in
capturing the temporal relationship of sound events and outperforms
state-of-the-art models by a significant margin.
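
As a rough illustration of the idea described above, the sketch below builds a temporally reversed caption to serve as a hard negative and scores it with an InfoNCE-style loss. This is not the T-CLAP implementation: the abstract does not specify the form of the loss, so the "followed by" caption template, the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def swap_order(event_a, event_b):
    """Return a caption and its temporally reversed counterpart.

    The "followed by" template is a hypothetical stand-in for the
    LLM-generated temporal-contrastive captions mentioned in the abstract.
    """
    return (f"{event_a} followed by {event_b}",
            f"{event_b} followed by {event_a}")


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(audio_emb, pos_text_emb, neg_text_embs,
                              temperature=0.07):
    """InfoNCE-style loss for a single audio clip (an assumed form).

    The positive caption competes against temporally contrastive negatives;
    the loss is -log softmax(positive) over all candidate captions.
    """
    logits = [cosine(audio_emb, pos_text_emb) / temperature]
    logits += [cosine(audio_emb, neg) / temperature for neg in neg_text_embs]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Toy usage: 2-D embeddings standing in for real encoder outputs.
caption, reversed_caption = swap_order("a dog barks", "a door slams")
loss_aligned = temporal_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_confused = temporal_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the audio embedding sits closer to the correctly ordered caption than to the reversed one, and large otherwise, which is the behavior a temporal-focused objective would need to reward.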

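T-CLAP uses large language models and mixed-up strategies to generate temporal-contrastive captions for audio clips, and designs a new temporal-focused contrastive loss to fine-tune the contrastive language-audio pretraining model. Across multiple downstream tasks, it captures the temporal relationships of audio events better and significantly outperforms state-of-the-art models.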

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Videos on mobile devices have recently become the most popular way to share and
acquire information. To make creation easier for users, we present
MobileVidFactory, a system that automatically generates vertical mobile videos
for which users mainly need to provide only simple text. Our system
consists of two parts: basic and customized generation. In basic
generation, we take a pretrained image diffusion model and
adapt it into a high-quality open-domain vertical video generator for mobile
devices. For audio, the system retrieves a suitable background sound for the
video from our large database. In addition, to produce
customized content, the system lets users add specified on-screen text to
the video to enrich visual expression, and specify text to be read aloud
automatically in a voice of their choice.

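MobileVidFactory is a system that automatically generates vertical mobile videos from simple user-provided text, using a pretrained image diffusion model and audio retrieval to produce high-quality, customized mobile videos.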

MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

In this paper, we present MovieFactory, a powerful framework to generate
cinematic-picture (3072$\times$1280), film-style (multi-scene), and
multi-modality (sounding) movies on demand from natural language. As the
first fully automated movie generation model to the best of our knowledge, our
approach empowers users to create captivating movies with smooth transitions
using simple text inputs, surpassing existing methods that produce soundless
videos limited to a single scene of modest quality. To facilitate this
distinctive functionality, we leverage ChatGPT to expand user-provided text
into detailed sequential scripts for movie generation. Then we bring scripts to
life visually and acoustically through vision generation and audio retrieval.
To generate videos, we extend the capabilities of a pretrained text-to-image
diffusion model through a two-stage process. Firstly, we employ spatial
finetuning to bridge the gap between the pretrained image model and the new
video dataset. Subsequently, we introduce temporal learning to capture object
motion. In terms of audio, we leverage sophisticated retrieval models to select
and align audio elements that correspond to the plot and visual content of the
movie. Extensive experiments demonstrate that our MovieFactory produces movies
with realistic visuals, diverse scenes, and seamlessly fitting audio, offering
users a novel and immersive experience. Generated samples can be found on
YouTube or Bilibili (1080P).
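
The audio retrieval step can be pictured as a nearest-neighbor lookup in a shared embedding space. The sketch below is a minimal illustration under that assumption, not the MovieFactory implementation: the function names, the file names, and the toy 2-D embeddings are hypothetical, and the abstract does not say which retrieval models or similarity measure are used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_audio(scene_emb, audio_db):
    """Return the name of the audio clip most similar to a scene embedding.

    audio_db maps a clip name to its embedding; both the scene and the clips
    are assumed to be embedded in the same space by some pretrained encoder.
    """
    return max(audio_db, key=lambda name: cosine(scene_emb, audio_db[name]))


# Toy usage: hypothetical clip names and hand-written embeddings.
audio_db = {"rain.wav": [1.0, 0.0], "traffic.wav": [0.0, 1.0]}
best_clip = retrieve_audio([0.9, 0.1], audio_db)
```

Running the lookup once per scene in the generated script would give each scene its own background audio.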

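This paper presents MovieFactory, a framework for generating movies from natural language requests. It combines an automated movie generation model, natural language processing, a text-to-image model, and audio retrieval.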

MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images

We consider the task of retrieving audio using free-form natural language
queries. To study this problem, which has received limited attention in the
existing literature, we introduce challenging new benchmarks for text-based
audio retrieval using text annotations sourced from the AudioCaps and Clotho
datasets. We then employ these benchmarks to establish baselines for
cross-modal audio retrieval, where we demonstrate the benefits of pre-training
on diverse audio tasks. We hope that our benchmarks will inspire further
research into cross-modal text-based audio retrieval with free-form text
queries.
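
Retrieval benchmarks of this kind are commonly scored with recall at rank K. The sketch below shows one way such a metric could be computed from a text-to-audio similarity matrix. It is an illustration rather than the paper's evaluation code: the function name and the toy scores are hypothetical, and it makes the simplifying assumption that query i has exactly one correct audio clip, at index i.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose correct audio clip ranks in the top k.

    similarity[i][j] is the score of text query i against audio clip j.
    Assumption: the ground-truth clip for query i is clip i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of audio clips sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy usage: three queries scored against three clips.
similarity = [
    [0.9, 0.1, 0.0],  # correct clip ranked first
    [0.8, 0.5, 0.1],  # correct clip ranked second
    [0.2, 0.3, 0.1],  # correct clip ranked third
]
r_at_1 = recall_at_k(similarity, 1)
r_at_2 = recall_at_k(similarity, 2)
r_at_3 = recall_at_k(similarity, 3)
```

Datasets with several captions per clip would need the ground-truth mapping passed in explicitly instead of relying on matching indices.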

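This work introduces new benchmarks for text-based audio retrieval, built from free-form natural language text annotations, and uses them to establish baselines for cross-modal audio retrieval and to examine its benefits.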