In this work, we propose several variants of a self-attention based network
for emotion prediction from movies, which we call AttendAffectNet. We take
both audio and video into account and incorporate the relations among multiple
modalities by applying the self-attention mechanism in a novel manner to the
extracted features for emotion prediction. We compare this to the typical
temporal integration of the self-attention based model, which in our case
allows us to capture the relations among temporal representations of the movie
while considering the sequential dependencies of emotion responses. We
demonstrate the effectiveness of our proposed architectures on the extended
COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies
Task [3], both of which consist of movies with emotion annotations. Our
results show that applying the self-attention mechanism across the different
audio-visual features, rather than in the time domain, is more effective for
emotion prediction. Our approach is also shown to outperform many
state-of-the-art models for emotion prediction. The code to reproduce our
results with the models' implementation is available at: this https URL
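
The contrast between attending over features and attending over time can be illustrated with a minimal sketch of scaled dot-product self-attention. This is not the authors' implementation: the toy feature vectors, the three hypothetical feature types, and the simplification Q = K = V (no learned projections) are all assumptions made for illustration. The same function would apply to a temporal sequence if each token were a time step instead of a feature type.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Simplification (assumption): a trained model would first apply
    learned linear projections to obtain queries, keys, and values.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights

# Hypothetical toy input: one token per feature type (for example one audio
# and two visual descriptors), already mapped to a common dimension of 4.
feature_tokens = [
    [0.2, 0.1, 0.7, 0.0],
    [0.9, 0.3, 0.1, 0.4],
    [0.5, 0.5, 0.2, 0.1],
]

fused, attn = self_attention(feature_tokens)

# Average-pool the attended tokens into one vector that a regression head
# (not shown) could map to an emotion score.
pooled = [sum(col) / len(fused) for col in zip(*fused)]
```

Here the attention matrix has one row per feature type, so it expresses how strongly each modality draws on the others. In the temporal variant the rows would instead index time steps.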

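This paper proposes several variants of a self-attention network for predicting emotion from movies. The network fuses audio and video, models the relations among modalities, and applies self-attention to the extracted features for emotion prediction. On the COGNIMUSE dataset and the MediaEval 2016 Emotional Impact of Movies Task, this proves more effective than self-attention in the time domain and also outperforms many state-of-the-art emotion prediction models.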

AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

Video summarization aims to extract keyframes or shots from a long video.
Previous methods mainly take the diversity and representativeness of generated
summaries as prior knowledge in algorithm design. In this paper, we formulate
video summarization as a content-based recommender problem: the system should
distill the most useful content from a long video for users who suffer from
information overload. We propose a scalable deep neural network that predicts
whether a video segment is useful for users by explicitly modelling both the
segment and the video. Moreover, we perform scene and action recognition in
untrimmed videos in order to find more correlations among different aspects of
video understanding tasks. We also discuss the effect of audio and visual
features on the summarization task, and we extend our work with data
augmentation and multi-task learning to prevent the model from overfitting at
an early stage. Our final model won first place in the ICCV 2019 CoView
Workshop Challenge Track.
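
The recommender framing can be pictured with a small sketch in which the video plays the role of the "user" and each segment the role of an "item". Everything here is an assumption for illustration, not the network described above: the video is summarized as the mean of its segment embeddings, usefulness is a sigmoid of the segment-video dot product, and the names `usefulness_scores` and `select_summary` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def usefulness_scores(segments):
    """Score each segment against a whole-video context vector.

    Assumption: the video context is the mean segment embedding and the
    score is sigmoid(segment . video). A learned model would replace both.
    """
    video = mean_vector(segments)
    scores = []
    for seg in segments:
        logit = sum(a * b for a, b in zip(seg, video))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

def select_summary(segments, k):
    """Return the indices of the k highest-scoring segments, in temporal order."""
    scores = usefulness_scores(segments)
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical 2-D segment embeddings for a three-segment video.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = usefulness_scores(segments)
summary = select_summary(segments, k=2)
```

In this toy case the first two segments resemble the overall video content most closely, so they are the ones kept.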

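This paper formulates video summarization as a content-based recommender problem. It uses a scalable deep neural network that explicitly models both segment and video to make predictions, applies scene and action recognition to find correlations among different aspects of video understanding tasks, discusses the effect of audio and visual features on the summarization task, and uses data augmentation and multi-task learning to prevent overfitting. The model won first place in the ICCV 2019 CoView Workshop Challenge Track.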

Comprehensive Video Understanding: Video summarization with content-based video recommender design

We propose a novel approach for First Impressions Recognition in terms of the
Big Five personality traits from short videos. The Big Five is a model that
describes human personality using five broad categories: Extraversion,
Agreeableness, Conscientiousness, Neuroticism and Openness. We train two
bi-modal end-to-end deep neural network architectures using temporally ordered
audio and novel stochastic visual features from a few frames, without
over-fitting. We empirically show that the trained models perform
exceptionally well, even when trained on small sub-portions of the inputs.
Our method is evaluated in the ChaLearn LAP 2016 Apparent Personality Analysis
(APA) competition using the ChaLearn LAP APA2016 dataset and achieves
excellent performance.
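
One way to read "stochastic visual features from a few frames" is sketched below: the video is split into equal temporal partitions and one frame is drawn at random from each, so the selected frames stay in temporal order while differing between training passes. This partition-based scheme, the function name `sample_frames`, and the frame counts are assumptions for illustration, since the abstract does not spell out the sampling procedure.

```python
import random

def sample_frames(num_frames, num_partitions, rng):
    """Pick one random frame index from each equal-length temporal partition.

    Assumed scheme: partition i covers indices [bounds[i], bounds[i + 1]).
    """
    if num_partitions <= 0 or num_frames < num_partitions:
        raise ValueError("need at least one frame per partition")
    bounds = [i * num_frames // num_partitions for i in range(num_partitions + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_partitions)]

# Hypothetical clip of 450 frames, reduced to 6 frames per training pass.
rng = random.Random(0)
indices = sample_frames(num_frames=450, num_partitions=6, rng=rng)

# A second draw gives a different subset of the same clip, which acts as a
# form of augmentation when only a few frames are used per pass.
indices_again = sample_frames(num_frames=450, num_partitions=6, rng=rng)
```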

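This paper proposes a new method for first impressions recognition based on the Big Five personality traits in short videos. It trains bi-modal deep neural network architectures on audio features and visual features from a small number of frames, and performs well when evaluated on the ChaLearn LAP APA2016 dataset.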