Automatically counting people and identifying their gender has many applications in workplaces, exhibitions, malls, sales, and industrial settings. Although current speech detection methods generally perform well, in most situations both the number of active speakers and their genders are unknown, and classification-based methods are ill-suited because of the large number of possible classes. In this study, we focus on a long short-term memory convolutional neural network (LSTM-CNN) that extracts time- and/or frequency-dependent features from the sound data to estimate the number and gender of simultaneously active speakers at each frame in noisy environments. Taking the maximum number of speakers to be 10, we used 19,000 audio samples for training, with diverse combinations of males, females, and background noise from public city spaces, industrial settings, malls, exhibitions, workplaces, and nature. This proof of concept shows promising performance in detecting count and gender, with training/validation MSE values of about 0.019/0.017.

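Using a long short-term memory convolutional neural network (LSTM-CNN), this study extracts time- and/or frequency-dependent sound features from each frame of audio data in noisy environments in order to estimate the number and gender of simultaneously active speakers. Training used 19,000 audio samples of males, females, and background noise across a variety of settings, including public city spaces, industrial environments, malls, exhibitions, workplaces, and nature. This proof of concept shows promising performance in detecting count and gender, with training/validation mean squared error (MSE) values of about 0.019/0.017.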
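
The abstract frames count and gender estimation as regression scored by MSE rather than as classification over many classes. The sketch below illustrates one way such a per-frame target could be set up. The encoding is an assumption, not something the abstract specifies: each frame is labeled with a (male, female) count pair scaled by the stated maximum of 10 speakers. The helper names `encode_target`, `decode_prediction`, and `mse` are hypothetical and used only for illustration.

```python
MAX_SPEAKERS = 10  # maximum number of simultaneous speakers stated in the abstract


def encode_target(n_male, n_female):
    """Assumed encoding: scale per-frame (male, female) counts to [0, 1]."""
    if n_male < 0 or n_female < 0 or n_male + n_female > MAX_SPEAKERS:
        raise ValueError("counts must be non-negative and sum to at most 10")
    return (n_male / MAX_SPEAKERS, n_female / MAX_SPEAKERS)


def decode_prediction(pred):
    """Map scaled regression outputs back to integer counts, clamped to 0..10."""
    counts = []
    for value in pred:
        count = round(value * MAX_SPEAKERS)
        counts.append(min(MAX_SPEAKERS, max(0, count)))
    return tuple(counts)


def mse(predictions, targets):
    """Mean squared error averaged over all frames and both outputs."""
    total = 0.0
    n = 0
    for pred, target in zip(predictions, targets):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            n += 1
    return total / n


if __name__ == "__main__":
    # Two illustrative frames: 3 males + 2 females, then 1 male + 0 females.
    targets = [encode_target(3, 2), encode_target(1, 0)]
    # Made-up network outputs, used only to exercise the functions.
    predictions = [(0.31, 0.18), (0.12, 0.04)]
    print("decoded counts:", [decode_prediction(p) for p in predictions])
    print("MSE:", mse(predictions, targets))
```

With a regression target like this, the network needs only two outputs per frame instead of one class per possible male/female combination, which is the motivation the abstract gives for avoiding classification. How the reported MSE values relate to this particular scaling is not stated in the source.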

LSTM-CNN Network for Audio Feature Analysis in Noisy Environments