The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset, consisting of the abstracts of all 24 thousand ICLR submissions from 2017 to 2024, together with metadata, decision scores, and custom keyword-based labels. We find that on this dataset, a bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top-performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify hedgehogs and foxes among the authors with the highest number of ICLR submissions.

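This work presents a dataset containing the abstracts of all 24,000 ICLR submissions from 2017 to 2024. On this dataset, a bag-of-words representation outperforms most sentence transformer models in $k$NN classification accuracy, while the top language models only slightly outperform TF-IDF. This result poses a challenge for the NLP community. The dataset is also used to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. A 2D embedding of the abstract texts describes the shift in research topics from 2017 to 2024 and identifies the specialists and generalists (hedgehogs and foxes) among the authors with the most ICLR submissions.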
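To make the evaluation setup concrete, here is a minimal sketch of $k$NN classification on TF-IDF vectors with cosine similarity, written with the Python standard library only. The toy abstracts, the labels `graphs` and `rl`, the crude tokenizer, and the smoothed IDF formula are all assumptions made for illustration. None of them are taken from the ICLR dataset or from the authors' pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency (an assumed, common variant)."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse, L2-normalised TF-IDF vector; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy "abstracts" and labels, invented for this sketch.
train_texts = [
    "graph neural networks for node classification on graphs",
    "message passing graph networks and graph attention",
    "graph convolution for molecular property prediction",
    "policy gradient methods for reinforcement learning agents",
    "reward shaping and exploration in reinforcement learning",
    "offline reinforcement learning with value function regularization",
]
train_labels = ["graphs", "graphs", "graphs", "rl", "rl", "rl"]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]


def classify(text, k=3):
    """Classify a new abstract by kNN over the toy training set."""
    return knn_predict(tfidf_vector(tokenize(text), idf), train_vecs, train_labels, k)


if __name__ == "__main__":
    print(classify("exploration bonuses for reinforcement learning agents"))
    print(classify("attention over graph nodes"))
```

On a real corpus, the same accuracy metric would be computed on held-out abstracts, and the TF-IDF vectors could be swapped for sentence-embedding vectors to compare representations under an identical $k$NN classifier.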

Learning representations of learning representations