In the scenario-based evaluation of machine learning models, a key problem is
how to construct test datasets that represent various scenarios. The
methodology proposed in this paper is to construct a benchmark and attach
metadata to each test case. Then a test system can be constructed with test
morphisms that filter the test cases based on metadata to form a dataset.
The paper demonstrates this methodology with large language models for code
generation. A benchmark called ScenEval is constructed from problems in
textbooks, an online tutorial website and Stack Overflow. Filtering by scenario
is demonstrated and the test sets are used to evaluate ChatGPT for Java code
generation.
Our experiments found that the performance of ChatGPT decreases with the
complexity of the coding task. It is weakest for advanced topics like
multi-threading, data structure algorithms and recursive methods. The Java code
generated by ChatGPT tends to be much shorter than the reference solution in
number of lines. When the generated code is correct, it is more likely to be
more complex than the reference solution in both cyclomatic and cognitive
complexity; when it is incorrect, it is more likely to be less complex.
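
The metadata-based filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the ScenEval implementation: the `TestCase` class, the `filter_by` helper, the metadata keys (`topic`, `source`) and the three sample tasks are all hypothetical and chosen only to show how a filtering morphism could turn a benchmark into a scenario-specific test set.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TestCase:
    """A benchmark task plus the metadata used to select it (hypothetical schema)."""
    task_id: str
    description: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_by(predicate: Callable[[Dict[str, str]], bool]):
    """Build a filtering 'test morphism': a function from a test set to a test set."""
    def morphism(cases: List[TestCase]) -> List[TestCase]:
        return [case for case in cases if predicate(case.metadata)]
    return morphism


# A tiny made-up benchmark; real entries would come from the benchmark's sources.
benchmark = [
    TestCase("t1", "Sum the elements of an int array",
             {"topic": "arrays", "source": "textbook"}),
    TestCase("t2", "Implement a thread-safe counter",
             {"topic": "multi-threading", "source": "stack_overflow"}),
    TestCase("t3", "Compute a factorial recursively",
             {"topic": "recursion", "source": "textbook"}),
]

# Each scenario is a predicate over metadata; morphisms compose by chaining.
advanced = filter_by(lambda m: m.get("topic") in {"multi-threading", "recursion"})
from_textbook = filter_by(lambda m: m.get("source") == "textbook")

advanced_set = advanced(benchmark)
advanced_textbook_set = from_textbook(advanced(benchmark))

print([case.task_id for case in advanced_set])           # ['t2', 't3']
print([case.task_id for case in advanced_textbook_set])  # ['t3']
```

Because each morphism maps a list of test cases to a list of test cases, scenarios can be combined by chaining filters without changing the benchmark itself.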

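This paper introduces a scenario-based methodology for evaluating machine learning models and constructs a benchmark for evaluating code generation. The experiments show that ChatGPT performs worst on complex coding tasks. The generated code usually has fewer lines than the reference solution; when it is correct, it tends to be more complex than the reference solution in cyclomatic and cognitive complexity, and when it is incorrect, it tends to be less complex.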

ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation

Out-of-distribution (OOD) detection is the problem of identifying inputs
that are unrelated to the in-distribution task. OOD detection performance
when the in-distribution (ID) dataset is ImageNet-1K is commonly tested on a
small range of test OOD datasets. We find that most of the currently used test
OOD datasets, including datasets from the open set recognition (OSR)
literature, have severe issues: in some cases, more than 50% of the samples
contain objects belonging to one of the ID classes. These erroneous samples
heavily distort the evaluation of OOD detectors. As a solution, we introduce
NINCO, a novel test OOD dataset in which each sample has been checked to be
ID-free. Its fine-grained range of OOD classes allows for a detailed analysis
of an OOD detector's strengths and failure modes, particularly when paired with
a number of synthetic "OOD unit-tests". We provide detailed evaluations across a
large set of architectures and OOD detection methods on NINCO and the
unit-tests, revealing new insights about model weaknesses and the effects of
pretraining on OOD detection performance. We provide code and data at
this https URL
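
The distortion caused by ID objects in an OOD test set can be illustrated with a small, self-contained Python sketch. The scores below are made-up confidence values rather than results from the paper, and the `auroc` helper is a plain pairwise implementation written for this illustration; higher scores are assumed to mean "more in-distribution".

```python
from typing import Sequence


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Probability that a random ID sample scores higher than a random OOD
    sample, counting ties as one half (pairwise definition of the AUROC)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical detector scores on ID test images.
id_scores = [0.9, 0.8, 0.7, 0.6]

# Hypothetical scores on a clean OOD test set.
clean_ood_scores = [0.5, 0.4, 0.3, 0.65]

# The same OOD set plus two images that actually show ID objects. The detector
# correctly gives them high ID scores, but the labels count them as OOD.
contaminated_ood_scores = clean_ood_scores + [0.85, 0.95]

clean_auroc = auroc(id_scores, clean_ood_scores)
contaminated_auroc = auroc(id_scores, contaminated_ood_scores)

print(f"{clean_auroc:.4f}")         # 0.9375
print(f"{contaminated_auroc:.4f}")  # 0.6667
```

In this toy setting the detector does nothing wrong on the two mislabeled images, yet the measured AUROC drops sharply. This is the kind of evaluation error that an ID-free test set is meant to remove.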

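The paper proposes a new test set, NINCO, together with synthetic OOD unit tests, to evaluate model performance in out-of-distribution detection more accurately, and provides a detailed evaluation of the effect of pretraining on OOD detection performance.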

In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation

Artificial intelligence (AI) solutions that automatically extract information
from digital histology images have shown great promise for improving
pathological diagnosis. Prior to routine use, it is important to evaluate their
predictive performance and obtain regulatory approval. This assessment requires
appropriate test datasets. However, compiling such datasets is challenging and
specific recommendations are missing.
A committee of various stakeholders, including commercial AI developers,
pathologists, and researchers, discussed key aspects and conducted extensive
literature reviews on test datasets in pathology. Here, we summarize the
results and derive general recommendations for the collection of test datasets.
We address several questions: Which and how many images are needed? How to
deal with low-prevalence subsets? How can potential bias be detected? How
should datasets be reported? What are the regulatory requirements in different
countries?
The recommendations are intended to help AI developers demonstrate the
utility of their products and to help regulatory agencies and end users verify
reported performance measures. Further research is needed to formulate criteria
for sufficiently representative test datasets so that AI solutions can operate
with less user intervention and better support diagnostic workflows in the
future.

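AI solutions that automatically extract information from digital histology images have shown promise for improving pathological diagnosis. Before routine use, however, their predictive performance must be evaluated and regulatory approval obtained, which requires appropriate test datasets. This paper summarizes general recommendations for test datasets in pathology, intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures.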

Recommendations on test datasets for evaluating AI solutions in pathology

The Deep Noise Suppression (DNS) challenge is designed to foster innovation
in the area of noise suppression to achieve superior perceptual speech quality.
We recently organized a DNS challenge special session at INTERSPEECH and ICASSP
2020. We open-sourced training and test datasets for the wideband scenario. We
also open-sourced a subjective evaluation framework based on ITU-T standard
P.808, which was also used to evaluate participants of the challenge. Many
researchers from academia and industry made significant contributions to push
the field forward, yet even the best noise suppressor was far from achieving
superior speech quality in challenging scenarios. In this version of the
challenge, organized at INTERSPEECH 2021, we are expanding both our training and
test datasets to accommodate full-band scenarios. The two tracks in this
challenge will focus on real-time denoising for (i) wideband and (ii) full-band
scenarios. We are also making available a reliable non-intrusive objective
speech quality metric called DNSMOS for the participants to use during their
development phase.
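
As a rough illustration of how subjective ratings of the kind collected in a P.808-style listening test can be summarized, the Python sketch below computes a mean opinion score (MOS) per clip with an approximate 95% confidence interval. The ratings are made up, the clip names are hypothetical, and the normal-approximation interval is an assumption of this sketch. It does not reproduce the challenge's evaluation framework, which also involves steps such as rater qualification that are omitted here.

```python
import math
import statistics
from typing import Dict, List, Tuple

# Hypothetical ratings on a 1 (bad) to 5 (excellent) scale, one list per clip.
ratings: Dict[str, List[int]] = {
    "clip_a": [4, 5, 4, 4, 3],
    "clip_b": [2, 3, 2, 2, 1],
}


def mos_with_ci(votes: List[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval).

    The interval uses a normal approximation, which is only a rough guide
    for the small rating counts in this example.
    """
    mean = statistics.fmean(votes)
    half_width = z * statistics.stdev(votes) / math.sqrt(len(votes))
    return mean, half_width


mos = {clip: mos_with_ci(votes) for clip, votes in ratings.items()}

for clip, (mean, half_width) in mos.items():
    print(f"{clip}: MOS = {mean:.2f} +/- {half_width:.2f}")
```

A non-intrusive metric such as DNSMOS is meant to estimate this kind of score without running a listening test, which is why it is useful during development.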

The Deep Noise Suppression (DNS) Challenge aims to improve perceptual speech quality by open-sourcing training and test datasets and a subjective evaluation framework. Its two tracks focus on real-time denoising for wideband and full-band scenarios, and it provides a non-intrusive objective speech quality metric called DNSMOS.