Over the past few years, the abilities of large language models (LLMs) have
received extensive attention, as these models have performed exceptionally well
in complicated scenarios such as logical reasoning and symbolic inference. A
significant factor contributing to this progress is in-context
learning and few-shot prompting. However, the reasons behind the success of
such models at contextual reasoning have not been fully explored. Do LLMs
understand logical rules and use them to draw inferences, or do they ``guess'' the
answers by learning a type of probabilistic mapping through context? This paper
investigates the reasoning capabilities of LLMs on two logical reasoning
datasets by using counterfactual methods to replace context text and modify
logical concepts. Our analysis finds that LLMs do not truly
understand logical rules; rather, in-context learning has simply increased the
likelihood of these models arriving at the correct answers. Altering
certain words in the context text or changing the concepts of logical terms can
significantly disrupt the outputs of LLMs, leading to counter-intuitive
responses. This work provides critical insights into the limitations of LLMs,
underscoring the need for more robust mechanisms to ensure reliable logical
reasoning in LLMs.

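Large language models perform well in complex scenarios such as logical reasoning and symbolic inference, but they are limited in their understanding of logical rules. This paper uses counterfactual methods to examine the reasoning capabilities of large language models and highlights the need for stronger mechanisms to ensure reliable logical reasoning.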

Do Large Language Models Understand Logic or Just Mimick Context?

In both academic and industry research, online evaluation methods are
seen as the gold standard for interactive applications like recommendation
systems. The reason is that they let us directly measure utility
metrics that rely on interventions, namely the recommendations that are
shown to users. Nevertheless, online evaluation methods are costly for a number
of reasons, and a clear need remains for reliable offline evaluation
procedures. In industry, offline metrics are often used as a first-line
evaluation to generate promising candidate models to evaluate online. In
academic work, limited access to online systems makes offline metrics the de
facto approach to validating novel methods. Two classes of offline metrics
exist: proxy-based methods and counterfactual methods. The first class is
often poorly correlated with the online metrics we care about, and the second
class only provides theoretical guarantees under assumptions that cannot be
fulfilled in real-world environments. Here, we make the case that
simulation-based comparisons provide ways forward beyond offline metrics, and
argue that they are a preferable means of evaluation.
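
To make the counterfactual class of offline metrics concrete, the following is a minimal sketch of an inverse propensity scoring (IPS) estimator, one common member of that class. The abstract does not name a specific estimator, so this choice, the log format, and the names `ips_estimate`, `logs`, and `always_a` are assumptions made for illustration. The sketch also shows the kind of assumption such guarantees rest on: the logging propensities must be known and strictly positive for every action the target policy can show.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical log format: (context, action shown, observed reward,
# probability with which the logging policy chose that action).
LoggedInteraction = Tuple[str, str, float, float]


def ips_estimate(
    logs: Sequence[LoggedInteraction],
    target_policy: Callable[[str, str], float],
) -> float:
    """Estimate the average reward of target_policy from logged data.

    target_policy(context, action) returns the probability that the new
    policy would show `action` in `context`.
    """
    if not logs:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for context, action, reward, propensity in logs:
        if propensity <= 0.0:
            # IPS is undefined without positive logging propensities.
            raise ValueError("logging propensity must be positive")
        # Reweight each reward by how much more (or less) likely the
        # target policy is to take the logged action.
        total += reward * target_policy(context, action) / propensity
    return total / len(logs)


# Toy log: a uniform logging policy over items "a" and "b".
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u1", "b", 0.0, 0.5),
    ("u2", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
]


def always_a(context: str, action: str) -> float:
    """Target policy that always recommends item "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_estimate(logs, always_a)
print(estimate)
```

On this toy log the estimate is 1.0, which matches the reward that always showing item "a" would earn. In practice the propensities are often unknown or close to zero for many actions, which is where the estimator's guarantees stop holding.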

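This paper discusses the importance of online evaluation methods for interactive applications such as recommender systems, analyzes the characteristics of offline evaluation methods, and argues for the advantages of simulation-based comparison as a means of evaluation.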

Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation

Learning to Rank (LTR) from user interactions is challenging as user feedback
often contains high levels of bias and noise. At the moment, two methodologies
for dealing with bias prevail in the field of LTR: counterfactual methods that
learn from historical data and model user behavior to deal with biases; and
online methods that perform interventions to deal with bias but use no explicit
user models. For practitioners, the choice between the two methodologies is very
important because of its direct impact on end users. Nevertheless, there has
never been a direct comparison between these two approaches to unbiased LTR. In
this study we provide the first benchmarking of both counterfactual and online
LTR methods under different experimental conditions. Our results show that the
choice between the methodologies is consequential and depends on the presence
of selection bias and on the degree of position bias and interaction noise. In
settings with little bias or noise, counterfactual methods can obtain the
highest ranking performance; however, in other circumstances their optimization
can be detrimental to the user experience. Conversely, online methods are very
robust to bias and noise but require control over the displayed rankings. Our
findings both confirm and contradict existing expectations on the impact of
model-based and intervention-based methods in LTR, and allow practitioners to
make an informed decision between the two methodologies.
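
As an illustration of how a counterfactual LTR method can use a model of user behavior to correct for position bias, the sketch below reweights clicks by the inverse of an assumed examination probability per rank. The abstract does not specify any estimator, so the weighting scheme, the data layout, and the names `debiased_click_scores`, `impressions`, and `examination_prob` are hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# Hypothetical impression format: (document id, rank shown at, clicked?).
Impression = Tuple[str, int, bool]


def debiased_click_scores(
    impressions: Sequence[Impression],
    examination_prob: Dict[int, float],
) -> Dict[str, float]:
    """Return a position-debiased click rate per document.

    examination_prob[rank] is the assumed probability that a user
    examines the result shown at that rank (the user model).
    """
    weighted_clicks = defaultdict(float)
    times_shown = defaultdict(int)
    for doc_id, rank, clicked in impressions:
        times_shown[doc_id] += 1
        if clicked:
            # A click at a rarely examined rank counts for more.
            weighted_clicks[doc_id] += 1.0 / examination_prob[rank]
    return {doc: weighted_clicks[doc] / times_shown[doc] for doc in times_shown}


# Assumed user model: rank 1 is always examined, rank 2 half the time.
examination_prob = {1: 1.0, 2: 0.5}
impressions = [
    ("A", 1, True), ("A", 1, False),
    ("B", 2, True), ("B", 2, False), ("B", 2, False), ("B", 2, False),
]

scores = debiased_click_scores(impressions, examination_prob)
print(scores)
```

Here the raw click rates are 0.5 for document A and 0.25 for document B, but after reweighting both score 0.5, since B was only shown at the less examined rank. The correction is only as good as the assumed examination probabilities, and it cannot recover documents that were never shown, which relates to the selection bias the abstract mentions.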

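This study presents the first direct comparison of the two methodologies in the LTR field. The results show that the performance of the two methods differs significantly under different experimental conditions, and that choosing between them requires considering selection bias and the degree of position bias and interaction noise.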