Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering, and information retrieval. Despite its importance, current NLI systems rely heavily on supervised learning with datasets that often contain annotation artifacts and biases, which limits generalization and real-world applicability. In this work, we apply a reinforcement learning approach based on Group Relative Policy Optimization (GRPO) to Chain-of-Thought (CoT) learning in NLI, eliminating the need for labeled rationales and enabling this type of training on more challenging datasets such as ANLI. We fine-tune 7B, 14B, and 32B language models using parameter-efficient techniques (LoRA and QLoRA), demonstrating strong performance across standard and adversarial NLI benchmarks. Our 32B AWQ-quantized model surpasses state-of-the-art results on 7 out of 11 adversarial sets (or on all of them when considering our replication) within a 22 GB memory footprint, showing that robust reasoning can be retained under aggressive quantization. This work provides a scalable and practical framework for building robust NLI systems without sacrificing inference quality.
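To make the GRPO step concrete, the following is a minimal sketch of the group-relative advantage that gives the method its name: several completions are sampled for one premise-hypothesis pair, each is scored, and every score is normalized against its own group. The reward used here (1 for a correct final label, 0 otherwise), the function names, the use of the population standard deviation, and the example labels are illustrative assumptions, not the reward design or implementation used in this work.

```python
from statistics import mean, pstdev


def label_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the final label matches the gold label.

    Only the label is checked, so no annotated rationale is needed.
    """
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group.

    Assumption: population standard deviation; eps avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical final labels parsed from four sampled chains of thought
# for a single premise-hypothesis pair whose gold label is "entailment".
sampled_labels = ["entailment", "neutral", "entailment", "contradiction"]
rewards = [label_reward(p, "entailment") for p in sampled_labels]
advantages = group_relative_advantages(rewards)
```

Completions that end in the correct label receive a positive advantage and the others a negative one, so the policy update favors reasoning chains that lead to the right answer without any critic model. If all completions in a group agree, the advantages collapse to zero and that group contributes no learning signal.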

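This work addresses the reliance of current natural language inference (NLI) systems on biased annotated data. It proposes a reinforcement learning approach that uses Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning, eliminating the need for labeled rationales and enabling training on more challenging datasets. The fine-tuned 32B AWQ-quantized model surpasses state-of-the-art results on multiple adversarial NLI benchmarks, showing that strong reasoning ability can be retained under aggressive quantization.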

Pushing the Boundaries of Natural Language Inference