While Italian is by all metrics a high-resource language, there is currently
no language model pre-trained exclusively in this language. This
results in a lower number of available benchmarks to evaluate the performance
of language models in Italian.
This work presents two new benchmarks to evaluate models' performance on
mathematical understanding and language understanding in Italian. These
benchmarks are based on real tests taken by students aged 11 to 18
within the Italian school system and have therefore been
validated by several experts in didactics and pedagogy.
To validate this dataset, we evaluate the performance of 9 language models
that perform best when writing in Italian, including our own
fine-tuned models. We show that this is a challenging benchmark on which current
language models are bounded by 60% accuracy.
We believe that the release of this dataset paves the way for improving
future models' mathematical and language understanding in Italian.

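Evaluating a range of models, the study shows that Italian currently lacks a language model pre-trained specifically for it, which leaves few benchmarks for evaluating language models in Italian. The study proposes two benchmarks based on real tests taken by students aged 11 to 18 in the Italian school system, validated by several experts in didactics and pedagogy. Evaluating 9 of the best-performing language models for writing in Italian, including the authors' own fine-tuned models, shows that current language models reach at most about 60% accuracy on the benchmark. The authors believe that releasing the dataset paves the way for improving future models' mathematical and language understanding in Italian.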

The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian

It has been suggested that large language models such as GPT-4 have acquired
some form of understanding beyond the correlations among the words in text,
including some understanding of mathematics. Here, we perform a
critical inquiry into this claim by evaluating the mathematical understanding
of the GPT-4 model. Considering that GPT-4's training set is a secret, it is
not straightforward to evaluate whether the model's correct answers are based
on a mathematical understanding or based on replication of proofs that the
model has seen before. We specifically craft mathematical questions whose
formal proofs are not readily available on the web, proofs that GPT-4 is
unlikely to have seen. We see that GPT-4 is unable to solve those problems
despite their simplicity. It is hard to find scientific evidence suggesting
that GPT-4 has acquired an understanding of even basic mathematical concepts. A
straightforward way to find failure modes of GPT-4 in theorem proving is to
craft questions whose formal proofs are not available on the web. Our
findings suggest that GPT-4's ability lies in reproducing, rephrasing, and polishing
mathematical proofs that it has seen before, and not in grasping mathematical
concepts. We also see that GPT-4's ability to prove mathematical theorems is
continuously expanding over time despite the claim that it is a fixed model. We
suggest that the task of proving mathematical theorems in formal language is
comparable to the methods used in search engines such as Google, while
predicting the next word in a sentence may be a misguided approach, a recipe
that often leads to excessive extrapolation and eventual failures. Prompting
GPT-4 over and over may benefit GPT-4 and OpenAI, but we question
whether it is valuable for machine learning or for theorem proving.

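The investigation finds that although GPT-4 can reproduce, rephrase, and polish mathematical proofs it has seen before, it does not actually understand basic mathematical concepts. It suggests that proving mathematical theorems in formal language is comparable to the methods used in search engines such as Google, while predicting the next word in a sentence may be a misguided approach that often leads to excessive extrapolation and eventual failures.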

Large Language Models' Understanding of Math: Source Criticism and Extrapolation

Mathematical understanding and reasoning are crucial tasks for assessing the
capabilities of artificial intelligence (AI). However, existing benchmarks
either require just a few steps of reasoning, or only contain a small amount of
data in one specific topic, making it hard to analyse AI's behaviour with
reference to different problems within a specific topic in detail. In this
work, we propose Conic10K, a challenging math problem dataset on conic sections
in Chinese senior high school education. Our dataset contains various problems
with different reasoning depths, while only the knowledge from conic sections
is required. Since the dataset only involves a narrow range of knowledge, it is
easy to separately analyse the knowledge a model possesses and the reasoning
ability it has. For each problem, we provide a high-quality formal
representation, the reasoning steps, and the final solution. Experiments show
that existing large language models, including GPT-4, exhibit weak performance
on complex reasoning. We hope that our findings could inspire more advanced
techniques for precise natural language understanding and reasoning. Our
dataset and code are available at this https URL

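We propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education. The dataset contains various problems with different reasoning depths while requiring only knowledge of conic sections. Experiments show that existing large language models, including GPT-4, perform poorly on complex reasoning. We hope that our findings can inspire more advanced techniques for precise natural language understanding and reasoning.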