

Analysis of institutional authors

Conde, Javier (Author); Merino-Gomez, Elena (Author); Reviriego, Pedro (Corresponding Author)

March 14, 2025

Establishing vocabulary tests as a benchmark for evaluating large language models

Published in: PLOS ONE, 19(12): e0308259, 2024-12-12. DOI: 10.1371/journal.pone.0308259

Authors:

Martinez, Gonzalo; Conde, Javier; Merino-Gomez, Elena; Bermudez-Margaretto, Beatriz; Hernandez, Jose Alberto; Reviriego, Pedro; Brysbaert, Marc

Affiliations

Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain - Author
Univ Ghent, Dept Expt Psychol, Ghent, Belgium - Author
Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain - Author
Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain - Author
Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain - Author

Abstract

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
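To make the test format concrete, the following is a hypothetical sketch of a yes/no vocabulary test of the kind discussed in the paper (the item lists, prompts, and scoring used by the authors are not reproduced here; the words, nonwords, and scoring function below are illustrative assumptions). A model answers "word" or "nonword" for each item, and averaging accuracy over real words and nonwords penalises a model that says "yes" indiscriminately, which would score only 50%.

```python
def score_yes_no_test(responses, gold):
    """Score a yes/no vocabulary test.

    responses, gold: dicts mapping each item to True ("this is a real
    word") or False ("this is a nonword"). Returns a 0-100 score.
    """
    words = [item for item, is_word in gold.items() if is_word]
    nonwords = [item for item, is_word in gold.items() if not is_word]

    # Proportion of real words accepted and of nonwords wrongly accepted.
    hit_rate = sum(responses[i] for i in words) / len(words)
    false_alarm_rate = sum(responses[i] for i in nonwords) / len(nonwords)

    # Averaging accuracy on words and nonwords means answering "yes"
    # to every item yields 50%, not 100%.
    return 100 * (hit_rate + (1 - false_alarm_rate)) / 2


# Illustrative items (invented): two real words, two nonwords.
gold = {"house": True, "river": True, "blornit": False, "skarp": False}
# A model that accepts both words but also one nonword.
model_answers = {"house": True, "river": True, "blornit": False, "skarp": True}
print(score_yes_no_test(model_answers, gold))  # 75.0
```

Running the same scoring over several models, two test formats, and two languages would surface the kind of cross-model and cross-language performance gaps the abstract describes.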

Keywords

Acquisition; Benchmarking; Humans; Language; Language tests; Lextal; Models, theoretical; Quality education; Vocabulary; Word recognition

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal PLOS ONE which, according to the Scopus (SJR) agency, has become a reference in its field thanks to its progression and the good impact it has achieved in recent years. Indicators for the year of publication (2024) have not yet been calculated, but in 2023 the journal was ranked in the first quartile (Q1) of the Multidisciplinary category.

Independently of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the different indexing agencies, the number of citations accumulated by this publication as of 2026-04-24 is:

  • Google Scholar: 2
  • WoS: 3
  • Scopus: 3

Impact and social visibility

From the perspective of influence or social adoption, and based on metrics associated with mentions and interactions provided by agencies specializing in calculating the so-called "Alternative or Social Metrics," we can highlight as of 2026-04-24:

  • Academic use, as evidenced by the Altmetric indicator of saves in the Mendeley personal reference manager, totals 13.
  • Use of this contribution in bookmarks, code forks, additions to reading lists, and general views indicates that others are building on the publication in their current work, which can be an early indicator of future formal academic citations. This is supported by the "Capture" indicator, which totals 11 (PlumX).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Assignment of a Handle/URN as an identifier within the deposit in the Institutional Repository: https://oa.upm.es/85330/

As a result of the publication of the work in the institutional repository, the following usage statistics reflecting its dissemination have been recorded:

  • Views: 215
  • Downloads: 36
Continuing with the social impact of the work, it is important to emphasize that, due to its content, it can be assigned to the area of interest of SDG 4 - Quality Education, with a probability of 88% according to the mBERT algorithm developed by Aurora University.

Leadership analysis of institutional authors

This work has been carried out with international collaboration, specifically with researchers from: Belgium.

The author responsible for correspondence has been REVIRIEGO VASALLO, PEDRO.


Project objectives

The contribution pursues the following objectives: to establish vocabulary tests as a benchmark for evaluating large language models (LLMs); to analyze the performance of seven LLMs using two vocabulary test formats in two languages; to identify gaps in the lexical knowledge of the evaluated models; to characterize the word representations and learning mechanisms of LLMs; to assess performance variations across models and languages; and to explore the automatic generation of vocabulary tests to broaden and improve the evaluation of LLMs' language skills.

Most relevant results

The study presents a detailed evaluation of seven large language models (LLMs) using vocabulary tests in two formats and two languages. First, significant gaps were identified in the lexical knowledge of the models analyzed. Second, notable performance variations were observed depending on the model and the language evaluated. Third, the ability of LLMs to automatically generate and take vocabulary tests was highlighted. Finally, these results underscore the importance of vocabulary tests for better understanding word representations and learning mechanisms in LLMs.

Awards linked to the item

This work was partially supported by the project CyberTutor: Asistente educativo personalizado basado en Grandes Modelos de Lenguaje (LLM), funded by the "Primeros Proyectos" call from ETSIT, UPM; by the FUN4DATE (PID2022-136684OB-C22) and ENTRUDIT (TED2021-130118B-I00) projects funded by the Spanish Agencia Estatal de Investigacion (AEI); by the Chips Act Joint Undertaking project SMARTY (Grant no. 101140087); and by the OpenAI API Research Access Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.