License and Use

Open Access

Analysis of institutional authors

Martinez, Gonzalo - Corresponding Author
Huertas, Cris Pozo - Author
Grandury, Maria - Author
Reviriego, Pedro - Author

October 14, 2025

Spanish is not just one: A dataset of Spanish dialect recognition for LLMs

Published in: Data in Brief, vol. 63, article 112088, 2025-12-01. DOI: 10.1016/j.dib.2025.112088

Authors:

Martinez, G; Mayor-Rocher, M; Huertas, CP; Melero, N; Grandury, M; Reviriego, P

Affiliations

NYU, C Barquillo 13,Madrid Campus, Madrid 28004, Spain - Author
SomosNLP, Madrid, Spain - Author
Univ Autonoma Madrid, Fac Filosofia & Letras, C-Francisco Tomas & Valiente 1, Madrid 28049, Spain - Author
Univ Politecn Madrid, Informat Proc & Telecommun Ctr IPTC, Avda Complutense 30, Madrid 28040, Spain - Author

Abstract

This paper presents a dataset designed to assess the capability of Large Language Models (LLMs) in handling different Spanish dialects. While multilingualism is widely recognized as a crucial aspect of NLP, dialectal evaluation remains largely unexplored. Spanish, spoken by over 600 million people, exhibits significant lexical, morphological, and syntactic variation across regions. Recognizing these linguistic and cultural differences is essential for preserving smaller dialects, preventing their marginalization, and ensuring that Spanish is not reduced to a monolithic language. To address this gap, we introduce a dataset specifically designed to analyze whether LLMs can accurately identify different Spanish varieties while also measuring their potential preference for specific dialects. The dataset consists of 30 carefully crafted multiple-choice questions, requiring models to select the most appropriate option from different regional variations. Each question has been meticulously developed and reviewed by linguistic experts, undergoing multiple refinement cycles to ensure linguistic accuracy and effectiveness in detecting dialectal biases. This dataset represents an important step toward developing more inclusive and fair evaluation frameworks for Spanish Natural Language Processing (NLP). By identifying potential biases in LLMs and analyzing their ability to adapt to regional linguistic variations, this work contributes to the broader goal of equitable language representation in AI-driven text generation and comprehension tasks. (c) 2025 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license.

Keywords

AI; Artificial intelligence; Computational linguistics; Evaluation; Language model; Language processing; Language variation; Large language model; Large language models; Learning algorithms; Learning systems; Linguistics; Machine learning; Machine-learning; Multilingualism; Natural language processing; Natural language processing systems; Natural languages; Spanish dialects; Syntactics

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal Data in Brief. Although the journal is classified in quartile Q3 (agency: WoS (JCR)), its regional focus and its specialization in Multidisciplinary Sciences give it significant recognition within a specific niche of scientific knowledge at an international level.

Impact and social visibility

In terms of influence and social adoption, the following "Alternative or Social Metrics," based on mentions and interactions reported by specialized agencies, stand out as of 2026-04-25:

  • Academic use, as measured by the Altmetric indicator that counts additions to the Mendeley personal reference manager, gives a total of: 8.
  • Bookmarks, code forks, additions to favorites lists for repeated reading, and general views suggest that readers are using the publication as a basis for their current work. This may be an early indicator of future formal academic citations. The "Capture" indicator gives a total of: 8 (PlumX).

For dissemination aimed at more general audiences, broader scores include:

  • The Total Score from Altmetric: 1.
  • The number of mentions on the social network X (formerly Twitter): 1 (Altmetric).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Assignment of a Handle/URN as an identifier within the deposit in the Institutional Repository: https://oa.upm.es/91162/

As a result of the publication of the work in the institutional repository, statistical usage data has been obtained that reflects its impact. In terms of dissemination, we can state that, as of

  • Views: 100
  • Downloads: 96

Leadership analysis of institutional authors

This work has been carried out with international collaboration, specifically with researchers from: United States of America.

There is a significant leadership presence, as authors from the institution appear as first and last author: First Author (MARTINEZ RUIZ DE ARCAUTE, GONZALO) and Last Author (REVIRIEGO VASALLO, PEDRO).

The corresponding author is MARTINEZ RUIZ DE ARCAUTE, GONZALO.

Project objectives

The contribution pursues the following objectives: to analyze the ability of large language models (LLMs) to recognize different Spanish dialects; to evaluate lexical, morphological, and syntactic variation across regional varieties of Spanish; to determine whether LLMs show a preference or bias toward specific dialects; to characterize the effectiveness of a set of 30 multiple-choice questions designed to detect dialectal biases; and to contribute to the development of more inclusive and fair evaluation frameworks for Spanish natural language processing, promoting equitable representation of linguistic varieties in text generation and comprehension tasks.

Most relevant results

The most relevant results of this contribution concern the evaluation of the ability of large language models (LLMs) to recognize Spanish dialects. First, a dataset of 30 multiple-choice questions was developed, designed to identify lexical, morphological, and syntactic regional variation. Second, each question was written and reviewed by linguistic experts, ensuring accuracy and effectiveness in detecting dialectal biases. Third, the analysis revealed significant differences in the ability of LLMs to identify specific varieties of Spanish, pointing to possible dialectal preferences. Finally, this work establishes an inclusive framework for evaluating the equitable representation of Spanish in natural language processing tasks.

Awards linked to the item

This work was supported by the FUN4DATE (PID2022-136684OB-C21/C22) and SMARTY (PCI2024-153434) projects funded by the Spanish Agencia Estatal de Investigacion (AEI) 10.13039/501100011033, by the European Union Chips Act Joint Undertaking project SMARTY (Grant no 101140087) and by the OpenAI Researcher Access Program. The evaluation was also done in part with equipment that was donated by NVIDIA to support our research.