Spanish is not just one: A dataset of Spanish dialect recognition for LLMs

October 14, 2025


Published in: Data in Brief, vol. 63, article 112088, 2025-12-01. DOI: 10.1016/j.dib.2025.112088

Authors:

Martinez, G; Mayor-Rocher, M; Huertas, CP; Melero, N; Grandury, M; Reviriego, P


Affiliations

NYU, C Barquillo 13,Madrid Campus, Madrid 28004, Spain - Author

SomosNLP, Madrid, Spain - Author

Univ Autonoma Madrid, Fac Filosofia & Letras, C-Francisco Tomas & Valiente 1, Madrid 28049, Spain - Author

Univ Politecn Madrid, Informat Proc & Telecommun Ctr IPTC, Avda Complutense 30, Madrid 28040, Spain - Author

Abstract

This paper presents a dataset designed to assess the capability of Large Language Models (LLMs) in handling different Spanish dialects. While multilingualism is widely recognized as a crucial aspect of NLP, dialectal evaluation remains largely unexplored. Spanish, spoken by over 600 million people, exhibits significant lexical, morphological, and syntactic variation across regions. Recognizing these linguistic and cultural differences is essential for preserving smaller dialects, preventing their marginalization, and ensuring that Spanish is not reduced to a monolithic language. To address this gap, we introduce a dataset specifically designed to analyze whether LLMs can accurately identify different Spanish varieties while also measuring their potential preference for specific dialects. The dataset consists of 30 carefully crafted multiple-choice questions, requiring models to select the most appropriate option from different regional variations. Each question has been meticulously developed and reviewed by linguistic experts, undergoing multiple refinement cycles to ensure linguistic accuracy and effectiveness in detecting dialectal biases. This dataset represents an important step toward developing more inclusive and fair evaluation frameworks for Spanish Natural Language Processing (NLP). By identifying potential biases in LLMs and analyzing their ability to adapt to regional linguistic variations, this work contributes to the broader goal of equitable language representation in AI-driven text generation and comprehension tasks. (c) 2025 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license.


Keywords

AI; Artificial intelligence; Computational linguistics; Evaluation; Language model; Language processing; Language variation; Large language model; Large language models; Learning algorithms; Learning systems; Linguistics; Machine learning; Machine-learning; Multilingualism; Natural language processing; Natural language processing systems; Natural languages; Spanish dialects; Syntactics

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal Data in Brief. Although the journal is classified in quartile Q3 (agency: WoS (JCR)), its regional focus and its specialization in Multidisciplinary Sciences give it significant international recognition within a specific niche of scientific knowledge.


Leadership analysis of institutional authors

This work was carried out with international collaboration, specifically with researchers from the United States of America.

There is a significant leadership presence as some of the institution’s authors appear as the first or last signer, detailed as follows: First Author (MARTINEZ RUIZ DE ARCAUTE, GONZALO) and Last Author (REVIRIEGO VASALLO, PEDRO).

The corresponding author is MARTINEZ RUIZ DE ARCAUTE, GONZALO.


Project objectives

The contribution pursues the following objectives: to analyze the ability of large language models (LLMs) to recognize different Spanish dialects; to evaluate lexical, morphological, and syntactic variation across regional varieties of Spanish; to determine whether LLMs show a preference or bias toward specific dialects; to characterize the effectiveness of a set of 30 multiple-choice questions designed to detect dialectal biases; and to contribute to the development of more inclusive and fair evaluation frameworks for Spanish natural language processing, promoting equitable representation of linguistic varieties in text generation and comprehension tasks.


Most relevant results

The most relevant results of this contribution concern the evaluation of the ability of large language models (LLMs) to recognize Spanish dialects. First, a dataset of 30 multiple-choice questions was developed, designed to identify lexical, morphological, and syntactic regional variation. Second, each question was written and reviewed by linguistic experts, ensuring accuracy and effectiveness in detecting dialectal biases. Third, the analysis revealed significant differences in the ability of LLMs to identify specific varieties of Spanish, pointing to possible dialectal preferences. Finally, this work establishes an inclusive framework for evaluating the equitable representation of Spanish in natural language processing tasks.
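As an illustration of how answers to multiple-choice questions of this kind might be scored, the following is a minimal Python sketch. It is not the authors' evaluation code: the record schema (`id`, `target`, `options`), the variety labels, and the function name `score_dialect_answers` are all hypothetical, chosen only to show one way of computing per-variety recognition accuracy alongside a tally of which varieties a model tends to pick.

```python
from collections import Counter, defaultdict


def score_dialect_answers(questions, answers):
    """Score multiple-choice answers (hypothetical schema, not the dataset's own).

    questions: list of dicts with keys
        "id"      -> question identifier
        "target"  -> variety that counts as correct, or None for a
                     preference-only question with no single right answer
        "options" -> dict mapping an option letter to a variety label
    answers: dict mapping question id to the option letter the model chose

    Returns (accuracy per target variety, Counter of chosen varieties).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    chosen = Counter()

    for q in questions:
        target = q.get("target")
        if target is not None:
            total[target] += 1  # unanswered questions still count against accuracy

        variety = q["options"].get(answers.get(q["id"]))
        if variety is None:
            continue  # missing or invalid answer

        chosen[variety] += 1
        if target is not None and variety == target:
            correct[target] += 1

    accuracy = {v: correct[v] / total[v] for v in total}
    return accuracy, chosen


# Toy data, invented for illustration only.
options = {"A": "Spain", "B": "Mexico", "C": "Argentina"}
questions = [
    {"id": "q1", "target": "Mexico", "options": options},
    {"id": "q2", "target": "Argentina", "options": options},
    {"id": "q3", "target": None, "options": options},  # preference question
]
answers = {"q1": "B", "q2": "A", "q3": "A"}

accuracy, chosen = score_dialect_answers(questions, answers)
print(accuracy)
print(chosen.most_common())
```

Keeping accuracy and the choice tally separate reflects the two aims described above: whether a model identifies a variety correctly, and whether it leans toward one variety when nothing forces the choice.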


Awards linked to the item

This work was supported by the FUN4DATE (PID2022-136684OB-C21/C22) and SMARTY (PCI2024-153434) projects funded by the Spanish Agencia Estatal de Investigacion (AEI) 10.13039/501100011033, by the European Union Chips Act Joint Undertaking project SMARTY (Grant no 101140087) and by the OpenAI Researcher Access Program. The evaluation was also done in part with equipment that was donated by NVIDIA to support our research.

