An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection

November 26, 2024

Published in: Journal of Universal Computer Science, 30(9): 1163-1183, 2024-01-01. DOI: 10.3897/jucs.134739

Authors: Çetin, O; Ekmekcioglu, E; Arief, B; Hernandez-Castro, J

Affiliations

Sabanci Univ, Istanbul, Turkiye - Author or co-author

Univ Kent, Canterbury, England - Author or co-author

Univ Politecn Madrid, Madrid, Spain - Author or co-author

Abstract

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However, as is the case with any complex software system, creating sophisticated software free from any security vulnerabilities is a very challenging task. One method to enhance software security is by employing static code analysis. This technique can be used to identify potential vulnerabilities in the source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but there are still very few and quite limited studies in the literature on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbot models: ChatGPT-4, ChatGPT-3.5, Claude, Bard/Gemini, and Llama-2, assessing their abilities to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated issues related to these LLMs' false-positive rates using 97 patched code samples. We specifically focused on PHP vulnerabilities, given its prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%.
For all models, there is a significant percentage of vulnerabilities that were classified as partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories, compared to other models. Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories listed. Surprisingly, our findings reveal high false positive rates across all LLMs. Even the model demonstrating the best performance (ChatGPT-4) notched a false positive rate of nearly 63%, while several models glaringly under-performed, hitting startlingly bad false positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal enhancement in the rates of vulnerability detection. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP only.

Keywords

Bard; ChatGPT; Claude; Gemini; Llama-2; LLM in cybersecurity; PHP vulnerabilities; Static code analysis; Vulnerability detection

Quality indicators

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the Journal of Universal Computer Science. Although the journal is classified in quartile Q3 (WoS (JCR) agency), its regional focus and its specialization in Computer Science, Theory & Methods give it sufficiently significant recognition within a specific niche of scientific knowledge at the international level.

2025-08-18:

Scopus: 1

Leadership analysis of institutional authors

This work was carried out with international collaboration, specifically with researchers from Turkey and the United Kingdom.

There is significant leadership, since some of the authors belonging to the institution appear as first or last author, as shown in the detail: last author (HERNANDEZ CASTRO, JULIO CESAR).
