
November 26, 2024
An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection

Published in: Journal of Universal Computer Science, 30(9): 1163-1183, 2024. DOI: 10.3897/jucs.134739

Authors: Çetin, O; Ekmekcioglu, E; Arief, B; Hernandez-Castro, J

Affiliations

Sabanci Univ, Istanbul, Turkiye - Author
Univ Kent, Canterbury, England - Author
Univ Politecn Madrid, Madrid, Spain - Author

Abstract

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However, as is the case with any complex software system, creating sophisticated software free from any security vulnerabilities is a very challenging task. One method to enhance software security is static code analysis, a technique that can identify potential vulnerabilities in source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but there are still very few, and quite limited, studies in the literature on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbots: ChatGPT-4, ChatGPT-3.5, Claude, Bard/Gemini, and Llama-2, assessing their ability to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated these LLMs' false-positive rates using 97 patched code samples. We focused specifically on PHP vulnerabilities, given PHP's prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%.
For all models, a significant percentage of vulnerabilities were classified as only partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories compared to the other models, while Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories. Surprisingly, our findings reveal high false positive rates across all LLMs. Even the best-performing model (ChatGPT-4) had a false positive rate of nearly 63%, while several models performed far worse, with false positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal enhancement in vulnerability detection rates. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP only.
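The metrics reported in the abstract (detection rate, miss rate, false positive rate, and the union-of-models ensemble) can be sketched as follows. This is a minimal illustrative example, not the paper's actual evaluation code; the verdict labels, function names, and toy data are all hypothetical.

```python
# Illustrative sketch: computing the evaluation metrics described above.
# Each vulnerable sample gets a verdict of "found", "partial", or "missed";
# each patched sample gets True (flagged, i.e. a false positive) or False.
from collections import Counter

def detection_rates(verdicts):
    """Fraction of vulnerable samples found / partially found / missed."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: counts[k] / n for k in ("found", "partial", "missed")}

def false_positive_rate(patched_flags):
    """Fraction of patched (non-vulnerable) samples flagged as vulnerable."""
    return sum(patched_flags) / len(patched_flags)

def ensemble_found(per_model_verdicts):
    """Union ensemble: a vulnerability counts as found if any model found it."""
    return [any(v == "found" for v in sample)
            for sample in zip(*per_model_verdicts)]

# Toy data: two hypothetical models over four vulnerable samples.
model_a = ["found", "missed", "partial", "found"]
model_b = ["missed", "found", "missed", "found"]

print(detection_rates(model_a))  # -> {'found': 0.5, 'partial': 0.25, 'missed': 0.25}
# Union ensemble finds 3 of 4, versus 2 of 4 for model_a alone:
print(sum(ensemble_found([model_a, model_b])) / len(model_a))  # -> 0.75
```

In this toy setup the ensemble lifts the detection rate from 0.5 to 0.75, but because false positives also accumulate under a union, the net benefit can be marginal, which is consistent with the behaviour the study reports.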

Keywords

Bard; ChatGPT; Claude; Gemini; Llama-2; LLM in cybersecurity; PHP vulnerabilities; Static code analysis; Vulnerability detection

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the Journal of Universal Computer Science. Although the journal is classified in quartile Q3 (WoS, JCR), its regional focus and specialization in Computer Science, Theory & Methods give it significant recognition in a specific niche of scientific knowledge at the international level.

Regardless of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the different indexing agencies, the number of citations accumulated by this publication as of 2025-11-11 is:

  • Scopus: 1

Impact and social visibility

From the perspective of influence or social adoption, and based on metrics associated with mentions and interactions provided by agencies specializing in calculating the so-called "Alternative or Social Metrics," we can highlight as of 2025-11-11:

  • The use of this contribution in bookmarks, code forks, additions to favorite lists for recurrent reading, as well as general views, indicates that others are using the publication as a basis for their current work. This may be a notable indicator of future, more formal academic citations. This claim is supported by the "Capture" indicator, which yields a total of 26 (PlumX).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Additionally, the work has been submitted to a journal classified as Diamond in relation to this type of editorial policy.

Leadership analysis of institutional authors

This work has been carried out with international collaboration, specifically with researchers from: Turkey; United Kingdom.

There is a significant leadership presence, as one of the institution's authors appears as the first or last author, detailed as follows: Last Author (HERNANDEZ CASTRO, JULIO CESAR).