
November 26, 2024
An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection

Published in: Journal of Universal Computer Science, 30(9): 1163-1183, 2024. DOI: 10.3897/jucs.134739

Authors: Çetin, O; Ekmekcioglu, E; Arief, B; Hernandez-Castro, J

Affiliations

Sabanci Univ, Istanbul, Turkiye - Author
Univ Kent, Canterbury, England - Author
Univ Politecn Madrid, Madrid, Spain - Author

Abstract

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However, as is the case with any complex software system, creating sophisticated software free from any security vulnerabilities is a very challenging task. One method to enhance software security is static code analysis, a technique that can identify potential vulnerabilities in source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but there are still very few, and quite limited, studies in the literature on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbots: ChatGPT-4, ChatGPT-3.5, Claude, Bard/Gemini, and Llama-2, assessing their ability to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated these LLMs' false-positive rates using 97 patched code samples. We focused specifically on PHP vulnerabilities, given PHP's prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%.
For all models, a significant percentage of vulnerabilities were classified as only partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories compared to the other models, while Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories. Surprisingly, our findings reveal high false positive rates across all LLMs. Even the best-performing model (ChatGPT-4) had a false positive rate of nearly 63%, while several models performed far worse, with false positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal enhancement in vulnerability detection rates. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP only.
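The metrics reported in the abstract (detection rate, miss rate, false positive rate, and the union-of-models ensemble) can be sketched as follows. This is a minimal illustrative example, not the paper's actual evaluation code; the verdict labels, function names, and toy data are all hypothetical.

```python
# Illustrative sketch: computing the evaluation metrics described above.
# Each vulnerable sample gets a verdict of "found", "partial", or "missed";
# each patched sample gets True (flagged, i.e. a false positive) or False.
from collections import Counter

def detection_rates(verdicts):
    """Fraction of vulnerable samples found / partially found / missed."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: counts[k] / n for k in ("found", "partial", "missed")}

def false_positive_rate(patched_flags):
    """Fraction of patched (non-vulnerable) samples flagged as vulnerable."""
    return sum(patched_flags) / len(patched_flags)

def ensemble_found(per_model_verdicts):
    """Union ensemble: a vulnerability counts as found if any model found it."""
    return [any(v == "found" for v in sample)
            for sample in zip(*per_model_verdicts)]

# Toy data: two hypothetical models over four vulnerable samples.
model_a = ["found", "missed", "partial", "found"]
model_b = ["missed", "found", "missed", "found"]

print(detection_rates(model_a))  # -> {'found': 0.5, 'partial': 0.25, 'missed': 0.25}
# Union ensemble finds 3 of 4, versus 2 of 4 for model_a alone:
print(sum(ensemble_found([model_a, model_b])) / len(model_a))  # -> 0.75
```

In this toy setup the ensemble lifts the detection rate from 0.5 to 0.75, but because false positives also accumulate under a union, the net benefit can be marginal, which is consistent with the behaviour the study reports.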

Keywords

Bard; ChatGPT; Claude; Gemini; Llama-2; LLM in cybersecurity; PHP vulnerabilities; Static code analysis; Vulnerability detection

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the Journal of Universal Computer Science. Although the journal is classified in quartile Q3 (WoS, JCR), its regional focus and specialization in Computer Science, Theory & Methods give it significant recognition in a specific niche of scientific knowledge at the international level.

Regardless of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the different indexing agencies, the number of citations accumulated by this publication as of 2025-11-11 is:

  • Scopus: 1

Impact and social visibility

From the perspective of influence or social adoption, and based on metrics associated with mentions and interactions provided by agencies specializing in calculating the so-called "Alternative or Social Metrics," we can highlight as of 2025-11-11:

  • The use of this contribution in bookmarks, code forks, additions to favorite lists for recurrent reading, as well as general views, indicates that others are using the publication as a basis for their current work. This may be a notable indicator of future, more formal academic citations. This claim is supported by the "Capture" indicator, which yields a total of 26 (PlumX).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Additionally, the work has been submitted to a journal classified as Diamond in relation to this type of editorial policy.

Leadership analysis of institutional authors

This work has been carried out with international collaboration, specifically with researchers from: Turkey; United Kingdom.

There is a significant leadership presence, as one of the institution's authors appears as the first or last author, detailed as follows: Last Author (HERNANDEZ CASTRO, JULIO CESAR).