Reviewing the recent cybersecurity study by researchers Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang, I found myself intrigued. Impressed that GPT-4 could reportedly exploit the majority of the one-day vulnerabilities it was tested on, I began thinking about what these findings mean for cybersecurity, and in particular how the outcomes would compare if a human cybersecurity expert performed the same tasks.

Seeking answers, I spoke with Shanchieh Yang, Director of Research at the Global Cybersecurity Institute at the Rochester Institute of Technology. As it turned out, he had been asking similar questions since reading the research.

What are your opinions regarding the research findings?

Yang: I believe the claim of 87% success may be an exaggeration. It would benefit the community if the authors provided more insight into their experiments and code; that transparency would help others examine the work. I view large language models (LLMs) as a hacking co-pilot, where human input, options, and feedback are essential; they serve more as educational aids for training than as autonomous hacking tools. I also question whether the study was entirely automated, without any human intervention.

Compared with six months ago, LLMs now offer significant guidance on exploiting vulnerabilities: they suggest tools, provide commands, and outline step-by-step processes. They are mostly accurate, but not infallible. The term “one-day” vulnerability covers a wide range of scenarios, from familiar vulnerabilities to entirely novel malware with unique source code. For entirely new vulnerability types, LLMs are likely to be less effective, because breaking into uncharted territory still requires human understanding.

The results also depend on the type of vulnerability, whether it sits in a web service, a SQL server, a print server, or a router; there are a multitude of computing vulnerabilities. Personally, I find the 87% claim overstated, since its accuracy hinges on how many attempts the authors made. If I were reviewing this as a paper, I might challenge the claim as too broad a generalization.
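
To make the co-pilot framing above concrete, here is a minimal sketch of a human-in-the-loop workflow. This is not the study's code, which has not been released; it assumes the official OpenAI Python client, and the model name, prompts, and scenario are purely illustrative. The model only suggests a next step against an authorized lab target, and a human operator reviews every suggestion before anything is done.

```python
# Minimal human-in-the-loop "co-pilot" sketch (illustrative only, not the paper's agent).
# Assumes the official OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": ("You are a penetration-testing assistant for an authorized lab exercise. "
                 "Suggest one next step at a time and wait for operator feedback.")},
    {"role": "user",
     "content": ("Target: a lab VM running a web service affected by a published CVE. "
                 "Suggest a first reconnaissance step.")},
]

while True:
    # The model proposes a next step; it does not execute anything.
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    suggestion = reply.choices[0].message.content
    print("\nModel suggests:\n", suggestion)

    # A human stays in the loop: every suggestion is reviewed before acting.
    feedback = input("\nOperator feedback (type 'done' to stop): ")
    if feedback.strip().lower() == "done":
        break
    messages.append({"role": "assistant", "content": suggestion})
    messages.append({"role": "user", "content": feedback})
```

Whether a loop like this counts as "autonomous hacking" or as a guided study aid depends entirely on how much the operator contributes at each turn, which is exactly the detail Yang would like the authors to disclose.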

If a group of cybersecurity professionals were pitted against an LLM agent in a race to exploit unknown vulnerabilities in a target, such as a newly released Hack The Box or TryHackMe machine, who would emerge victorious?

Yang: The experts would win. Skilled hackers, ethical hackers, white hats have a wide arsenal of tools, experience, and speed on their side. The inherent limitation of LLMs is that they are machines: even a cutting-edge model cannot offer solutions without appropriate prompts. The outcome therefore depends heavily on the inputs used, a dependence that is hard to assess because the researchers have not released their code.

Any additional reflections on the research?

Yang: I urge the community to prioritize responsible dissemination: share findings not just for citations or self-promotion, but with a sense of responsibility. Being transparent about the experiments, sharing code, and proposing potential solutions are the key things to consider.