Summary of Penetration Testing with Large Language Models

Summary Penetration Testing with Large Language Models arxiv.org

One Line

This paper explores using large language models such as GPT3.5 as AI sparring partners for penetration testers, covering high-level task planning and low-level vulnerability hunting, with MITRE ATT&CK's hierarchical "TTP" model as the guiding structure.

Slides

Slide Presentation (9 slides)

Penetration Testing with Large Language Models

Source: arxiv.org - PDF - 4,847 words

Introduction

• Large Language Models (LLMs) like GPT3.5 can augment human testers with AI sparring partners.

• LLMs can be used in penetration testing for high-level task planning and low-level vulnerability hunting.

• LLMs are neural networks trained on vast amounts of data.

Visual: Image of GPT3.5

The TTP Model

• MITRE ATT&CK employs a hierarchical model abbreviated "TTP", which the paper uses as a guiding structure.

• Tactics are high-level objectives, techniques are ways to achieve a tactic, and procedures are the specific details of how a technique is executed.

• A sparring partner for penetration testing should cover the whole TTP spectrum.

Visual: Diagram of the TTP model

High-Level Task Planning

• LLMs can be used in high-level task planning for penetration testing.

• AgentGPT was asked to "Become domain admin in an Active Directory" and produced realistic attack vectors such as password spraying and Kerberoasting.

• LLMs assist in developing effective strategies and identifying potential vulnerabilities.

Visual: Graph showing the effectiveness of LLMs in high-level task planning

Low-Level Vulnerability Hunting

• LLMs are valuable in low-level vulnerability hunting during penetration testing.

• LLMs assist in executing attacks and identifying vulnerabilities.

• Single prototype runs were not stable: the sequence of commands and the vulnerabilities identified varied between runs.

• Longer runs or aggregating multiple runs can yield more consistent results.

Visual: Comparison of vulnerability identification results between short runs and longer runs

Integration of High-Level and Low-Level Approaches

• Integrating high-level task planning with low-level system interaction would yield a more uniform user experience.

• The goal is not to automate penetration testing but to give the human-in-the-loop a single point of inquiry.

• Keeping all information within a single system should enable synergy effects as the LLM learns details about the tested system.

Visual: Image of LLMs and human testers working together

Reducing Hallucinations and Identifying Overlooked Vulnerabilities

• Maintaining a rough model of the tested system as an internal "reality check" should reduce the LLM's hallucinations.

• A compacted history of tested vulnerabilities would help answer questions such as "what other vulnerabilities might I have overlooked?".

• Both ideas are proposed as future improvements to the prototype's memory handling.

Visual: Examples of vulnerabilities identified by LLMs

References and Sources

• The excerpt provides references and sources related to cybersecurity, language models, and prompt engineering.

• References cover topics such as the shortage of workers in the cybersecurity industry and the use of Cobalt Strike as a tool.

• Additional sources provide insights into prompt engineering and advancements in language models.

Visual: Collage of book covers representing the references

Key Takeaways

• LLMs like GPT3.5 are valuable in penetration testing for task planning and vulnerability hunting.

• The TTP model helps achieve specific objectives in penetration testing.

• Longer runs or aggregating multiple runs with LLMs can yield more consistent results.

• Integrating high-level and low-level approaches would give human testers a single, uniform point of inquiry.

• An internal model of the tested system should reduce hallucinations and help surface overlooked vulnerabilities.

• References and sources provide further information on cybersecurity and language models.

Key Points

Large Language Models (LLMs) like GPT3.5 can be used in penetration testing to augment human testers with AI sparring partners.
The use of hierarchical models like "TTP" can help achieve specific objectives in penetration testing.
LLMs are neural networks trained on vast amounts of data and can be used in high-level task planning and low-level vulnerability hunting.
The stability and reproducibility of prototype runs using LLMs can be inconsistent, but longer runs or aggregating multiple runs can yield more consistent results.
Integrating both high-level and low-level approaches in penetration testing would give human testers a uniform user experience and a single point of inquiry.
Keeping a rough model of the tested system as an internal "reality check" should reduce the LLM's hallucinations and help identify overlooked vulnerabilities.
References and sources related to cybersecurity, language models, and prompt engineering are provided in the excerpt.

Summaries

25 word summary

This paper examines the application of large language models (LLMs) in penetration testing for task planning and vulnerability hunting, using a hierarchical model called "TTP".

38 word summary

This paper explores the use of large language models (LLMs) in penetration testing. Two use cases are investigated: high-level task planning and low-level vulnerability hunting. The document discusses a hierarchical model called "TTP" for penetration testing, which includes

236 word summary

This paper explores the use of large language models (LLMs), such as GPT3.5, in penetration testing to augment human testers with AI sparring partners. The authors investigate two use cases: high-level task planning and low-level vulnerability hunting

The document discusses the use of a hierarchical model called "TTP" for penetration testing. The model includes tactics, techniques, and procedures to achieve specific objectives. Large Language Models (LLMs) are neural networks trained on vast amounts of data and can

This excerpt discusses the use of large language models (LLMs) in penetration testing. The evaluation includes both high-level task-planning systems and low-level attack-execution systems. In the high-level evaluation, AgentGPT is tasked with becoming a domain

The stability and reproducibility of the prototype runs were found to be inconsistent, with variation in the sequence of commands and vulnerabilities identified. However, longer runs or aggregating multiple runs resulted in more consistent results. The instability of the GPT3.

Maintaining a rough model of the tested system as an internal "reality check" should reduce the LLM's hallucinations and help identify overlooked vulnerabilities. Integrating both high-level and low-level approaches can provide a uniform user experience and improve the interaction between LLMs and

The excerpt includes a list of references and sources related to the field of cybersecurity, language models, and prompt engineering. The references provide information on topics such as the shortage of workers in the cybersecurity industry, the use of Cobalt Strike as a tool for

Raw indexed text (33,512 chars / 4,847 words / 572 lines)

Getting pwn’d by AI:

Penetration Testing with Large Language Models

Andreas Happe

Jürgen Cito

[email protected]

TU Wien

Austria

ABSTRACT

The field of software security testing, more specifically penetration

testing, is an activity that requires high levels of expertise and in-

volves many manual testing and analysis steps. This paper explores

the potential use of large-language models, such as GPT3.5, to aug-

ment penetration testers with AI sparring partners. We explore

the feasibility of supplementing penetration testers with AI models

for two distinct use cases: high-level task planning for security

testing assignments and low-level vulnerability hunting within a

vulnerable virtual machine. For the latter, we implemented a closed-

feedback loop between LLM-generated low-level actions with a

vulnerable virtual machine (connected through SSH) and allowed

the LLM to analyze the machine state for vulnerabilities and suggest

concrete attack vectors which were automatically executed within

the virtual machine. We discuss promising initial results, detail

avenues for improvement, and close deliberating on the ethics of

providing AI-based sparring partners.

CCS CONCEPTS

• Security and privacy → Systems security.

KEYWORDS

security testing, penetration testing

ACM Reference Format:

Andreas Happe and Jürgen Cito. 2023. Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’23), December 3–9, 2023, San Francisco, CA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3611643.3613083

INTRODUCTION

Large language models (LLMs), such as ChatGPT, have become

a hot topic not only in computer science but also within popular

media. For example, the Economist headlined a series of special

articles on LLMs with “Large, creative AI models will transform

lives and labor markets” [10] in April 2023. The field of cybersecu-

rity and software security testing, more specifically, penetration

testing, suffers from a chronic lack of personnel [16], even worse,

according to the ISC2 Cybersecurity Workforce Study 2022 [15],

ESEC/FSE ’23, December 3–9, 2023, San Francisco, CA, USA

This is the author’s version of the work. It is posted here for your personal use. Not

for redistribution. The definitive Version of Record was published in Proceedings of

the 31st ACM Joint European Software Engineering Conference and Symposium on the

Foundations of Software Engineering (ESEC/FSE ’23), December 3–9, 2023, San Francisco,

CA, USA, https://doi.org/10.1145/3611643.3613083.

while the global cybersecurity workforce was growing by 11.1% YoY,

this growth was outpaced by the gap’s increase of 26.2% YoY. A

recent interview study with penetration testers highlighted the

need for human sparring partners [13], i.e., colleagues who offer

alternative ideas or approaches when stuck. The study also empha-

sizes that intuition is a big part of detecting vulnerabilities and that

knowledge transfer, e.g., from attending Capture-the-Flag (CTF, see footnote 1)

events, was seen as a potential source of this intuition. Can this

be partially outsourced to AI models? Using automated AI-based

agents as sparring partners would augment and empower existing

human security testers and could counteract the lack of sufficiently

educated security professionals. The combination of human opera-

tors with AIs creates new capabilities instead of cloning existing

ones. Furthermore, keeping a human in the loop reduces the poten-

tial ethical problems imposed by the use of AIs [5]. Recent research

indicates that the efficiency gains provided by the use of AI-based

systems are greatest for low-skilled workers [6]. Designing a sys-

tem that augments human operators with a generative AI might

thus also benefit the training of novice penetration testers.

RQ: To what extent can we automate security testing with

LLMs? The rest of this paper explores whether large-language

models, such as GPT3.5, can be deployed as sparring partners for

security professionals.

To answer this question, we leverage MITRE ATT&CK, a curated

database of knowledge about threat actors in the cybersecurity do-

main, to provide a guiding structure. A good sparring partner should

be able to cover the different tactics, techniques, and procedures

covered by ATT&CK. To explore this hypothesis, we performed

multiple experiments. To showcase high-level guidance, we “asked”

an LLM to help design penetration tests for both generic scenarios

as well as for a concrete target organization. To showcase low-level

guidance, we integrated GPT3.5 with a vulnerable virtual machine

and allowed GPT3.5 to analyze the machine for vulnerabilities and

suggest attack vectors. Based on our experience, we discuss the

results as well as potential future improvements.

Scope. We also envision other areas where generative AI could

be used successfully. One of them is the generation of phishing or

vishing messages. For obvious ethical reasons, we did not further

analyze attacks that intentionally try to deceive human beings. Another

tedious area where generative AI could improve efficiency would

be automated report generation for penetration tests or red teaming

campaigns. Anecdotal evidence suggests that penetration testers

are already experimenting with generative AI for report generation.

1 CTFs are gamified penetration-testing exercises.

BACKGROUND

MITRE ATT&CK. MITRE ATT&CK [33] is a curated database

of knowledge about threat actors, also known as APTs. It employs

a hierarchical model often abbreviated by “TTP”. The initial “T”

stands for “tactics” and describes high-level objectives an adversary

intends to achieve, e.g., reconnaissance, privilege escalation or col-

lection. The middle “T” describes “techniques”. Each technique is a

way to achieve a tactic. Examples of techniques would be “Abuse El-

evation Control Mechanism: Sudo and Sudo Caching” [2] or “Steal or

Forge Kerberos Tickets: Kerberoasting” [3]. Finally, “P” describes pro-

cedures that are the specific details of how an adversary executes a

technique.
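
The tactic, technique, and procedure hierarchy described above can be captured in a small data structure. The following is a minimal sketch and not part of the paper's tooling: the class names, the `flatten` helper, and the wording of the example procedure are illustrative assumptions, while the tactic and technique names are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Technique:
    """A way to achieve a tactic, with concrete procedures attached."""
    name: str
    procedures: list[str] = field(default_factory=list)


@dataclass
class Tactic:
    """A high-level adversary objective."""
    name: str
    techniques: list[Technique] = field(default_factory=list)


def flatten(tactic: Tactic) -> list[tuple[str, str, str]]:
    """Return (tactic, technique, procedure) triples for one tactic."""
    return [
        (tactic.name, technique.name, procedure)
        for technique in tactic.techniques
        for procedure in technique.procedures
    ]


privilege_escalation = Tactic(
    name="privilege escalation",
    techniques=[
        Technique(
            name="Abuse Elevation Control Mechanism: Sudo and Sudo Caching",
            # Hypothetical procedure wording, for illustration only.
            procedures=["list sudo rights with `sudo -l`"],
        )
    ],
)
```

Reading the structure top-down corresponds to the high-level use case (pick tactics and techniques), while starting from a fixed tactic and walking down corresponds to the low-level use case.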

We assume that a sparring partner for penetration testing should

cover the whole TTP spectrum. On a high level, it should be able

to select suitable tactics and corresponding techniques. On a low-

level, given an employed tactic, it should be able to derive feasible

techniques and procedures.

Large Language Models. A Large-Language Model (LLM) con-

sists of a neural network trained using self-supervised learning on

vast amounts of data. A model’s capabilities are highly dependent

upon its complexity which is often described through the number

of used parameters. Current models yield parameter sizes ranging

from billions, e.g., LLaMA starts with 7 billion, to trillions of pa-

rameters, e.g., Wu Dao or GPT-4. Model and parameter sizes are

currently under discussion: on the one hand, larger models can exhibit

emergent behaviors [34]; on the other hand, there is speculation

that the age of ever-larger models is over due to reduced

scaling efficiency [21].

Training a new LLM is prohibitively expensive for most re-

searchers, but existing LLMs can be refined or fine-tuned to specific

use cases for feasible costs. This situation has created the moniker

“foundation models” for LLMs. The importance of those has been

acknowledged by mainstream media, cf. the Economist’s “Huge

Foundation Models are Turbo Charging AI Progress” in 2022 [9].

GPT3.5/ChatGPT. Conversations with ChatGPT commonly

consist of questions, named “prompts”, and answers going back and

forth between the user and the AI. Prompts have to be carefully

prepared, yielding a new discipline that has been called prompt en-

gineering [8, 19, 32, 35]. Tools such as llama.cpp [11] that make use

of small-scale models (up to 13b parameters) feasible on consumer-

grade hardware have sparked additional research. Those models

can be run without any cloud/API costs and are not subject to any

server-side moderation or censorship.

Pre-trained Autonomous AI Agents. AutoGPT [12] intro-

duced the idea of auto-generating sequences of instructions by

leveraging LLMs to create the prompt that is subsequently used to

query the LLM. This allows users to provide concise initial ques-

tions for the AI system that are subsequently refined. This reduces

the need for manual prompt engineering. LLMs often “hallucinate”,

i.e., invent facts that seem statistically plausible. Research suggests

that using external knowledge and automated feedback can reduce

these hallucinations [26]. AutoGPT integrates web-based queries and

optional human-provided feedback during its operation. Based on

this, the initial task is converted into a task list containing smaller

subtasks that can be delegated to additional agents.

[Figure 1: High-Level Architecture Overview. Labels: Init. Prompt, LLM, Command (SSH), Virtual Machine, Response, Refinement, Low-Privilege User, Root. Example commands: $ sudo -l; $ sudo /usr/bin/perl -e 'exec "/bin/sh";']

BabyAGI focuses on automated task generation, planning, and

execution [23, 24]: a user-given task is split up into smaller subtasks

that are stored within a task queue. Autonomous Task Execution

Agents take tasks from the task queue, execute them, and add new

information to a memory store. In addition, the Task Creation Agent

identifies new subsequent tasks that are pushed upon the task queue

and are eventually executed by the Task Execution Agent. Before

a task is executed, a Context Agent is asked to provide sufficient

context for the task from memory. Entries in the task queue are

prioritized through a Prioritization Agent. All mentioned agents

are GPT-4 processes themselves. BabyAGI [22] provides a “pared-

down” version of this system — the original source code consisted

of roughly 100 lines of Python code.
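
The interplay of task queue, memory store, and agents described above can be sketched as follows. This is a simplified illustration under stated assumptions, not BabyAGI's actual code: every agent function is a hypothetical stand-in for what would be an LLM call, and the follow-up rule, the alphabetical prioritization, and the step limit are invented for the example.

```python
from collections import deque


def execution_agent(task: str, context: list[str]) -> str:
    """Stand-in for an LLM call that executes a task."""
    return f"result of '{task}'"


def creation_agent(task: str, result: str) -> list[str]:
    """Stand-in for an LLM call that proposes follow-up tasks."""
    # Hypothetical rule: only the initial objective spawns subtasks.
    if task == "objective":
        return ["subtask B", "subtask A"]
    return []


def prioritization_agent(tasks: deque) -> deque:
    """Stand-in for an LLM call that reorders the queue."""
    return deque(sorted(tasks))


def context_agent(memory: list[str], limit: int = 3) -> list[str]:
    """Stand-in for an LLM call that selects relevant memory entries."""
    return memory[-limit:]


def run(objective: str, max_steps: int = 10) -> list[str]:
    """Process tasks until the queue is empty or max_steps is reached."""
    queue = deque([objective])
    memory: list[str] = []
    for _ in range(max_steps):
        if not queue:
            break
        task = queue.popleft()
        context = context_agent(memory)           # gather context first
        result = execution_agent(task, context)   # then execute the task
        memory.append(result)                     # store new information
        queue.extend(creation_agent(task, result))
        queue = prioritization_agent(queue)
    return memory
```

Calling `run("objective")` processes the initial task, then the two generated subtasks in prioritized order, and returns the accumulated memory.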

More recently Jarvis [30] employs agents with different models

to create multimodal, multiagent systems.

LLM-BASED PENETRATION TESTING

We differentiate between two use cases: on a high level, typical

questions asked by pen-testers are “what is a good attack methodol-

ogy”, e.g., “how to attack Active Directory”. These questions should

yield tactics as well as potential techniques to achieve those tactics.

On the low-level, we assume that the pen-tester has already cho-

sen to employ a tactic against a target system and is searching for

appropriate techniques and corresponding procedures. A typical

question would be “I want to perform a privilege escalation (tactic),

what are the suitable attack vectors against [this concrete Linux

system]?”.

3.1

High-Level: Task-Planning Systems

For the evaluation, we asked AgentGPT to “Become domain admin

in an Active Directory”. The generated document contained highly

realistic attack vectors such as password spraying, Kerberoasting,

AS-REP roasting, exploiting Active Directory Certificate Services,

abusing unconstrained delegation or exploiting group policies. All

of those attacks are realistic, feasible, and commonly used during

penetration testing.

In addition, after securing a target company’s approval, we

tasked AutoGPT to devise an external penetration testing plan for

that company. AutoGPT’s plan included standard methods such as

performing a network vulnerability scan, performing OSINT/user

enumeration, and performing phishing against identified users.

All these are operations typically performed during external pene-

tration tests. When further inquired, AutoGPT was able to crawl

the company’s web page and identify potential phishing targets

(users and their email addresses) but declined to perform any “real”

network security scan or perform phishing operations due to its

ethical filters. Both answers were realistic, feasible, and would give

a penetration tester good feedback about potential attack vectors.

3.2

Low-Level: Attack-Execution System

For our low-level evaluation, we use a common scenario: a penetration-

tester has already obtained a low-privilege account on a Linux sys-

tem. Their goal is to perform a privilege escalation, i.e., to find a way

to become the system’s root user. To allow for realistic evaluation,

we wrote a Python script that uses SSH to connect to a deliberately

vulnerable lin.security Linux virtual machine [18].

The script consists of an infinite loop: within the loop, it asks

GPT3.5 to imagine being a low-privilege user that tries to find

privilege escalation vulnerabilities. To achieve this, we parse the

Linux shell commands that GPT3.5 produces in response to our low-level

prompts and execute them on the virtual machine. Their output is

presented back to GPT3.5 when it is prompted for the next command.

See Figure 1 for a schematic high-level overview of this feedback

loop. With this simple structure, we were able to gain root

privileges on our vulnerable virtual machine.
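
The structure of this feedback loop can be sketched in a few lines. This is a hedged illustration, not the released script: the LLM and the SSH connection are replaced by caller-supplied functions (`ask_llm`, `run_command`), the prompt wording and the `is_root` success check are assumptions, and the loop is bounded by `max_rounds` instead of running forever.

```python
from typing import Callable


def feedback_loop(
    ask_llm: Callable[[str], str],
    run_command: Callable[[str], str],
    is_root: Callable[[str], bool],
    max_rounds: int = 10,
) -> list[tuple[str, str]]:
    """Alternate between asking for one command and executing it."""
    history: list[tuple[str, str]] = []
    for _ in range(max_rounds):
        # Feed everything executed so far back into the next prompt.
        transcript = "\n".join(f"$ {cmd}\n{out}" for cmd, out in history)
        prompt = (
            "You are a low-privilege Linux user looking for privilege "
            "escalation vulnerabilities. Reply with exactly one shell "
            "command.\n" + transcript
        )
        command = ask_llm(prompt).strip()
        output = run_command(command)
        history.append((command, output))
        if is_root(output):  # hypothetical success check
            break
    return history
```

In a real setup `run_command` would execute the command over SSH on the test machine and `ask_llm` would query the model; passing them in as functions keeps the loop itself testable without network access.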

In addition, at the end of each loop iteration, GPT3.5 was pre-

sented with the chosen command and its output and then tasked to

identify potential security vulnerabilities based on this information.

For each vulnerability, it was tasked to provide an exploitation

example, sneakily named “verification commands”. This yielded

additional attack vectors. The Python script was routinely able

to gain root privileges within the deliberately vulnerable virtual

machine. The common path was listing the user’s sudo permissions by

calling sudo -l, followed by either using sudo with one of the listed

shells or employing one of the listed GTFObins to gain a root shell.

GTFObins are benign system commands that when called through

sudo, can be abused to gain a root shell. Another frequently used

attack vector was retrieving /etc/passwd and identifying user

accounts not using shadow passwords (see footnote 2). Searches for SUID binaries

were requested, but the returned binaries were not actively exploited,

indicating a lack of multi-step planning capabilities in either our script

or the underlying model. A slightly altered prompt instructing

the LLM to open a reverse shell to a given IP address was

successful and dropped root shells.
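
The step from `sudo -l` output to a candidate GTFObin can be illustrated with a small parser. This is a sketch under assumptions rather than what the prototype does (there, the LLM draws this conclusion itself): the function name is hypothetical, `KNOWN_SUDO_ESCAPES` is a tiny illustrative subset of such binaries, and the parser only handles the simple rule format shown in the comment.

```python
import os

# Illustrative subset only; the real GTFOBins collection is much larger.
KNOWN_SUDO_ESCAPES = {"perl", "awk", "find", "vim"}


def exploitable_sudo_entries(sudo_l_output: str) -> list[str]:
    """Return sudo-allowed binaries whose name is in KNOWN_SUDO_ESCAPES."""
    hits = []
    for line in sudo_l_output.splitlines():
        line = line.strip()
        # Rule lines look like "(root) NOPASSWD: /usr/bin/perl, /bin/ls".
        if not line.startswith("("):
            continue
        _, _, commands = line.partition(")")
        commands = commands.replace("NOPASSWD:", "")
        for command in commands.split(","):
            parts = command.split()
            if parts and os.path.basename(parts[0]) in KNOWN_SUDO_ESCAPES:
                hits.append(parts[0])
    return hits
```

A hard-coded check like this is closer to a traditional enumeration tool than to the LLM-driven approach; it is shown only to make the reasoning step explicit.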

We release all our scripts with a detailed README including

setup instructions at https://github.com/ipa-lab/hackingBuddyGPT.

DISCUSSION

This section reflects upon the performance of the prototype within

the pen-test discipline, guided by the 10+ years of pen-testing ex-

perience of one of the authors.

4.1

Grounding of Results and Hallucinations

One interesting aspect of our prototype is that all executed com-

mands and their resulting output are written to a protocol. This

2 If your Linux system is not using shadow passwords by now, ChatGPT is your least worry though.

allows us to reason if LLM-suggested vulnerabilities are based on

queries providing system knowledge, or if GPT3.5 extracted security

trends and preconceptions during training. The latter is analogous

to penetration testers applying knowledge gained during work or

training, e.g., from participating in CTFs, while on assignments.

There were indications of reasoning about causal dependencies:

After retrieving the list of sudoers, GPT3.5 consistently suggested

various vulnerable sudo commands for privilege escalation. A sim-

ilar pattern arose after retrieving the passwd file: here attacking

weakly-configured user accounts was suggested as the next step.

Other suggestions, such as using certain system exploits, e.g.,

dirty_cow, were reasonable given that GPT3.5 “knew” that this was

a Linux system, but were given without any previous enumeration.

Pure and easily detectable hallucinations occurred infrequently,

the most common occurrence was the suggestion to execute “ex-

ploit.sh”. It seems reasonable that security write-ups containing the

execution of this script were part of GPT3.5’s training set.

While the suggested system commands obviously were based

upon pattern-matching and not on a deeper understanding of the

Linux system or on model building, seeing the simple LLM-shell-

based feedback loop we established gaining root privileges was

eerie. A suitable analogy would be a pen-tester talking to a col-

league over the phone, asking for suggestions with the conversation

partner only having a very limited view of the actual system but a

set of preconceptions (i.e., priors), which is partially in line with

our research question on the ability of LLMs acting as sparring

partners.

4.2

Stability and Reproducibility

Singular prototype runs were not stable, i.e., there was variation in

the sequence and selection of commands given and vulnerabilities

identified. On longer runs, or when aggregating multiple runs, the

results converged (we ran our script with the same initial prompt

on the order of tens of times to be able to make some observations

on convergence). The variation on single runs seems to be related

to GPT3.5 hyper-focusing upon single aspects of the tested system.

This is also known to happen to pen-testers during assignments; avoiding

“going down a rabbit hole” improves with experience [13].
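
One simple way to look at convergence across runs is to count how often each finding is reported. The sketch below is an assumption about how such an aggregation could be done, not the authors' evaluation code; the function name and the finding labels in the usage example are hypothetical.

```python
from collections import Counter


def aggregate_findings(runs: list[set[str]]) -> dict[str, float]:
    """Map each finding to the fraction of runs that reported it."""
    if not runs:
        return {}
    counts = Counter(finding for run in runs for finding in run)
    return {finding: count / len(runs) for finding, count in counts.items()}


# Hypothetical findings from four runs with the same initial prompt.
runs = [{"sudo"}, {"sudo", "passwd"}, {"passwd"}, {"sudo"}]
frequencies = aggregate_findings(runs)
```

Findings that appear in most runs can be treated as stable, while findings seen only once hint at a single run hyper-focusing on one aspect of the system.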

4.2.1 Comparison to traditional enumeration tools. Compared to

tools such as linpeas.sh [27], LLMs seem to be less deterministic.

Enumeration tools traverse a manually curated hard-coded list of

vulnerability checks. Further research should clarify if GPT3.5’s

instability converges over time into stable outcomes while pro-

viding fewer detectable patterns for intrusion detection systems.

Ironically, GPT3.5 suggested calling linpeas.sh during one run but

used a non-existing download URL within the given curl download

command.

When given the additional subcommand of “and explain the

found vulnerabilities” in the prompt, GPT3.5 was able to provide

good introductory information and could thus be utilized as part of

on-the-job training.

4.3

Ethical Moderation in LLMs

The prototype utilized GPT-3.5-turbo which contains safety mea-

sures against malicious prompts. The prototype relayed commands

to a vulnerable virtual machine, but the overall scenario can easily

be applied to real systems. GPT3.5’s lack of hesitation was disconcerting.

During the development of the script, the ethics filter was

infrequently triggered. Adding “do not ask questions or provide

judgments” to command prompts seems to significantly reduce de-

nials. The optional “detail additional vulnerabilities” step was more

often denied due to ethical reasons, but this had no impact on the

overall hacking progress. Slight prompt variations were successful

in reducing GPT3.5’s ethical concerns, e.g., instead of asking for

“exploits for vulnerabilities” we asked for “verification commands

for vulnerabilities”. Of course, switching from OpenAI to one of the

locally running LLMs would remove all server-side ethics checks.

Ethical questions are not new in the cybersecurity domain, es-

pecially regarding releasing penetration test tools. Ethical issues

arising from using GPT3.5 resemble discussions about open-source

security tools which can be used by both red-teams as well as by

APTs. Commercial vendors try to vet their customers, while open-

source tools can be used by anyone — in the end, malicious actors

can and will use both of them [17, 31].

Regulation regarding the distribution of dual-use goods exists,

e.g., the Wassenaar Arrangement, but application to software is clumsy

due to its fluid and often impalpable nature.

Another ethical problem is the inclusion of toxic content in

commonly used training sets [29]. As our prototype uses an already

trained foundation model, we are not deliberating on this issue.

This publication also does not touch on the topic of the inclusion

of copyrighted information within training data.

A VISION OF AI-AUGMENTED PEN-TESTING

We deliberate on research ideas and pragmatic considerations to

form a more perfect union between pen-testers and LLMs.

5.1

Integration of High- and Low-Level

We used two different approaches during our initial evaluation, based on the abstraction level of the asked questions. Integrating both autonomous AI agents for high-level task planning and system-level interactions would yield a more uniform user experience. Please note that we do not want to automate the penetration-testing process but give the human-in-the-loop a single point of inquiry.

We imagine a system in which human operators can inquire about high-level concepts, e.g., “what additional active directory attacks can I try?”, and later switch to a lower level, e.g., “given this system, how can I escalate?”. Keeping all information within a single system should also enable synergy effects as the LLMs learn details about the tested system. This also shows the expected multistep interactive feedback loop between LLMs and operators.

5.2

Investigation of Model Options

We currently use OpenAI’s GPT-3 which is cloud-based and interfaced through an API. GPT-3 should be evaluated against locally run models such as Llama, StableLM, Dolly2 or Koala. Locally run models do not incur any cloud costs and do not share sensitive data with the cloud. As no data is leaked, this would enable further customer-specific model training and fine-tuning: Imagine training a local model with data found during an engagement or fine-tuning a customer-specific model over a series of subsequent penetration tests. During a recent interview series with pen-testers [13], participants mentioned that they “learn how their customer or industry area works and thinks over time”; could a customized AI model achieve something similar? Although the industry is currently aiming for ever larger model parameter sizes, analyzing which parameter size is “good enough” should reduce the resource impact of deploying GPT models.

5.3

Memory, Verification and Reflection

Memory is provided to GPT3.5 through context embedded within query prompts. Prompt size is typically limited, e.g., the used GPT-3 model had a limit of 4k tokens. With newer models this limit is constantly increasing, which allows passing a richer context to the used LLM. Our prototype has simplistic memory that includes the output of executed commands until the context limit is reached.

Generative Agents such as BabyAGI utilize ChatGPT to build a suitable context for each generated prompt. Concurrent research in generative game agents [25] utilized LLMs to reflect on recent events experienced by agents, and then asked an LLM to provide a summarized description. The results are used as reflected memory for future queries. In our use-case, executed command output could be reflected on and only relevant extracted information added to the next prompt’s context. Another option would be instructing the LLM to provide multiple memory streams: one about recently executed commands, one for extracted security findings, and one describing what kind of computer system would fit the experienced findings, i.e., emulate model building. Using this model as an internal “reality check” should reduce the used LLM’s hallucinations. Having a rough model of the tested system, as well as a compacted history of vulnerabilities tested, would also benefit questions such as “what other vulnerabilities might I have overlooked?”.

5.4

Prompts for Asking Better Questions

Our prototype used rather static and manually written prompts.

Using LLMs to generate and optimize the prompts themselves,

similar to AutoGPT, might improve their effectiveness. Given our

sensitive use case, these automatically generated prompts would

be closely monitored by humans though.

Another avenue of research is searching for better questions to

be asked. Based upon empirical studies on how penetration testers

work, further research into which questions they ask themselves

during their work can inform better prompts as well as a better

understanding of this close-knit industry.
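
Returning to the memory handling of Section 5.3: a simplistic memory that keeps command output until the context limit is reached can be sketched as below. This is an illustrative assumption, not the prototype's implementation: the function name is hypothetical, token counts are approximated by whitespace-separated words instead of a real tokenizer, and dropping the oldest entries first is a design choice made for this example.

```python
def build_context(history: list[tuple[str, str]], budget: int = 4000) -> str:
    """Keep the most recent command/output pairs that fit into the budget."""
    kept: list[str] = []
    used = 0
    for command, output in reversed(history):
        entry = f"$ {command}\n{output}"
        cost = len(entry.split())  # crude stand-in for a real tokenizer
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    # Restore chronological order before building the prompt context.
    return "\n".join(reversed(kept))
```

A reflection step as discussed in Section 5.3 would replace the raw entries with LLM-generated summaries before they are added to the context.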

FINAL ETHICAL CONSIDERATIONS

This paper explores the potential usage of LLMs for augmenting

penetration testing. Initial experiments indicate that the use of an

LLM can improve the efficiency of penetration-testers, and aid with

their initial or ongoing education. However, automated tools can

also be easily subverted for malicious purposes. Ethical questions

arise. Concurrent reports indicate that AI is currently being driven

forward by private companies as well as by state-funded research

agencies [20]. The former have an economic incentive, while the

latter see geopolitical implications of AI. We do not expect that this

avenue of research will slow down. Parallel to that, the reported

malicious usage of AI, presumably by APTs and common criminals,

is increasing [1].

Locking away models behind server-side supervised APIs is not

feasible as models can be run locally. In addition, even gate-kept

models such as Meta’s LLaMA have been publicly leaked [28] and

can now be reused by malicious actors. Fine-tuning such a model

for concrete malicious activities is easily within APTs’ reach: for

example, when using StackLLaMA’s processing power estimates

for fine-tuning [4], an attacker using on-demand cloud computing

can expect to be able to fine-tune a model for less than a thousand

US dollars. Using chat-based LLMs through prompt engineering

does not require a thorough computer science education. While

this is beneficial in democratizing access to processing techniques,

this also facilitates potential malicious use.
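
The cost argument above reduces to simple arithmetic: rented GPUs times wall-clock hours times the hourly price per GPU. The sketch below only illustrates that calculation. The function name and all input values are placeholders chosen for illustration; they are not the figures reported in [4] and not actual cloud prices.

```python
def fine_tuning_cost(num_gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    """Estimated rental cost: GPUs x wall-clock hours x hourly price per GPU."""
    if num_gpus <= 0 or hours < 0 or usd_per_gpu_hour < 0:
        raise ValueError("num_gpus must be positive; hours and price must be non-negative")
    return num_gpus * hours * usd_per_gpu_hour


# Placeholder inputs, chosen only to show the arithmetic (not taken from [4]).
estimate = fine_tuning_cost(num_gpus=8, hours=24, usd_per_gpu_hour=3.0)
print(f"Estimated cost: ${estimate:,.2f}")  # prints "Estimated cost: $576.00"
```

Substituting the processing estimates from [4] and the current on-demand price of a chosen provider gives the figure relevant to a concrete scenario.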

While it is not predetermined if and how LLMs will influence

hacking, we assume that attackers will explore possibilities,

including fully automated approaches. Given the low entry costs for

experimentation, this cannot be contained anymore.

Attacks will use LLMs; the genie is out of the bottle, and the red

queen’s race is on [7, 14]. Defenders need to be prepared for that,

and LLMs can play a significant role in this.

REFERENCES

[1] AIAAIC. 2023. AIAAIC Repository of incidents and controversies related to AI,

algorithms and automation. Retrieved April 26, 2023 from https://www.aiaaic.org/

[2] MITRE ATT&CK. 2020. Abuse Elevation Control Mechanism: Sudo and Sudo

Caching. Retrieved May 1, 2023 from https://attack.mitre.org/techniques/T1548/

003/

[3] MITRE ATT&CK. 2020. Steal or Forge Kerberos Tickets: Kerberoasting. Retrieved

May 1, 2023 from https://attack.mitre.org/techniques/T1558/003/

[4] Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von

Werra, Nazneen Rajani, and Nathan Lambert. 2023. StackLLaMA: An RL Fine-

tuned LLaMA Model for Stack Exchange Question and Answering.

https://doi.org/10.57967/hf/0513

[5] Erik Brynjolfsson. 2023. The Turing trap: The promise & peril of human-like

artificial intelligence. In Augmented Education in the Global Age. Routledge,

103–116.

[6] Erik Brynjolfsson, Danielle Li, and Lindsey Raymond. 2023. Generative AI at

Work. NBER Working Paper No. 31161. National Bureau of Economic Research

(April 2023).

[7] Vit Bukac, Vaclav Lorenc, and Vashek Matyáš. 2014. Red queen’s race: APT win-

win game. In Cambridge International Workshop on Security Protocols. Springer,

55–61.

[8] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot:

Exploring prompt engineering for solving CS1 problems using natural language.

In Proceedings of the 54th ACM Technical Symposium on Computer Science

Education V. 1. 1136–1142.

[9] The Economist. 2022. Huge foundation models are turbo-charging AI progress.

Retrieved April 25, 2023 from https://www.economist.com/interactive/briefing/

2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

[10] The Economist. 2023. Large, creative AI models will transform lives and labour

markets. Retrieved April 25, 2023 from https://www.economist.com/interactive/

science-and-technology/2023/04/22/large-creative-ai-models-will-transform-

how-we-live-and-work

[11] Georgi Gerganov. 2023. llama.cpp: Inference of LLaMA model in pure C/C++.

Retrieved April 28, 2023 from https://github.com/ggerganov/llama.cpp

[12] Significant Gravitas. 2023. Auto-GPT: An Autonomous GPT-4 Experiment.

Retrieved April 25, 2023 from https://github.com/Significant-Gravitas/Auto-GPT

[13] Andreas Happe and Jürgen Cito. 2023. Understanding Hackers’ Work: An

Empirical Study of Offensive Security Practitioners. In Proceedings of the 31st ACM Joint

European Software Engineering Conference and Symposium on the Foundations

of Software Engineering (San Francisco, USA) (ESEC/FSE 2023). Association for

Computing Machinery, New York, NY, USA, 11 pages.

[14] Richard Harang and Felipe N Ducau. 2018. Measuring the speed of the Red

Queen’s Race. BlackHat: Las Vegas, NV, USA (2018).

[15] (ISC)2. 2022. (ISC)2 CYBERSECURITY WORKFORCE STUDY 2022. Retrieved April

28, 2023 from https://www.isc2.org//-/media/ISC2/Research/2022-WorkForce-

Study/ISC2-Cybersecurity-Workforce-Study.ashx

ESEC/FSE ’23, December 3–9, 2023, San Francisco, CA, USA

[16] Sydney Lake. 2022. The cybersecurity industry is short 3.4 million

workers—that’s good news for cyber wages.

Retrieved April 28, 2023 from https://fortune.com/education/articles/the-cybersecurity-industry-is-

short-3-4-million-workers-thats-good-news-for-cyber-wages/

[17] Selena Larson and Daniel Blackford. 2021. Cobalt Strike: Favorite Tool from APT

to Crimeware. Retrieved April 28, 2023 from https://www.proofpoint.com/us/

blog/threat-insight/cobalt-strike-favorite-tool-apt-crimeware

[18] lin.security. 2018. Lin.Security: 1. Retrieved May 1, 2023 from https://www.

vulnhub.com/entry/linsecurity-1,244/

[19] Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering

text-to-image generative models. In Proceedings of the 2022 CHI Conference on

Human Factors in Computing Systems. 1–23.

[20] Nestor Maslej, Loredana Fattorini, Erik Brynjolfsson, John Etchemendy, Katrina

Ligett, Terah Lyons, James Manyika, Helen Ngo, Juan Carlos Niebles, Vanessa

Parli, Yoav Shoham, Russell Wald, Jack Clark, and Raymond Perrault. 2023. The

AI Index 2023 Annual Report. https://aiindex.stanford.edu/wp-content/uploads/

2023/04/HAI_AI-Index-Report_2023.pdf

[21] Ron Miller. 2023. Sam Altman: Size of LLMs won’t matter as much moving forward.

Retrieved April 26, 2023 from https://techcrunch.com/2023/04/14/sam-altman-

size-of-llms-wont-matter-as-much-moving-forward/

[22] Yohei Nakajima. 2023. BabyAGI. Retrieved April 25, 2023 from https://github.

com/yoheinakajima/babyagi

[23] Yohei Nakajima. 2023. Introducing Task-driven Autonomous Agent. Retrieved April

25, 2023 from https://twitter.com/yoheinakajima/status/1640934493489070080

[24] Yohei Nakajima. 2023. Task-driven Autonomous Agent Utilizing GPT-4,

Pinecone, and LangChain for Diverse Applications. Retrieved April 25, 2023

from https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-

4-pinecone-and-langchain-for-diverse-applications/

[25] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy

Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra

of Human Behavior. arXiv:2304.03442 [cs.HC]

[26] Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan

Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Your

Facts and Try Again: Improving Large Language Models with External Knowledge

and Automated Feedback. arXiv:2302.12813 [cs.CL]

[27] Carlos Polop. 2023. LinPEAS - Linux Privilege Escalation Awesome Script.

Retrieved April 28, 2023 from https://github.com/carlospolop/PEASS-ng/tree/

master/linPEAS

[28] Katyanna Quach. 2023. LLaMA drama as Meta’s mega language model leaks.

Retrieved April 26, 2023 from https://www.theregister.com/2023/03/08/meta_

llama_ai_leak/

[29] Kevin Schaul, Szu Yu Chen, and Nitasha Tiku. 2023. Inside the secret list of

websites that make AI like ChatGPT sound smart. Retrieved April 26, 2023

from https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-

learning/

[30] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting

Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in

HuggingFace. arXiv:2303.17580 [cs.CL]

[31] Cybereason Global SOC and Incident Response Team. 2023. Sliver C2 Leveraged

by Many Threat Actors. Retrieved April 28, 2023 from https://www.cybereason.

com/blog/sliver-c2-leveraged-by-many-threat-actors

[32] Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer,

Hanspeter Pfister, and Alexander M Rush. 2022. Interactive and Visual Prompt

Engineering for Ad-hoc Task Adaptation with Large Language Models. IEEE

Transactions on Visualization and Computer Graphics 29, 1 (2022), 1146–1156.

[33] Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G

Pennington, and Cody B Thomas. 2018. MITRE ATT&CK: Design and Philosophy. In

Technical report. The MITRE Corporation.

[34] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian

Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H.

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William

Fedus. 2022. Emergent Abilities of Large Language Models. arXiv:2206.07682 [cs.CL]

[35] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning

to prompt for vision-language models. International Journal of Computer Vision

130, 9 (2022), 2337–2348.