Summary: DarkBERT Language Model for Dark Web (arxiv.org)
10,738 words - PDF document
One Line
DarkBERT is a language model pretrained on Dark Web text that outperforms other models at detecting illegal activity on the Dark Web, making it useful for law enforcement agencies and cybersecurity researchers.
Key Points
- DarkBERT is a language model designed for the Dark Web to understand the unique language patterns of illegal online activities.
- The model outperforms other language models on Dark-Web-specific cybersecurity use cases such as ransomware leak site detection, noteworthy forum thread detection, and threat keyword inference.
- For the Dark Web activity classification benchmark, the model was evaluated on two datasets, DUTA and CoDA, which were preprocessed to remove empty pages and categories with low page counts.
- The model can be used for cybersecurity and CTI applications on the Dark Web, including ransomware leak site detection and identifying threats via a fill-mask approach that captures semantically related keywords (a minimal sketch follows this list).
- DarkBERT was evaluated on the cased variant of the CoDA dataset using confusion matrices and compared against cased BERT and RoBERTa models.
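As a rough illustration of the fill-mask approach mentioned in the key points, the sketch below queries a Hugging Face fill-mask pipeline for tokens that complete a masked sentence. Since DarkBERT itself is released only for academic research, roberta-base is used here as a stand-in model, and the query sentence is invented for illustration.

```python
# Minimal sketch of fill-mask keyword inference. "roberta-base" is a placeholder;
# DarkBERT itself is only available for academic research.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# Hypothetical query: a Dark-Web-aware model would fill the masked slot with
# semantically related keywords (e.g. drug or threat terms).
query = "The vendor ships high quality <mask> with stealth packaging."

for candidate in fill(query, top_k=10):
    print(f"{candidate['token_str'].strip():>15}  score={candidate['score']:.3f}")
```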
Summaries
234 word summary
DarkBERT is a language model pretrained for the Dark Web on a corpus of 5.43 million pages, with low-information-density pages excluded. It uses 10 predefined categories from CoDA to categorize pages and preprocesses text by masking IP addresses, URLs, and cryptocurrency addresses. The model can identify legal and illegal activities and outperforms existing language models. It was evaluated on several datasets, achieving high performance, and can help law enforcement agencies and cybersecurity researchers monitor and detect illegal activities on the Dark Web. The study evaluates precision at k for keyword sets related to drugs, and DarkBERT CoDA outperforms BERT Reddit for k ranging from 10 to 20. DarkBERT combines unsupervised pretraining and supervised fine-tuning to capture the unique language patterns of illegal online activities. It outperforms other language models in cybersecurity use cases such as ransomware leak site detection, noteworthy Dark Web forum thread detection, and threat keyword inference. The model can be used for cybersecurity and CTI applications on the Dark Web and has potential applications for law enforcement agencies to monitor criminal activity. The construction process involved data collection, filtering, and text preprocessing, and addressed ethical considerations such as removing sensitive information. For the activity classification benchmark, the model was evaluated on two datasets, DUTA and CoDA, which were preprocessed to remove empty pages and categories with low page counts.
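As a rough illustration of the low-information-density filtering mentioned above, the sketch below drops pages whose character counts fall outside an assumed range; the thresholds are placeholders, not the values used in the paper.

```python
def filter_pages(pages, min_chars=500, max_chars=200_000):
    """Keep pages whose character counts fall inside an assumed 'useful' range.

    `pages` is an iterable of page texts; nearly empty pages (low information
    density) and abnormally long ones are dropped. Thresholds are illustrative.
    """
    return [page for page in pages if min_chars <= len(page) <= max_chars]
```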
744 word summary
DarkBERT is a language model specifically designed for the Dark Web, using a combination of unsupervised and supervised learning techniques to understand the unique language patterns of illegal online activities. It has potential applications for law enforcement agencies to monitor criminal activity on the Dark Web. The linguistic differences between the Surface Web and the Dark Web are explored, and DarkBERT is shown to be capable of representing the language used in the Dark Web domain. The model outperforms other language models on cybersecurity-related use cases such as ransomware leak site detection, noteworthy Dark Web forum thread detection, and threat keyword inference. The construction process involved data collection, filtering, and text preprocessing, and addressed ethical considerations such as removing sensitive information. Two variations of the text corpus were used for pretraining: raw and preprocessed. For the classification benchmark, the model was evaluated on two datasets, DUTA and CoDA, which were preprocessed to remove empty pages and categories with low page counts. DarkBERT was evaluated for its ability to classify Dark Web activities, showing high similarity of pages in categories such as drugs, electronics, and gambling. It outperformed the other models on both datasets, and its cased and uncased variants performed similarly. The pretrained language model can be used for cybersecurity and CTI applications on the Dark Web, including ransomware leak site detection, where its performance was found to be better than that of other language models. The model was also used to detect noteworthy threads on the Dark Web by analyzing activity in hacking forums, outperforming other language models on this task. It is effective in identifying threats on the Dark Web using a fill-mask approach that captures semantically related keywords. The dataset used for training includes activities targeting popular software or organizations, sharing sensitive or private information, and distributing critical malware or vulnerabilities. The paper also compares the ability of three language models to produce keyword sets related to drugs on the Dark Web. The study evaluates precision at k (P@k) for k ranging from 10 to 50 using ground truth data provided by Zhu et al. (2021). DarkBERT CoDA outperforms BERT Reddit in precision at k for k ranging from 10 to 20, but is overtaken for higher values of k. DarkBERT suggests more specific words related to drugs than BERT. DarkBERT is available only for academic research purposes. It outperforms existing language models and can be used for tasks involving sensitive information on the Dark Web. The model can identify legal and illegal activities and was evaluated on several datasets, achieving high performance. The authors suggest that their model can be useful for law enforcement agencies and cybersecurity researchers to monitor and detect illegal activities on the Dark Web. The pretraining corpus was built by filtering out pages with low information density and excluding sensitive information to comply with ethical guidelines. The model uses 10 predefined categories from CoDA to categorize pages on the Dark Web and excludes the "Others" category.
Category balancing was addressed to avoid bias towards certain activities, and per-page character count statistics were measured to remove pages with low information density. The final pretraining corpus consisted of 5.43 million pages. The model preprocesses text by masking IP addresses, URLs, and cryptocurrency addresses. Lengthy words (over 100 characters) are masked with an identifier token, and email addresses are masked during text preprocessing. The model removes non-ASCII characters and characters uncommon in contemporary English to reduce noise during tokenization. DarkBERT can identify phrases specific to the Dark Web and correctly classify pages that contain them. The model masks Bitcoin, Ethereum, and Litecoin addresses, as these three cryptocurrencies are among the most popular on the Dark Web. DarkBERT was evaluated on the cased CoDA dataset using confusion matrices and compared to cased BERT and RoBERTa models. Hyperparameters for ransomware leak site detection and noteworthy thread detection are listed in Table 12. The classification pipeline used k-fold cross-validation (k=5) and fully-connected classification layers with an early stopping strategy. Evaluation was performed on raw and preprocessed inputs. Repeated k-fold validation (k=5) was used for each model due to the limited dataset size; noteworthy thread detection also used this method. A leak site page sample can be seen in Figure 7.
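A minimal sketch, assuming the Hugging Face transformers and PyTorch stack, of the kind of classification pipeline described above: fully-connected layers on top of the encoder's first-token ([CLS]) representation. The base model name, layer sizes, and dropout value are placeholders, not the paper's configuration.

```python
import torch.nn as nn
from transformers import AutoModel

class PageClassifier(nn.Module):
    """Encoder with fully-connected classification layers over the first ([CLS]) token."""
    def __init__(self, base_model: str = "roberta-base", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, num_labels))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # representation of the first ([CLS]/<s>) token
        return self.head(cls)               # class logits

# In the paper's setup this kind of classifier is trained with (repeated) 5-fold
# cross-validation and an early stopping strategy; those loops are omitted here.
```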
2104 word summary
DarkBERT was evaluated on the cased CoDA dataset using confusion matrices and compared to cased BERT and RoBERTa models. The hyperparameters used in ransomware leak site detection and noteworthy thread detection can be found in Table 12. The same classification pipeline as activity classification was used, with k-fold cross-validation (k=5) and fully-connected classification layers on top of the [CLS] token. An early stopping strategy was utilized to avoid overfitting. The evaluation was performed on both raw and preprocessed inputs. An example data sample used for this task can be seen in Figure 6, along with additional details on the results. Due to the limited size of the dataset, repeated k-fold validation (k=5) was used for each model, and variations in performance per run were averaged. Noteworthy thread detection also adopted this method. A leak site page sample in the dataset can be seen in Figure 7. DarkBERT can correctly classify pages containing activity-specific terms. Unlike BERT and RoBERTa, it can identify phrases specific to the Dark Web and correctly classify pages that contain them; most pages misclassified by BERT and RoBERTa contain domain-specific jargon that DarkBERT handles correctly. DarkBERT was trained on a machine with four NVIDIA A100 80GB GPUs, and pretraining took about 15 days. The model masks Bitcoin, Ethereum, and Litecoin addresses, as these three cryptocurrencies are among the most popular on the Dark Web. While cryptocurrencies are secure by design and provide pseudonymity, they have been involved in illegal underground operations on the Dark Web. On the Dark Web, cryptocurrency addresses and non-standard characters are common, so the model removes non-ASCII characters and characters uncommon in contemporary English to reduce noise during tokenization. Lengthy words, such as cryptocurrency addresses, are masked to prevent misidentification, and words with hash-like values are also classified as lengthy and masked. The pretraining corpus has a distinctive word-length distribution, and manual inspection reveals that certain longer word lengths appear unusually often. Lengthy words (over 100 characters) are masked with an identifier mask token, and executable content is removed from the text. Text is further preprocessed by masking IP addresses, URLs, and cryptocurrency addresses; file names are not processed separately. Two identifier types are masked for URLs: onion domain addresses and non-onion domain addresses. All email addresses are masked, since some may include strings that can be traced to a single individual. The implementation applies these identifier masks while preprocessing the pretraining corpus. Categories such as pornography make up a large fraction of all pages on the Dark Web, so deduplication and category balancing were performed to reduce the data size. Categories such as gambling and arms/weapons had deduplication rates of less than 10%. The pretraining corpus statistics after applying deduplication and category balancing are given in Table 9.
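The identifier masking described above can be approximated with regular expressions. A minimal sketch follows; the patterns, mask tokens, and the treatment of lengthy words are illustrative assumptions, not the authors' exact rules.

```python
import re

# Illustrative mask tokens and patterns; the paper's exact identifier tokens may differ.
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),                  # IPv4 addresses
    (re.compile(r"\bhttps?://\S+|\b[\w-]+\.onion\S*", re.I), "[URL]"),     # onion / non-onion URLs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),               # email addresses
    (re.compile(r"\b(?:bc1|[13])[a-zA-HJ-NP-Z0-9]{25,39}\b"), "[BTC]"),    # Bitcoin addresses
    (re.compile(r"\b0x[a-fA-F0-9]{40}\b"), "[ETH]"),                       # Ethereum (Litecoin similar)
]

def preprocess(text: str, max_word_len: int = 100) -> str:
    # Drop non-ASCII characters to reduce tokenization noise.
    text = text.encode("ascii", errors="ignore").decode()
    # Replace identifier-like strings with mask tokens.
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    # Mask lengthy words (e.g. hashes, leftover addresses) with an identifier token.
    words = [w if len(w) <= max_word_len else "[LONG]" for w in text.split()]
    return " ".join(words)
```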
DarkBERT was developed for the Dark Web using a pretraining corpus that was filtered by character count to exclude pages with low information density. The model uses 10 predefined categories from CoDA to categorize pages on the Dark Web, and excludes the "Others" category due to misclassification errors. The classification baselines were implemented by fine-tuning the bert-base-uncased model from the Hugging Face library on the CoDA Dark Web text corpus. Category balancing was addressed to avoid bias towards certain activities, and per-page character count statistics were measured to remove pages with low information density. The final pretraining corpus consisted of 5.43 million pages. The corpus was built by filtering out pages with low information density and excluding sensitive information to comply with ethical guidelines. The filtering targets pages with abnormally high or low character counts, since such pages are less useful in representing the Dark Web. The document includes additional details on data filtering and references to related research. DarkBERT is a pretrained language model that can identify and classify illegal activities on the Dark Web, such as drug trafficking, money laundering, and human trafficking. The model was trained on a large corpus of Dark Web data and uses a deep bidirectional transformer architecture similar to BERT. It was evaluated on several datasets and achieved high performance in identifying illegal activities. The authors also compared their model to other state-of-the-art models and found that it outperformed them in most cases, and they suggest that it can be useful for law enforcement agencies and cybersecurity researchers to monitor and detect illegal activities on the Dark Web. DarkBERT can identify legal and illegal activity; related Dark Web classification work, including the DUTA dataset, was developed by Mhd Wesam Al-Nabki, Eduardo Fidalgo, and Enrique Alegre. DarkBERT requires task-specific data to fine-tune the model for specific tasks such as ransomware leak site detection and thread detection. The pretraining corpus for DarkBERT is primarily in English, making it limited for non-English tasks; the authors suggest building a multilingual language model for the Dark Web domain. Publicly available Dark Web datasets are limited, and additional work may be needed for tasks that do not have readily available datasets. DarkBERT is available only for academic research purposes. The model has been trained on Dark Web datasets and is sensitive to ethical considerations. The preprocessed version of DarkBERT will be released during the conference, and both DUTA and CoDA are available upon request. The model has been tested on fill-mask and synonym inference tasks, and sensitive information was masked to avoid any malpractice. The automated web crawler takes care not to expose the researchers to any sensitive media. DarkBERT outperforms existing language models and can be used for tasks related to sensitive information on the Dark Web. The paper also compares three language model variants: DarkBERT CoDA, BERT CoDA, and BERT Reddit, evaluating how well each produces keyword sets semantically related to drugs on the Dark Web. The evaluation uses precision at k (P@k), where k ranges from 10 to 50.
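For reference, precision at k can be computed as below; the keyword list and ground-truth set in the example are placeholders, not data from the paper.

```python
def precision_at_k(ranked_keywords, ground_truth, k):
    """P@k: fraction of the top-k suggested keywords found in the ground-truth set."""
    top_k = ranked_keywords[:k]
    hits = sum(1 for word in top_k if word.lower() in ground_truth)
    return hits / k

# Example with placeholder data (not from the paper):
suggestions = ["mdma", "pills", "xtc", "stuff", "molly"]
truth = {"mdma", "xtc", "molly", "ecstasy"}
print(precision_at_k(suggestions, truth, k=5))  # 0.6
```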
The ground truth data used in the study come from a sample dataset provided by Zhu et al. (2021), composed of ground truth data (i.e., drug names and their euphemisms) and sentences containing the drug names. The study shows that DarkBERT CoDA outperforms BERT Reddit in precision at k for k ranging from 10 to 20, but is overtaken for higher values of k. It also shows that DarkBERT CoDA suggests more specific words related to drugs, while BERT suggests general words. In addition, the study provides a sample drug sales page from the Dark Web in which a user advertises a Dutch MDMA pill with a Philipp Plein logo. DarkBERT is effective in identifying threats on the Dark Web using a fill-mask approach that captures semantically related keywords. Performance on noteworthy thread detection is affected by the subjective nature of judging how noteworthy a thread is: DarkBERT scores higher than other language models at detecting noteworthy threads, but thread detection remains a challenging task. The dataset used for this task includes 249 positive and 1,624 negative threads, and annotators achieve substantial agreement in selecting noteworthy threads. The study focuses on activities targeting popular software or organizations, sharing sensitive or private information, and distributing critical malware or vulnerabilities. Noteworthy thread detection aims to find threads in Dark Web hacking forums describing activities that can potentially cause damage to victims; to create the dataset, two researchers were recruited from the cybersecurity industry to annotate threads. The detection of noteworthy threads is a highly subjective task, and DarkBERT outperforms other language models on it. DarkBERT uses RoBERTa as a base model and performs better with preprocessed input data than with raw input data. The training data consists of 105 positive and 679 negative examples, and the model is trained using 5-fold cross-validation; the training data only includes Dark Web pages classified as Cryptocurrency, Financial, and Others. DarkBERT can also be used for cybersecurity and CTI applications on the Dark Web. One use case is ransomware leak site detection, where the model identifies whether a given page is a leak site or not, and its performance was compared to other language models such as BERT and RoBERTa. Leak sites are mostly classified under categories like Pornography and Gambling, and pages with content similar to that of leak sites were selected to create negative data for training the model. The model's effectiveness was demonstrated in various experiments, and its performance was found to be better than that of other language models. DarkBERT was also evaluated for its ability to classify Dark Web activities. It showed high similarity of pages in categories such as drugs, electronics, and gambling, but some categories had varying classification accuracy. The model was compared to other language models and performed relatively well. The evaluation was conducted on two datasets, DUTA and CoDA, with two variants each: cased and uncased. DarkBERT outperformed the other models on both datasets. The experiment also tested the effect of letter case on classification performance and found that an uncased model performed similarly to a cased model.
DarkBERT was also evaluated for text classification on the Dark Web using two datasets, DUTA and CoDA, which were preprocessed to remove empty pages and categories with low page counts. The distribution of various activities on the Dark Web was studied, and a benchmark experiment was conducted to evaluate the model's performance. The pretraining text corpus for DarkBERT was fed to a RoBERTa initialization, and the training losses of the resulting models were compared: two versions of DarkBERT were built, one from raw text data and one from preprocessed text. The construction process involved data collection, filtering, and text preprocessing, and the pretraining corpus was filtered and preprocessed to address ethical considerations such as removing sensitive information. The model was trained on the English texts collected from the Dark Web, with two variations of the corpus used for pretraining (raw and preprocessed), and the pretraining process took approximately 15 days. A domain-specific pretrained language model like DarkBERT may effectively reduce performance issues on Dark Web tasks. The article presents DarkBERT, a language model pretrained on a Dark Web corpus, which outperforms other pretrained language models on Dark-Web-specific cybersecurity tasks. It provides new datasets and potential use cases for DarkBERT and demonstrates its effectiveness on threat keyword inference, Dark Web forum thread detection, and ransomware leak site detection. The linguistic differences between the Surface Web and the Dark Web are explored, and DarkBERT is shown to be capable of representing the language used in the Dark Web domain. The article compares DarkBERT to other widely used pretrained language models, illustrates the DarkBERT pretraining process, and concludes that DarkBERT would prove valuable in ongoing efforts to handle cyber threats in the Dark Web domain. DarkBERT was developed specifically for the Dark Web because of its linguistic differences from the Surface Web. The use of natural language processing (NLP) techniques has become an integral part of cybersecurity and cyber threat intelligence (CTI) research. The Dark Web is a valuable resource for CTI research, but it requires specialized models to handle the extreme lexical and structural diversity of the data. DarkBERT outperforms other language models in specific use cases and offers valuable insights to researchers. The model was trained on Dark Web data, and the paper details the steps taken to filter and compile that text. DarkBERT is specifically trained to understand the language used in illegal online activities such as drug trafficking, weapons sales, and human trafficking, combining unsupervised pretraining and supervised fine-tuning to capture the unique language patterns of the Dark Web. It has the potential to be used by law enforcement agencies to monitor and detect criminal activity on the Dark Web.
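To make the pretraining setup concrete, here is a minimal sketch of continued masked-language-model pretraining on a Dark Web text corpus using the Hugging Face Trainer, starting from a RoBERTa initialization as the summary describes. The corpus file name and all hyperparameters are placeholders, not the authors' settings.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")   # RoBERTa initialization

# Hypothetical corpus file: one preprocessed Dark Web page per line.
corpus = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Standard 15% token masking for the masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="darkbert-sketch",
    per_device_train_batch_size=16,   # placeholder; the paper used four A100 80GB GPUs
    num_train_epochs=3,               # placeholder
    learning_rate=1e-4,               # placeholder
)

Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```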