Summary: GDlog: A GPU-Accelerated Deductive Engine (arxiv.org)
11,234 words · PDF document
One Line
GDlog is a deductive engine that exploits GPU parallelism and performs competitively with modern SIMD hash tables.
Key Points
- GDlog is a GPU-accelerated deductive engine that improves the performance of deductive database engines.
- GDlog uses a novel data structure called Hash-Indexed Sorted Array (HISA) for efficient range querying and deduplication.
- GDlog achieves significant performance improvements compared to prior systems, with runtime improvements of roughly 10x on large deductive-analytic workloads.
- GDlog leverages the parallelism and high-throughput capabilities of GPUs to address scalability issues and performance challenges faced by CPU-based deductive engines.
- GDlog offers competitive performance with modern SIMD hash tables and outperforms prior work in terms of runtime and memory footprint.
- GDlog employs eager buffer management and temporarily-materialized n-ary joins as novel strategies for Datalog on the GPU.
- GDlog's evaluation demonstrates its performance as a high-throughput SIMD hash table and compares it against both CPU- and GPU-based systems on deductive-analytic queries.
- GDlog's use of the HISA data structure and novel strategies make it a promising tool for high-throughput deductive queries in various applications.
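The HISA idea behind these points can be illustrated with a minimal Python sketch (the class name, fields, and single-column key are illustrative assumptions, not the paper's CUDA implementation): tuples live in a sorted, deduplicated array, and a hash index maps each join key to the contiguous range of tuples sharing that key, so range queries cost one hash probe plus a contiguous scan.

```python
class MiniHISA:
    """Illustrative sketch of a hash-indexed sorted array: tuples are
    stored sorted and deduplicated, and a hash map points each key at
    the contiguous slice of tuples sharing that key."""

    def __init__(self, tuples):
        # Sort and deduplicate: identical tuples land adjacent in the
        # sorted order, so duplicates collapse via set() here.
        data = sorted(set(tuples))
        self.data = data
        # Hash index: key (first column) -> (start, end) range into
        # the sorted array.
        self.index = {}
        start = 0
        for i in range(1, len(data) + 1):
            if i == len(data) or data[i][0] != data[start][0]:
                self.index[data[start][0]] = (start, i)
                start = i

    def range_query(self, key):
        """All tuples whose first column equals `key`: one hash probe,
        then a scan of a contiguous sorted run."""
        lo, hi = self.index.get(key, (0, 0))
        return self.data[lo:hi]

edges = MiniHISA([(1, 2), (1, 3), (2, 3), (1, 2)])  # (1, 2) is duplicated
print(edges.range_query(1))  # -> [(1, 2), (1, 3)]
```

On a GPU the sort, deduplication, and index construction would each be a parallel pass; the sketch only shows why the layout makes range querying and deduplication cheap.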
Summaries
21 word summary
GDlog is a GPU-accelerated deductive engine that leverages GPU parallelism, performing competitively with modern SIMD hash tables while outperforming prior deductive engines.
69 word summary
GDlog is a GPU-accelerated deductive engine that significantly improves the performance of deductive database engines. It addresses scalability and performance challenges by leveraging the parallelism and high-throughput capabilities of GPUs. GDlog uses HISA as its tuple representation, enabling parallel insertion and achieving competitive performance with modern SIMD hash tables. As a GPU-accelerated solution for deductive analytics, it outperforms prior work in runtime while maintaining a more favorable memory footprint.
118 word summary
GDlog is a GPU-accelerated deductive engine that improves the performance of deductive database engines, with runtime improvements of roughly 10x on large deductive-analytic workloads. It addresses scalability and performance challenges by leveraging the parallelism and high-throughput capabilities of GPUs. GDlog uses HISA as its tuple representation, enabling parallel insertion and exploiting the massive throughput of GPUs. In evaluation, GDlog demonstrated competitive performance with modern SIMD hash tables and outperformed prior work in runtime while offering a more favorable memory footprint. Its key contribution is HISA, a data structure enabling efficient range querying and deduplication, which makes GDlog a practical GPU-accelerated solution for deductive analytics.
484 word summary
GDlog is a GPU-accelerated deductive engine that improves the performance of deductive database engines. It uses a novel data structure called HISA for efficient range querying and deduplication. GDlog achieves significant performance improvements, with runtime improvements of roughly 10x on large deductive-analytic workloads.
Traditional CPU-based deductive engines face scalability and performance challenges. GDlog addresses these challenges by leveraging the parallelism and high-throughput capabilities of GPUs. It uses HISA as its tuple representation, enabling parallel insertion and leveraging the massive throughput of GPUs.
To evaluate GDlog's performance, it was compared against CPU and GPU-based hash tables and Datalog engines. GDlog demonstrated competitive performance with modern SIMD hash tables and outperformed prior work in runtime while offering a more favorable memory footprint.
The key contributions of GDlog include the development of HISA, a data structure enabling efficient range querying and deduplication. GDlog is a CUDA-based library that allows for high-throughput deductive analytics applications on the GPU. It leverages eager buffer management and temporarily-materialized n-ary joins as novel strategies for Datalog on the GPU.
GDlog's implementation involves several steps in the semi-naive evaluation process. It maintains indices on each joined relation, executes relational algebra kernels, removes duplicates, and merges delta relations. GDlog uses a join operation that combines hash and sorted joins, making it suitable for recursive query scenarios on GPUs.
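The semi-naive steps above can be sketched in plain Python for a transitive-closure query (an illustrative sketch over Python sets; GDlog executes these steps as GPU kernels over HISA, and the function and variable names here are assumptions):

```python
def transitive_closure(edges):
    """Semi-naive evaluation of path(x, z) :- edge(x, y), path(y, z).
    Each iteration joins only the freshly derived `delta` tuples,
    deduplicates against `full`, then merges delta into full."""
    # Index edge(x, y) on its second column y, since the rule joins on y.
    by_dst = {}
    for x, y in edges:
        by_dst.setdefault(y, set()).add(x)

    full = set(edges)   # all path tuples derived so far
    delta = set(edges)  # tuples derived in the previous iteration
    while delta:
        # Join only against delta (semi-naive), never all of full.
        new = {(x, z) for (y, z) in delta for x in by_dst.get(y, ())}
        # Deduplicate against full, then merge delta into full.
        delta = new - full
        full |= delta
    return full

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The loop terminates when an iteration derives no new tuples, i.e. the fixpoint is reached.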
GDlog employs two memory-for-time trade-offs in its implementation. Eager buffer management pre-allocates a larger buffer for merging delta and full relations, saving time on buffer allocation in subsequent iterations. The separation of the delta relation population into a distinct phase allows for efficient removal of duplicated tuples.
Overall, GDlog offers a GPU-accelerated solution for deductive analytics, achieving significant performance improvements compared to prior systems. Its use of the HISA data structure and novel strategies for Datalog on the GPU make it a promising tool for high-throughput deductive queries in various applications.
GDlog is designed to improve the performance of large-scale deductive analytic queries. It utilizes techniques such as semi-naive evaluation, indexing, and range querying to achieve optimal algorithmic complexity. The engine incorporates a novel data structure called HISA, which effectively leverages the massive parallelism available on modern GPUs.
Extensive evaluation demonstrates that GDlog consistently outperforms CPU and GPU-based engines, achieving speedups of up to 10x on large-scale deductive analytic workloads. GDlog also uses less memory than specialized engines and avoids the out-of-memory errors they encounter.
GDlog's practicality for program analysis is demonstrated, delivering stable performance and significant speedup compared to CPU-based solutions. Its efficient utilization of GPU parallelism makes it a promising option for high-precision program-analysis queries.
In conclusion, GDlog is a GPU-accelerated deductive engine that leverages GPU parallelism for significant performance improvements on large-scale deductive analytic queries. Its memory management strategy, Eager Buffer Management optimization, and efficient join algorithms contribute to its superior performance. GDlog's practicality for program analysis further highlights its potential as a powerful tool for complex data analysis.
594 word summary
GDlog is a GPU-accelerated deductive engine that aims to improve the performance of deductive database engines used in various applications. It utilizes a novel data structure called the hash-indexed sorted array (HISA) for efficient range querying and deduplication. GDlog achieves significant performance improvements compared to prior systems, with runtime improvements of roughly 10x on large deductive-analytic workloads.
In traditional CPU-based deductive engines, scalability and performance challenges arise due to design limitations. GDlog addresses these challenges by leveraging the parallelism and high-throughput capabilities of GPUs. It uses HISA as its tuple representation, enabling parallel insertion and leveraging the massive throughput of GPUs.
To evaluate GDlog's performance, it was compared against CPU and GPU-based hash tables and Datalog engines. The evaluation included large-scale deductive queries such as reachability and program analysis. GDlog demonstrated competitive performance with modern SIMD hash tables and outperformed prior work in runtime while offering a more favorable memory footprint.
The key contributions of GDlog include the development of HISA, a data structure enabling efficient range querying and deduplication. GDlog is a CUDA-based library that allows for high-throughput deductive analytics applications on the GPU. It leverages eager buffer management and temporarily-materialized n-ary joins as novel strategies for Datalog on the GPU. GDlog's evaluation demonstrates its performance as a high-throughput SIMD hash table and compares it to CPU and GPU-based systems for deductive-analytic queries.
GDlog's implementation involves several steps in the semi-naive evaluation process. It maintains indices on each joined relation, executes relational algebra kernels, removes duplicates, and merges delta relations. GDlog uses a join operation that combines hash and sorted joins, making it suitable for recursive query scenarios on GPUs.
GDlog employs two memory-for-time trade-offs in its implementation. Eager buffer management pre-allocates a larger buffer for merging delta and full relations, saving time on buffer allocation in subsequent iterations. The separation of the delta relation population into a distinct phase allows for efficient removal of duplicated tuples.
Overall, GDlog offers a GPU-accelerated solution for deductive analytics, achieving significant performance improvements compared to prior systems. Its use of the HISA data structure and novel strategies for Datalog on the GPU make it a promising tool for high-throughput deductive queries in various applications.
GDlog is designed to improve the performance of large-scale deductive analytic queries. It utilizes techniques such as semi-naive evaluation, indexing, and range querying to achieve optimal algorithmic complexity. The engine incorporates a novel data structure called HISA, which effectively leverages the massive parallelism available on modern GPUs.
GDlog's memory management strategy optimizes performance by efficiently allocating buffers based on the size of relations. It introduces eager buffer management to reduce allocation overhead during tail iterations, improving performance for queries with long tail behavior. GDlog's join algorithms divide the join into sub-joins and allocate them across worker threads, aligning with the SIMD architecture of GPUs.
Extensive evaluation demonstrates that GDlog consistently outperforms CPU and GPU-based engines, achieving speedups of up to 10x on large-scale deductive analytic workloads. GDlog also uses less memory than specialized engines and avoids the out-of-memory errors they encounter.
GDlog's practicality for program analysis is demonstrated, delivering stable performance and significant speedup compared to CPU-based solutions. Its efficient utilization of GPU parallelism makes it a promising option for high-precision program-analysis queries.
In conclusion, GDlog is a GPU-accelerated deductive engine that leverages GPU parallelism for significant performance improvements on large-scale deductive analytic queries. Its memory management strategy, Eager Buffer Management optimization, and efficient join algorithms contribute to its superior performance. GDlog's practicality for program analysis further highlights its potential as a powerful tool for complex data analysis.
1045 word summary
GDlog is a GPU-accelerated deductive engine that aims to improve the performance of modern deductive database engines. These engines are used in various applications such as program analysis, social media mining, and business analytics. GDlog is built upon a novel data structure called the hash-indexed sorted array (HISA), which allows for efficient range querying and deduplication. The engine achieves significant performance improvements compared to prior systems, with runtime improvements of roughly 10x on large deductive-analytic workloads.
In traditional CPU-based deductive engines, recursive queries are evaluated using incrementalized (semi-naive) evaluation and nested loop joins over in-memory tables. However, these engines face scalability issues and performance challenges due to their design limitations. GDlog addresses these challenges by leveraging the parallelism and high-throughput capabilities of GPUs. It uses HISA as its tuple representation, which enables parallel insertion and leverages the massive throughput of GPUs.
To evaluate the performance of GDlog, the engine was compared against both CPU and GPU-based hash tables and Datalog engines. It was used to support a range of large-scale deductive queries, including reachability, same generation, and context-sensitive program analysis. The evaluation showed that GDlog achieves competitive performance with modern SIMD hash tables and outperforms prior work by a significant factor in runtime while offering a more favorable memory footprint.
The key contributions of GDlog include the development of the Hash-Indexed Sorted Array (HISA), a data structure that enables efficient range querying and deduplication. GDlog is a CUDA-based library that allows for high-throughput deductive analytics applications on the GPU. It leverages two novel strategies for Datalog on the GPU: eager buffer management and temporarily-materialized n-ary joins. The evaluation of GDlog demonstrates its performance as a high-throughput SIMD hash table and compares it to both CPU and GPU-based systems for deductive-analytic queries.
The implementation of GDlog involves several steps in the semi-naive evaluation process. These steps include maintaining indices on each joined relation, executing relational algebra kernels on the delta relation, removing duplicates in the new relation, and merging the delta of every relation into its full relation. GDlog uses a join operation that combines the benefits of hash and sorted joins, making it suitable for recursive query scenarios on GPUs. The join process involves serializing the outer relation and partitioning it into chunks, querying the inner relation's indexing hash table, and performing a scan of the sorted data array to generate join result tuples.
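The probe-then-scan join described above can be sketched as follows (an illustrative Python sketch, not GDlog's CUDA kernel; on the GPU the outer loop would be split into chunks across SIMD workers):

```python
def hisa_join(outer, inner_sorted, inner_index):
    """Join outer(x, y) with inner(y, z) on y, producing (x, z).
    `inner_sorted` is an array sorted on its first column, and
    `inner_index` maps a join key to its (start, end) range in it."""
    out = []
    for x, y in outer:
        # One hash probe into the inner relation's index...
        lo, hi = inner_index.get(y, (0, 0))
        # ...then a scan of a contiguous run of the sorted data array.
        for _, z in inner_sorted[lo:hi]:
            out.append((x, z))
    return out

inner = [(2, 5), (2, 6), (3, 7)]   # sorted on the join column
index = {2: (0, 2), 3: (2, 3)}     # key -> (start, end) range
outer = [(1, 2), (4, 3)]
print(hisa_join(outer, inner, index))  # -> [(1, 5), (1, 6), (4, 7)]
```

The contiguous scan is what the sorted layout buys: all matches for a key sit next to each other, so no further hashing or searching is needed once the range is known.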
GDlog also employs two memory-for-time trade-offs in its implementation. The first trade-off is eager buffer management, which involves pre-allocating a larger buffer for merging delta and full relations to save time on buffer allocation in subsequent iterations. The second trade-off is the separation of the delta relation population into a distinct phase, allowing for efficient removal of duplicated tuples.
Overall, GDlog offers a GPU-accelerated solution for deductive analytics, achieving significant performance improvements compared to prior systems. Its use of the HISA data structure and novel strategies for Datalog on the GPU make it a promising tool for high-throughput deductive queries in various applications.
GDlog is a GPU-accelerated deductive engine designed to improve the performance of large-scale deductive analytic queries. It utilizes techniques such as semi-naive evaluation, indexing, and range querying to achieve optimal algorithmic complexity. The engine incorporates a novel data structure called HISA, which effectively leverages the massive parallelism available on modern GPUs.
One important aspect of GDlog is its memory management strategy. Before performing fixpoint checking, GDlog merges tuples from the delta relation into the full relation and removes tuples from the new relation. Allocating buffers efficiently is crucial for performance, so GDlog uses a memory management algorithm that determines the required buffer size from the sizes of the full and delta relations. If the existing buffer is too small, GDlog allocates a new buffer sized to the sum of the full and delta relations; if that size exceeds the available GPU memory, the algorithm gradually shrinks the buffer until it fits. This proactive approach optimizes memory allocation for efficient evaluation.
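That sizing policy amounts to roughly the following (a sketch under assumed names; the shrink factor and function signature are illustrative guesses, not GDlog's code):

```python
def plan_merge_buffer(full_size, delta_size, current_capacity,
                      free_gpu_memory, shrink_factor=0.9):
    """Pick a buffer capacity for merging delta into full.
    Illustrative sketch: names and the 0.9 shrink factor are assumed."""
    needed = full_size + delta_size
    # Existing buffer already big enough: reuse it, no allocation cost.
    if current_capacity >= needed:
        return current_capacity
    # Otherwise propose a buffer sized to full + delta...
    proposed = needed
    # ...and shrink gradually until it fits in available GPU memory.
    while proposed > free_gpu_memory:
        proposed = int(proposed * shrink_factor)
    return proposed
```

For example, with a full relation of 100 tuples, a delta of 20, and an existing 150-tuple buffer, the buffer is reused; with only a 50-tuple buffer, a new 120-tuple buffer is proposed.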
In scenarios where the delta relation contains numerous tuples, the size of the delta relation tends to gradually decrease in the last few iterations towards the final fixpoint. During these tail iterations, creating buffers becomes a costly operation because the full relation contains a considerable number of tuples. To address this issue, GDlog introduces Eager Buffer Management, which reduces buffer allocation overhead during tail iterations. This optimization is particularly effective for queries characterized by long tail behavior, such as network analysis tasks. However, for queries with a short tail, it is advisable to disable this optimization to avoid unnecessary memory overhead.
In terms of parallel evaluation, GDlog employs two methods for dividing the join into sub-joins and allocating them across worker threads. The first approach partitions the join based on tuples in the outermost relation, while the second approach involves using a temporary materialized buffer in joins. The second approach is more suitable for GPU-based systems, as it aligns with the SIMD architecture of GPUs and eliminates idle threads caused by conditional branching. However, materialized temporary joins require extra memory space, creating a trade-off between space and time.
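The temporary materialized buffer maps naturally onto the common two-pass GPU join pattern: a first pass counts each tuple's output, a prefix sum assigns disjoint write offsets, and a second pass fills a buffer allocated at exactly the right size, so no worker branches on where to write. A hedged Python sketch of that pattern (not GDlog's kernel; the helper name and index shape are assumptions):

```python
from itertools import accumulate

def two_pass_join(outer, inner_index):
    """Two-pass (count, then fill) join into a pre-sized buffer,
    mimicking how a materialized GPU join sizes its output so every
    SIMD worker writes to its own disjoint slice. `inner_index` maps
    a join key to the list of matching inner values."""
    # Pass 1: each outer tuple counts its matches.
    counts = [len(inner_index.get(y, ())) for _, y in outer]
    # Exclusive prefix sum gives each tuple its write offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Allocate the materialized result buffer once, at exact size.
    out = [None] * sum(counts)
    # Pass 2: fill; each tuple writes only its own slice, so the
    # writes are branch-free with respect to output position.
    for (x, y), off in zip(outer, offsets):
        for j, z in enumerate(inner_index.get(y, ())):
            out[off + j] = (x, z)
    return out

print(two_pass_join([(1, 2), (4, 3)], {2: [5, 6], 3: [7]}))
# -> [(1, 5), (1, 6), (4, 7)]
```

The extra memory for the materialized buffer is the space half of the space-time trade-off the summary describes; the time half is that no worker sits idle waiting on a divergent branch.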
GDlog has been extensively evaluated against existing CPU and GPU-based deductive engines on well-established Datalog queries and real-world datasets. The results show that GDlog consistently outperforms other engines, with speedups of up to 10x over CPU-based engines on large-scale deductive analytic workloads. It also outperforms GPUJoin, a specialized engine for reachability queries, using less memory and avoiding out-of-memory errors.
The practicality of GDlog for program analysis is also demonstrated: GDlog delivers stable performance and significant speedup over CPU-based solutions on context-sensitive program-analysis queries. Its efficient utilization of GPU parallelism makes it a promising option for high-precision program-analysis queries on large open-source projects.
In conclusion, GDlog is a GPU-accelerated deductive engine that leverages GPU parallelism to achieve significant performance improvements for large-scale deductive analytic queries. Its memory management strategy, Eager Buffer Management optimization, and efficient join algorithms contribute to its superior performance compared to existing CPU and GPU-based engines. GDlog's practicality for program analysis tasks further highlights its potential as a powerful tool for complex data analysis.