Summary of Pathfinding Future PIM Architectures Demystifying Commercial PIM Technology

Summary Pathfinding Future PIM Architectures Demystifying Commercial PIM Technology arxiv.org

13,427 words - PDF document - View PDF document

One Line

The document explores the use of PIM architectures, focusing on the UPMEM-PIM programming model that facilitates data sharing and synchronization between threads within the same system.

Slides

Slide Presentation (10 slides)

Copy slides outline Copy embed code Download as Word

Pathfinding Future PIM Architectures Demystifying Commercial PIM Technology

Source: arxiv.org - PDF - 13,427 words - view

PIM Architectures: From Research to Implementation

• PIM architectures have been researched for decades but face implementation challenges

• UPMEM-PIM follows the SPMD paradigm for data sharing among threads within the same DPU

Visual: Comparison of traditional architecture vs. PIM architecture

UPMEM-PIM Programming Model: Enabling Data Sharing

• UPMEM-PIM allows for data sharing and synchronization among threads in the same DPU

• Threads in different DPUs cannot directly share data or synchronize

Visual: UPMEM-PIM programming model diagram

PIMulator: Simulating UPMEM ISA-Compatible Instructions

• PIMulator is a simulation framework for executing UPMEM ISA-compatible instructions

• It includes a compiler toolchain and a hardware performance simulator

Visual: PIMulator framework diagram

Performance Analysis of PrIM Benchmarks

• PrIM benchmarks exhibit compute-bound behavior

• Thread-level parallelism fluctuates during execution

Visual: Performance analysis graph

Optimizing the PIM Memory System

• The memory system of PIM architectures needs optimization for improved performance

• Vector execution and addressing limitations need to be addressed

Visual: Comparison of baseline vs. optimized PIM memory system

Challenges in Multi-Tenancy with Current PIM Chips

• Current commercial PIM chips face limitations in multi-tenancy

• Co-located workloads cannot securely execute without interference

Visual: Multi-tenancy limitations diagram

Cache-Centric PIM Architecture: Performance Benefits

• Cache-centric PIM architecture offers performance benefits over scratchpad-centric design

• Allocation of input data in WRAM improves performance

Visual: Cache-centric PIM architecture diagram

References: Intelligent RAM, Computation Models, and Accelerators

• Various research papers and conference proceedings cover intelligent RAM, computation models, and PIM accelerators

• Topics include practical challenges and CNN inference accelerators

Visual: Collage of research paper covers

Pathfinding Future PIM Architectures

• PIM architectures have potential but face implementation challenges

• Optimizing the memory system and addressing multi-tenancy limitations are crucial

• Remember the importance of PIM in accelerating computation and improving performance

Key Points

PIM (Processing-in-memory) architectures have been researched for decades but have not been widely implemented due to various limitations.
UPMEM-PIM follows the single-program multiple-data (SPMD) paradigm, allowing for data sharing among threads executing within the same DPU.
PIMulator is a simulation framework that supports the execution-driven simulation of UPMEM ISA-compatible instructions.
The performance of PrIM benchmarks showed compute-bound behavior and fluctuating thread-level parallelism during execution.
Optimizing the PIM memory system is crucial for future SIM architectures.
Current commercial PIM chips face limitations in multi-tenancy, preventing secure execution of co-located workloads.
Cache-centric PIM architecture offers performance benefits compared to scratchpad-centric design.
Various research papers and conference proceedings cover topics such as intelligent RAM, computation models, practical challenges, and accelerators for PIM technology.

Summaries

28 word summary

This document examines the implementation of PIM architectures in computer systems, specifically the UPMEM-PIM programming model which enables data sharing and synchronization among threads within the same D.

34 word summary

The document explores the implementation of Processing-in-memory (PIM) architectures in computer systems. The UPMEM-PIM programming model follows the single-program multiple-data (SPMD) paradigm, allowing for data sharing and synchronization among threads within the same D

596 word summary

This summary will highlight the key points from the excerpted text.

The document discusses the exploration of Processing-in-memory (PIM) architectures in computer systems. While PIM has been researched for decades, it has not been widely implemented due to high

The UPMEM-PIM programming model follows the single-program multiple-data (SPMD) paradigm. Programmers write a single program that is executed by all software threads (tasklets), but each thread can have its own control flow and access different parts

UPMEM's scratchpad-centric programming model allows for data sharing and synchronization among threads executing within the same DPU. However, threads executing in different DPUs cannot directly share data or synchronize with each other. To enable data sharing or synchronization across different DP

PIMulator is a simulation framework that supports the execution-driven simulation of UPMEM ISA-compatible, machine-level instructions. It consists of a compiler toolchain and a hardware performance simulator. The compiler toolchain utilizes UPMEM SDK's preprocessor and

The document discusses the Pathfinding Future PIM Architectures and aims to demystify commercial PIM technology. It mentions various datasets, elements, and queries used in the study. The document also discusses the utilization of arbitrary locations in the memory address space

PrIM's compute utilization and memory read bandwidth utilization are shown in Figure 5. The maximum DRAM bandwidth in a real UPMEM-PIM system is around 600 MB/sec. PrIM targets data-intensive workloads and shifts the performance bottleneck

The runtime performance of PrIM benchmarks was analyzed, showing that most benchmarks have a compute-bound behavior. The number of issuable threads and the instruction mix were also examined. PIMulator was used to analyze how thread-level parallelism fluctuates during execution

The baseline UPMEM-PIM architecture employs a scalar processor with thread-level parallelism. However, its memory system does not meet the requirements for vector execution, resulting in limited speedup. Optimizing the PIM memory system is crucial for future SIM

Current commercial PIM chips, including UPMEM-PIM, are not able to meet the requirements of multi-tenancy due to limitations in hardware/software and programming models. Co-located workloads cannot securely execute without interfering with each other, and

The performance overhead of adding address translations to a scratchpad-centric PIM architecture is low, with an average loss of 0.8% and a maximum loss of 14.1%. This is due to the scratchpad-centric memory model's high

This excerpt discusses the performance and benefits of a cache-centric PIM architecture compared to a scratchpad-centric design. The PIMulator framework is introduced, which emulates the cache-centric UPMEM-PIM by directly allocating input data in the WRAM

This summary provides a list of references to various research papers and conference proceedings related to Processing-in-Memory (PIM) technology. These references cover topics such as intelligent RAM, computation models for intelligent memory, the architecture of PIM chips, practical challenges

This excerpt includes a list of references to various research papers and articles related to the topic of future PIM architectures. Some key points include the use of in-memory processing in HBM2-PIM and LPDDR5-PIM technologies, the development of

This text excerpt contains a list of references to various research papers, conference proceedings, and technical documents related to Processing-in-Memory (PIM) architectures and accelerators. The references cover a range of topics including DRAM-based accelerators for CNN inference

This excerpt includes a list of references to various research papers and documents related to PIM (Processing-In-Memory) architectures and technologies. The references cover a range of topics including GPU simulation, memory access scheduling, software developer's manuals, benchmarks, managing

This excerpt is a list of references to various papers and articles related to processing-in-memory (PIM) architectures. The references include papers that discuss topics such as efficient synchronization support for near-data-processing architectures, resource management of latency-critical applications in clouds

Raw indexed text (88,967 chars / 13,427 words / 2,095 lines)

Pathfinding Future PIM Architectures by

Demystifying a Commercial PIM Technology

Bongjoon Hyun

Taehun Kim

Dongjae Lee

Minsoo Rhu

KAIST

{bongjoon.hyun, taehun.kim, dongjae.lee, mrhu}@kaist.ac.kr

Abstract—Processing-in-memory (PIM) has been explored for

decades by computer architects, yet it has never seen the light of

day in real-world products due to their high design overheads and

lack of a killer application. With the advent of critical memory-

intensive workloads, several commercial PIM technologies have

been introduced to the market ranging from domain-specific PIM

architectures to more general-purpose PIM architectures. In this

work, we deepdive into UPMEM’s commercial PIM technology, a

general-purpose PIM-enabled parallel architecture that is highly

programmable. Our first key contribution is the development

of a flexible simulation framework for PIM. The simulator we

developed (aka PIMulator) enables the compilation of UPMEM-

PIM source codes into its compiled machine-level instructions,

which are subsequently consumed by our cycle-level performance

simulator. Using PIMulator, we demystify UPMEM’s PIM design

through a detailed characterization study. Building on top of our

characterization, we conduct a series of case studies to pathfind

important architectural features that we deem will be critical

for future PIM architectures to support.

I. I NTRODUCTION

“We’ve investigated applying PIM to our workloads

and determined there are several challenges to using

these approaches. Perhaps the biggest challenge of

PIM is its programmability. It is hard to anticipate

future model compression methods, so programma-

bility is required to adapt to these. PIM must also

support flexible parallelization since it is hard to

predict how much each dimension (of embedding

tables) will scale in the future.”

“First-Generation Inference Accelerator Deploy-

ment at Facebook”, Facebook, 2021 [1]

Emerging workloads in the areas of scientific computing,

graph processing, and machine learning pose unprecedented

demand for larger data. However, the well-known memory

bandwidth wall causes a critical performance bottleneck

for these memory-bound workloads, due to the widening

performance gap between processor and memory. Processing-

in-memory (PIM) architectures have been explored extensively

for decades [2], [3], [4], [5], as they help alleviate the memory

bandwidth bottleneck by moving compute logic closer to

memory. Unfortunately, the computing industry has so far been

lukewarm in commercializing PIM architectures, primarily

because of its high design overheads (e.g., regression in DRAM

density, thermal issues [6]) and its intrusiveness to the software

Under review.

stack (e.g., programming model [7], [8], [9], [10], managing

address space and data coherence [11], [12], [13]), rendering

PIM mostly an academic pursuit.

Interestingly, such sentiment towards PIM has changed

dramatically over the past couple of years with several

commercial PIM systems introduced to the market. These

PIM designs can broadly be grouped into two categories: 1)

domain-specific PIM and 2) general-purpose PIM. Domain-

specific PIM designs have been driven by key memory vendors

like Samsung [14], [15], [16], [17] and SK Hynix [18], [19],

which focus on specializing its PIM design by supporting key

compute primitives for a targeted application domain (e.g.,

matrix multiplication for accelerating deep neural networks),

), reigniting people’s interest in PIM designs [14], [17], [18],

[20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30],

[31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41],

[42]. At the other end of the spectrum, the PIM solution offered

by UPMEM [43] (henceforth referred to as UPMEM-PIM)

takes a different approach by providing a general-purpose

parallel programming language with an LLVM-based compiler

stack [44], [45], providing application developers the flexibility

to write any parallel program to run using PIM. Thanks to

its high programmability and flexibility, several recent work

studied the applicability of UPMEM-PIM for accelerating a

variety of application domains, e.g., graph algorithms, machine

learning, bioinformatics, etc. [46], [47], [48], [49], [50].

Similar to how GPUs have transformed itself into a first-class

computing citizen after years of hardware/software refinement,

we believe that it is possible for such general-purpose PIM

design to similarly evolve into an important computing device

(or at a minimum provide valuable insights in designing future

general-purpose PIM) as its hardware/software stack matures.

Given this landscape, our key motivation is to demystify

industry’s first general-purpose PIM design through a detailed

characterization study, understanding the unique properties of

UPMEM-PIM and identify important research domains that

computer architects can explore. To this end, we first develop

an UPMEM-PIM ISA compatible simulation framework that

utilizes UPMEM’s open-source compiler stack to compile any

UPMEM-PIM program, from its C-level source code down

to its machine level instructions. The compiled UPMEM-

PIM binary is then consumed by our cycle-level hardware

performance simulator, which we carefully cross-validate

against a real UPMEM-PIM system (Section III). Simulators

are, by design, immensely flexible and customizable, soPIM Chip

PIM

PIM Chip

Chip

PIM Chip

PIM

PIM Chip

Chip

Atomic

IRAM

WRAM

Memory Bus

DDR4

Interface

DRAM Processing Unit (DPU)

Chip

PIM Chip

Control / Status

Interface

PIM

PIM Chip

Chip

PIM Chip

PIM

PIM Chip

DRAM

PIM

PIM Chip

Chip

PIM Chip

DRAM

PIM

PIM Chip

Chip

CPU

(Host)

PIM Chip

DRAM

Chip

PIM Chip

DRAM

Bank

(MRAM)

DRAM

PIM Chip

Chip

it enables us to understand the fine-grained details of the

runtime execution of a real (UPMEM) PIM program. Using

our PIM simulator (henceforth referred to as PIMulator),

we conduct a workload characterization study and provide a

number of interesting insights that cannot be easily uncovered

using UPMEM-PIM chip’s hardware performance counters or

profiling tools (Section IV). Building on top of our characteri-

zation study, we present several case-studies on extending our

PIMulator to pathfind important research areas that require

further attention from computer architects (Section V). Overall,

we expect our in-depth exploration of UPMEM-PIM using our

versatile PIMulator to open up important research directions

for computer system designers, paving the way for PIM to

evolve into a truly general-purpose computing device.

Fig. 1: UPMEM-PIM hardware system overview.

B. Programming Model

UPMEM-PIM follows the single-program multiple-data

(SPMD) programming paradigm. A single program written

by the programmer gets executed by all the software threads

(i.e., tasklets) that are instantiated, but each individual thread

can take its own control flow and access different parts of a

data using its thread ID (tasklet ID). Since there can be up to

2,560 DPUs and 24 threads per DPU, the programmer must

carefully partition the input data, not only across the DPUs

(Figure 2(a), line 18-20, 22-24, and 26-29) but also across

the threads within each DPU (Figure 2(b), line 22-29). We

use Figure 2 as a running example to highlight some of the

important programming semantics of UPMEM-PIM.

Host-side programming. Any program that is written in

UPMEM’s C-like programming language can be compiled

into its machine code by using the LLVM [44]-based compiler

toolchain developed by UPMEM [45]. Similar to NVIDIA’s

CUDA [54], UPMEM-PIM follows the co-processor comput-

ing model where the CPU offloads the memory-intensive task

to the DPU, functioning as an arbiter of the PIM program’s

execution. Consequently, the UPMEM compiler generates two

binaries, one that runs on the host and the other that runs

across all the DPUs. In the host-side code (Figure 2(a)), the

programmer must (1) allocate the desired number of DPUs

to be used (line 8), (2) offload the program binary to all

the DPUs (line 9), (3) partition and send input data to the

DPU’s scratchpad (line 18-20) and per-bank DRAM (line 22-

24, and 26-29), (4) ask the host to send commands to the

DPUs to execute the PIM program (line 31), and (5) once the

PIM program terminates, retrieve back the results from DPU

memory back to the host CPU’s memory space (line 33-36).

DPU-side programming. A unique aspect of UPMEM-

PIM’s programming model is that all the PIM kernel’s working

set must be staged through DPU’s scratchpad using DMA

instructions. Consider the code snippet in Figure 2(b). Any

thread executing within the DPU can only load (store) data

from (to) the scratchpad (WRAM) address space but it is

not able to address data in the per-bank DRAM (MRAM)

address space directly (line 7, 42). In effect, DPUs operate

over two distinct memory address spaces, the slower but larger

per-bank DRAM region and the faster yet smaller scratchpad

region. Only when the programmer explicitly requests data

movements from the per-bank DRAM region to the scratchpad

region (using DMA instructions via mram_read(), line 39-

40) can the DPU threads access the necessary data from

the scratchpad using load/store instructions (line 9, notice

the pointers to the arrays A,B,C are dynamically allocated

II. UPMEM-PIM A RCHITECTURE

A. Hardware Architecture

System overview. Figure 1 provides a high-level overview

of an UPMEM-PIM based system containing a host-side CPU

communicating with a group of standard regular DIMMs and

another group of PIM-enabled memory DIMMs (UPMEM-

PIM modules). An UPMEM-PIM module is based on a

standard DDR4-2400 [51] DIMM form factor containing

8 UPMEM-PIM DRAM chips per each rank. Within each

UPMEM-PIM DRAM chip, there are 8 DPUs (DRAM

Processing Units), one DPU per each DRAM bank. Each

DPU has direct access to a dedicated 64 MB DRAM bank

(referred to as Main RAM, aka MRAM), a 64 KB SRAM-

based scratchpad memory (aka Working RAM, WRAM), and

24 KB instruction memory (aka Instruction RAM, IRAM).

Before an UPMEM-PIM program (i.e., the PIM kernel) is to

be executed, the host CPU must explicitly offload both the PIM

kernel and the input data from CPU’s conventional memory

address space (i.e., regular DIMMs) to DPU’s UPMEM-PIM

address space. The real PIM system we study in this work

contains 20 double-ranked UPMEM-PIM DIMMs, so a total

of (20 × 2 × 8 × 8)=2,560 DPUs concurrently execute as

co-processors to the CPU.

DPU architecture. The DPU is designed as an in-order

14-stage pipelined processor with a RISC-based ISA, imple-

menting fine-grained multi-threading. A total of 24 threads

(called tasklets by UPMEM) can concurrently execute within

a DPU, all of which share the scratchpad (WRAM), in-

struction memory (IRAM), and per-bank DRAM (MRAM).

The UPMEM DPU has a peculiar thread scheduling rule

where two consecutive instructions within the same thread

must be dispatched 11 cycles apart (UPMEM refers to such

microarchitecture as the revolver pipeline [52]). UPMEM states

that such scheduling constraint is enforced to obviate the need

to implement complicated circuitry for data forwarding and

pipeline interlocks [53]. Another unique aspect of the DPU

microarchitecture is in its register file (RF) design: the RF

is split into an even and odd RF and a thread cannot access

multiple even or odd registers at the same cycle (e.g., r0 and

r2 cannot be accessed at the same cycle) due to a structural

hazard (i.e., RF conflict).

2Processor

Processor

load/store

Scratchpad

(Shared Mem.)

load/store

DRAM

Scratchpad

(WRAM)

DMA

DRAM

(Global Mem.) (MRAM)

(a) CUDA (b) UPMEM-PIM

Atomic

IRAM

MRAM

WRAM

0xF0000100

0xF0000000

0x80008000

0x80000000

0x0C000000

0x08000000

0x00010000

0x00000000

Fig. 3: Memory model of (a) CUDA and (b) UPMEM-PIM. (c) The

(physical) address map of UPMEM-PIM.

DRAM, unlike UPMEM’s memory model which only allows

load/store semantics over the scratchpad (Figure 3(a,b)). In the

remainder of this paper, we refer to such model as UPMEM’s

scratchpad-centric programming model.

Data sharing and synchronization. Threads executing

within the same DPU can share data over the DPU scratchpad

or its local DRAM bank (MRAM). They can also synchronize

with each other by using mutexes, barriers, or semaphores allo-

cated in UPMEM-PIM’s atomic memory region (Figure 3(c)),

all of which are supported by UPMEM’s SDK [56].

However, threads executing in different DPUs cannot share

data or synchronize with each other directly. This is because

1) all the DPUs’ per-bank DRAM across the UPMEM-PIM

DIMM are not virtualized within a single, shared memory

address space (further discussed in Section II-C) 2) nor is

there a direct communication datapath among them. If data

sharing or synchronization across different DPUs is in need,

the programmer must first explicitly copy back the shared data

from the producer DPU’s memory to the CPU memory after

kernel terminates. The CPU can then copy back this shared

data from its CPU memory region to the consumer DPU’s

memory region during the next PIM kernel execution.

(a) Host-side code.

C. System Software

Virtual memory support. UPMEM-PIM does not have

a memory management unit (MMU) to virtualize its phys-

ical memory, so the DPU uses physical addresses when

accessing WRAM, IRAM, and MRAM, as illustrated in

Figure 3(c). In other words, when moving data across

UPMEM-PIM’s memory hierarchy using (1) load/store in-

structions (for scratchpad↔RF) or (2) DMA instructions

(for DRAM↔scratchpad), the memory addresses generated

by executing an instruction are used as-is, without any

address translation process involved (Figure 3(b)). Consider

the example in Figure 2. When the input array B is being

copied from the CPU to DPU’s per-bank DRAM (Figure 2(a),

line 26-29) and then from DPU’s DRAM to DPU’s scratchpad

(Figure 2(b), line 40), the programmer must carefully program

the pointer value to use as the destination (for CPU→DPU’s

DRAM) and source address (for DPU’s DRAM→DPU’s

scratchpad) within per-bank DRAM (MRAM) by using

DPU_MRAM_HEAP_POINTER_NAME (Figure 2(a), line 27-

29) or DPU_MRAM_HEAP_POINTER (Figure 2(b), line 26-

29) as the base physical address.

(b) DPU-side code.

Fig. 2: An element-wise vector addition program written for UPMEM-

PIM: (a) host-side and (b) DPU-side program.

at the scratchpad WRAM via mem_alloc() calls in line

35-37). This is similar to NVIDIA’s CUDA programming

model [54] where the programmer must explicitly orchestrate

data movements across the CPU memory and the GPU memory

using cudaMemcpy() (unless the programmer employs

Unified Memory [55]). CUDA however does allow threads

to directly load (store) from (to) both its scratchpad and its

3Atomic

IRAM

IRAM.bin

Assembly codes

Linker

Assembler

WRAM

WRAM.bin

memcpy:

.Lfunc_begin1:

add r2, r0, r2

.Ltmp5:

xor r3, r1, r0

.Ltmp6:

add r4, r0, 3

Pipeline

MRAM

(Bank)

MRAM.bin

Input data

ISA Extension & Mapping

Parser

Atomic.bin

ALU

DMA

Thread

Scheduler

Memory

Scheduler

UPMEM-DIMM

PIM

Chip

…

PIM

Chip

Initialization

Communication

Microarchitecture Design & Configuration

Lexer

Backend (cycle-level performance simulator)

Frontend (code/data gen)

void *

memcpy(void *dest, const void *src, size_t

len)

{

uint8_t *d = (uint8_t *)dest;

const uint8_t *s = (const uint8_t *)src;

...

}

Library source codes

vector_addition:

.Lfunc_begin0:

jeq r2, 0, .LBB0_2

.Ltmp0:

.LBB0_1:

lw r3, r1, 0

lw r4, r0, 0

add r3, r4, r3

void vector_addition(int *A, int *B, int *C) {

for (int i = 0; i < SIZE; i++)

C[i] = A[i] + B[i];

}

int main() {

vector_addition(A, B, C);

return 0;

}

Application source codes

Finalization

CPU

Fig. 4: PIMulator simulation framework overview.

Overall, the lack of a virtual memory support leaves the

programmer with the burden of reasoning about where the

input (output) data should be copied over to (from) within

DPU’s DRAM, hurting user productivity. Section V-C further

discusses the architectural implication of an MMU-less PIM.

Dynamic memory allocation. UPMEM provides three

distinct APIs for dynamic memory allocation within the

scratchpad (WRAM), mem_alloc(), fsb_alloc(), and

buddy_alloc(). The mem_alloc() (line 35-37 in Fig-

ure 2(b)) is a very simple memory allocator which simply

allocates size amount of region in WRAM’s heap in an

incremental manner. While simple, it is not able to free the

allocated memory once it is allocated. fsb_alloc() also

employs a simple memory allocation policy but it can suffer

from severe heap fragmentation issues. The alternative buddy

memory allocation [57] based buddy_alloc() does a better

job in reducing heap fragmentation but comes at the cost of

higher performance overhead. Dynamic memory allocation

using more sophisticated heap management algorithms like

tcmalloc [58] are not available and we conjecture that

this is due to DPU’s wimpy processing capabilities, causing

high latency overheads with sophisticated memory allocation

policies. Dynamic memory allocation within the DPU’s per-

bank DRAM (MRAM) is also not provided, as highlighted

by the manual derivation of where to copy input data from

CPU to DPU’s per-bank DRAM (Figure 2(a), line 22-29).

finally linking them into an UPMEM-PIM binary executable.

PIMulator utilizes UPMEM SDK’s preprocessor and compiler

as-is to first lower UPMEM program source files into multiple

assembly-level codes. These assembly codes are then fed into

our custom-designed linker (which is based on the open-source

ANTLR’s lexer and parser [59], [60], [61]) to go through the

lexical and syntax analyses to resolve the def-use relationships

of all the functions, code labels, etc. for linking. Finally, our

custom-designed assembler generates the final binary files to

upload into UPMEM-PIM’s atomic (i.e., mutex), IRAM (i.e.,

the UPMEM-PIM program), WRAM, and MRAM (i.e., the

input data) address spaces (Figure 4).

The reason why PIMulator employs a custom-designed

linker and assembler is as follows. We observe that the

current version of UPMEM linker is specifically tied to

UPMEM-PIM’s microarchitecture, preventing us from explor-

ing alternative PIM architectures. For instance, UPMEM’s

linker generates a linking error when the compiled program’s

instruction memory or scratchpad usage exceeds the physical

IRAM or WRAM capacity. As detailed later in Section V-D,

this paper presents a case study to evaluate the trade-offs

of employing an on-demand cache for UPMEM-PIM, as

opposed to UPMEM’s current scratchpad-centric design. Under

UPMEM’s programming model, this requires us to write the

UPMEM-PIM program that has a working set allocated in the

scratchpad (WRAM) space exceeding its 64 KB size, which is

subsequently re-mapped to the per-bank DRAM region in our

cycle-level hardware performance simulator. This allows us

to treat a DPU thread’s load/store transactions to scratchpad

as if they are to DRAM, so plugging in a cache simulator

in between the DPU and scratchpad (which is emulated as

DRAM) enables us to study the performance of caches vs.

scratchpads (Section V-D details our methodology for this

study). None of these features are available with UPMEM’s

current linker design, motivating us to implement our own

linker and assembler for a flexible simulator development and

design exploration.

III. PIM ULATOR S IMULATION F RAMEWORK

Figure 4 provides an overview of PIMulator, which consists

of two key components: (1) a compiler toolchain that supports

execution-driven simulation of UPMEM ISA-compatible,

machine-level instructions, and (2) a hardware performance

simulator cross-validated against a real UPMEM-PIM. To-

gether, these dual components reduce the effort required

to model UPMEM’s general-purpose PIM architecture with

high accuracy, enabling architectural exploration of any PIM

program written with UPMEM’s programming model.

Overall, PIMulator enjoys LLVM’s mature compiler stage

optimizations (e.g., common subexpression elimination [62])

by leveraging UPMEM’s existing preprocessor/compiler as-is

while also enabling diverse architectural explorations through

our custom-designed linker/assembler.

A. Simulator Development

Software compilation toolchain. The open-source UP-

MEM SDK [56] provides an LLVM [44]-based compiler

toolchain [45] (dpu-upmem-dpurte-clang) that takes in

(1) the programmer-written source codes and (2) glibc-style,

UPMEM-PIM compatible C library (e.g., mem_alloc()

for malloc in DPU scratchpad, memcpy(), printf())

to preprocess, compile, and assemble into binary objects,

Hardware performance simulator. We implement a cycle-

level performance simulator of UPMEM DPU by referring to

both UPMEM’s user manual and publicly available information

4TABLE II: PrIM benchmarks configurations used for the characteri-

zation and case studies conducted in this work.

TABLE I: PIMulator default configuration.

DPU processor architecture

Operating frequency

350 MHz

Number of pipeline stages

Number revolver scheduling cycles

IRAM / WRAM / MRAM size

24 KB / 64 KB / 64 MB

Atomic memory size

256 Bits

DRAM system

DDR specification

DDR4-2400 [51]

Memory scheduling policy

FR-FCFS

Row buffer size

1 KB

tRCD, tRAS, tRP, tCL, tBL

16, 39, 16, 16, 4 cycles

Communication

CPU→DPU bandwidth (per rank)

0.296 GB/s per DPU

CPU←DPU bandwidth (per rank)

0.063 GB/s per DPU

Benchmark

BFS

GEMV

HST-L

HST-S

MLP

RED

SCAN-RSS

SCAN-SSA

SEL

SpMV

TRNS

UNI

and discussion about the DPU’s microarchitecture [45], [52],

[53], [56], [63], [64]. As summarized in Table I, the baseline

DPU architecture is modeled as a 14-stage in-order pipelined

processor, faithfully modeling its revolver pipeline scheduling

algorithm and the structural hazard enforced at the odd/even

RF accesses (Section II-A). PIMulator functionally executes

the instructions to update its architectural state, allowing us

to verify the correctness of PIM program’s execution.

As for modeling the DRAM subsystem, rather than employ-

ing a highly accurate cycle-level DRAM simulator [65], [66],

[67], we based our implementation on GPGPU-Sim’s cycle-

level DRAM simulator for fast simulation time [68]. Because

the details of UPMEM-PIM’s memory scheduling policy is

not publicly available, we employ a first-row, first-come-

first-serve (FR-FCFS [69]) algorithm for scheduling memory

transactions. The communication latency of trasferring data

over the CPU↔DPU channel is simulated by employing

a fixed bandwidth model as its communication channel

(i.e., communication latency = transfer size/communication

bandwidth), whose value is carefully tuned by profiling a real

UPMEM-PIM system (Table I).

Because UPMEM-PIM implements the CPU↔DPU com-

munication using Intel AVX read (CPU←DPU) and write

(CPU→DPU) instructions [70], we observe asymmetric

CPU↔DPU communication bandwidth (i.e., the synchronous

AVX reads are lower throughput than the asynchronous AVX

writes), a phenomenon also reported in [46].

Dataset (single DPU)

2K vertices, 15K edges

32K elem., 4K queries

2K x 64, 64 x 1 elem.

128K elem., 256 bins

3 layers, 256 neurons

256 gene sequence

512K elem.

256K elem.

512K elem.

12K x 12K, 80519 elem.

128K elem.

2K elem., 64 queries

512K elem.

1M elem.

Dataset (multiple DPUs)

16K vertices, 120K edges

128K elem., 16K queries

8K x 64, 64 x 1 elem.

512K elem., 256 bins

3 layers, 1K neurons

512 gene sequence

2M elem.

1M elem.

2M elem.

14K x 14K, 316740 elem.

256K elem.

64K elem., 64 queries

2M elem.

4M elem.

arbitrary locations in the memory address space, a feature we

utilize to generate the proper instructions/data in our “cache

vs. scratchpad” case study discussed later in Section V-D

(Figure 14(b)). Similarly, PIMulator’s backend performance

model can also be extended to execute UPMEM-PIM’s SPMD

code over alternative hardware architectures. For instance,

one can maintain the same UPMEM-PIM’s code to execute

over an NVIDIA GPU style SIMD processor architecture by

modifying the backend performance model to be implemented

as a SIMT (single-instruction-multiple-thread) [54] vector

processor microarchitecture model, a case study we conduct

in Section V-A.

C. Simulator Validation

We validate our PIMulator using PrIM [71], an open-source

UPMEM-PIM benchmark suite (Table II). PrIM consists of

16 data-intensive workloads from various application domains

such as linear algebra, graph processing, neural networks, etc.

We verify PIMulator’s functional correctness as well as its

performance correlation to real UPMEM-PIM hardware by

running each individual PrIM benchmark with 1/2/4/8/16/24

threads under various input data sizes, cross-validating both

PIMulator and real UPMEM-PIM’s final output data as well

as its execution time. Among the 16 PrIM benchmarks,

PIMulator was able to compile and simulate 13 workloads

as-is. However, the remaining 3 workloads (BFS, SpMV,

NW) had minor bugs or utilizes undisclosed functions within

the UPMEM SDK, preventing its simulation and debugging

on PIMulator, so we modified these three benchmarks to

provide the same functionality of the original implementation

while being executable on top of PIMulator. As discussed

in Section III-A, the CPU↔DPU transfer (used for inter-

DPU communication) is modeled as a fixed bandwidth model,

so the frequency of inter-DPU communications can affect

the accuracy of PIMulator’s simulated execution time. To

separately analyze the fidelity of PIMulator’s DPU architecture

model and the effect CPU↔DPU communication model has

on system-level simulations, we separately report the validation

results of PIMulator when running the PrIM benchmark suite

(1) with just a single DPU executing without any inter-DPU

B. Simulator Availability and Extensibility

PIMulator is designed to cleanly decouple the SPMD based

frontend code/data generation with the backend performance

model with its modular design (Figure 4). Such design

philosophy is inspired by GPGPU-Sim [68] which similarly

utilizes NVIDIA’s CUDA compiler and PTX assembler at its

frontend to generate CUDA code/data, which is subsequently

consumed by its backend cycle-level GPU microarchitecture

simulator. As such, PIMulator can easily be extended to

model and evaluate architecture designs with alternative

software/hardware architectures (we later demonstrate PIMu-

lator’s extensibility via our case study in Section V). For

instance, one can modify PIMulator’s frontend code/data

generation stage to flexibly map the code/data binaries at

5BFS

GEMV HST-L HST-S MLP

RED SCAN SCAN

RSS SSA

SEL SpMV TRNS

UNI

Issuable

Idle (Memory)

BFS

GEMV HST-L HST-S MLP

Idle (Revolver)

Idle (RF)

25%

0% 50%

25% 50%

100%

75%

50%

25%

100%

75%

Memory (DRAM read bandwidth)

75%

Compute (IPC)

100%

RED SCAN SCAN

RSS SSA

SEL SpMV TRNS

UNI

Fig. 6: Breakdown of DPU’s runtime into active (black) and idle (red,

yellow, blue) cycles. When all the threads are idle, we categorize each

thread’s status based on the reason for its stall, i.e., memory (red),

revolver pipeline scheduling constraint (yellow), and the structural

hazard at the RF (blue).

Fig. 5: PrIM’s compute utilization (left axis) and memory read

bandwidth utilization (right axis) when executing with 1/4/16 threads.

While a DPU’s theoretical maximum DRAM bandwidth is 700

MB/sec, prior work [46] observed that the maximum bandwidth

is around 600 MB/sec in real UPMEM-PIM system. We therefore

configured PIMulator’s DRAM bandwidth accordingly. A single

DPU’s max compute throughput is set as 1 IPC.

DRAM bandwidth utilization. PrIM targets data-intensive

workloads that are traditionally categorized as memory-bound

under von-Neumann CPU/GPU architectures. As such, the

results in Figure 5 highlight the unique value proposition

of PIM vs. CPUs/GPUs, i.e., the performance bottleneck is

now shifted from the memory-bound regime to the compute-

bound territory. We observe similar performance results over

real UPMEM-PIM systems (prior work in [46] report similar

observations), demonstrating the fidelity of our PIMulator.

communication and (2) with multiple DPUs with inter-DPU

communication. For the single DPU validation, we used 710

data points whose execution times are within the range of 500

ms, showing 98.4% correlation against UPMEM-PIM with a

mean absolute error (MAE) of 12.0%. Under the multi-DPU

validation, PIMulator shows 83.6% correlation with MAE of

26.9% under 387 data points, with relatively larger absolute

errors observed when the inter-DPU communication time is

more pronounced.

B. Identifying Bottlenecks

While the workloads in PrIM generally exhibit a compute-

bound behavior, the results in Figure 5 imply that there are

still some performance left on the table. Using PIMulator,

we now root-cause the key bottlenecks in UPMEM-PIM’s

microarchitecture that cause such performance loss.

Latency breakdown. In Figure 6, we breakdown DPU’s

execution time into two categories: (1) the time when the thread

scheduler has one or more threads to issue into the pipeline

(black bar), and (2) when the scheduler is left idle with zero

threads to issue (all non-black bars) because all the threads

are either (2-a) waiting for a memory operation to be serviced,

(2-b) stalled due to the UPMEM’s revolver pipeline scheduling

constraint, or (2-c) stalled due to the structural hazard at the

odd/even register file (see Section II-A for revolver pipeline

& RF hazard). As the number of threads increases, the DPU

scheduler is provided with more thread-level parallelism to

populate its 14-stage pipeline, leading to larger fraction of the

runtime executing instructions. Nonetheless, several workloads

still suffer from non-negligible portion of its execution time

with idle cycles due to memory-side bottlenecks (BS, SpMV),

the structural hazards caused by the revolver pipeline and

odd/even RF (GEMV, HST-S, MLP, RED, TRNS, TS), or

both (BFS, NW, SCAN-RSS, SCAN-SSA, SEL, UNI). While

pipeline stalls due to memory operations are a fundamental

one that cannot be resolved easily through processor-side

optimizations, idle cycles originating from the revolver pipeline

scheduling constraint or odd/even RF hazard is an artifact of

UPMEM-PIM’s specific processor microarchitecture.

Thread-level parallelism (in space and time). To analyze

UPMEM-PIM’s performance bottleneck from a different

dimension, we measure the amount of thread-level parallelism

(TLP) available to the DPU scheduler in space (Figure 7) and

in time (Figure 8). In Figure 7, we categorize the number

of issuable threads available to the DPU scheduler to issue

D. Simulation Rate

Developing a detailed execution-driven simulator often

comes at the expense of increased simulation time. While

PIMulator is not multi-threaded, it achieves an average

simulation rate of 3 KIPS (kilo-instructions-per-second), which

is on par with other popular execution driven simulators like

GPGPU-Sim [68]. Because of UPMEM’s current programming

model and how its communication/synchronization primitive

work (Section II-B), DPUs mostly operate independently as

a standalone processor, so we expect parallelizing PIMulator

with multi-threading will lead to significant simulation rate

improvements. Supporting such feature is left as future work.

IV. D EMYSTIFYING UPMEM-PIM WITH PIM ULATOR

This section utilize PIMulator and the PrIM benchmark

suite [71] to demystify the internal runtime characteristics

of UPMEM-PIM, showcasing the applicability of PIMulator

for architectural exploration. We first focus on simulating

PrIM under a single DPU setting in Section IV-A, identifying

its bottleneck in Section IV-B, and finally discussing multi-

DPU execution with strong scaling in Section IV-C. Table II

summarizes the PrIM benchmarks and its input data sizes

we explore in this paper. Due to space constraints, when

sweeping the number of threads that execute a given PrIM

benchmark (collected over 1/2/4/8/16/24 threads), we only

show the results with 1/4/16 threads for brevity.

A. Analyzing Runtime Performance

Figure 5 shows the compute and memory bandwidth

utilization as a function of the number of concurrent threads

instantiated (1/4/16 threads). With the exception of BS and

SpMV, PrIM benchmarks generally exhibit a compute-bound

behavior, having a relatively higher compute utilization than

Time

(a) BS

Time

(b) GEMV

Number

issuable

Multiply, divide

etc.

GEMV HST-L HST-S MLP

Load/store to scratchpad

RED SCAN SCAN

RSS SSA

SEL SpMV TRNS

UNI

100%

75%

50%

25%

Kernel

CPU-to-DPU

DPU-to-CPU

DPU-to-DPU

Speedup

BFS

BS GEMV HST-L HST-S MLP NW RED SCAN SCAN SEL SpMV TRNS

RSS SSA

UNI VA

Fig. 10: Multi-DPU’s latency breakdown (left axis) and achieved

speedups (right axis) when strong-scaling PrIM across 1/16/64 DPUs.

All non-black bars represent communication latency.

BFS

Arithmetic with branch

Synchronization

Fig. 9: Instruction mix when executing with a single DPU.

Fig. 7: Number of issuable threads by DPU scheduler each cycle,

binned per each category (left axis) and average number of issuable

threads (right axis) when executing with 16 threads.

Arithmetic

DMA to/from DRAM

0% 8

25%

GEMV

50%

100%

75%

50%

25%

# Issuable threads

13 ~ 16

9 ~ 12

5 ~ 8

75%

1 ~ 4

100%

Time

large fraction of the instructions executed in HST-L and TRNS

dedicated to synchronization instructions (e.g., acquire,

release in UPMEM ISA). Future UPMEM ISA extensions

that enable busy waiting threads to transition into sleep mode

and only resume execution when they are woken up can

potentially reduce such inefficiency.

Fig. 8: Changes in the number of issueable threads (y-axis) in time

(x-axis) during the course of (a) BS, (b) GEMV, and (c) SCAN-SSA’s

execution. For clear visualization, the y-axis shows the number of

issuable threads averaged over 10,000 consecutive cycles (i.e., cycles

with zero issuable threads are not shown clearly as they are smoothed

out while averaging).

C. Strong Scaling with Multi-DPUs

into the pipeline by categorizing which bin it falls under. As

depicted, workloads suffering from sub-optimal performance

generally exhibit a higher fraction of its runtime with less TLP

(i.e., ‘0’ issuable threads in left axis of Figure 7), rendering

the DPU to lose compute throughput (Figure 5). While

insightful, the analysis in Figure 7 cannot capture the temporal

variation in TLP or any phase behavior at runtime, which can

add another level of insights for architectural exploration.

PIMulator enables the analysis of how TLP fluctuates as

execution progresses, as shown in Figure 8. Although some

workloads consistently exhibit low (BS) or high (GEMV) TLP,

others exhibit a mix of high-and-low TLP behavior (SCAN-

SSA), providing valuable insights to understand the runtime

dynamics of a workload.

Instruction mix. Finally, Figure 9 shows the instruction

mix of PrIM when executed with a 1/4/16 threaded single

DPU. PIMulator uncovers a couple of interesting observations

as follows. First, with the exception of BFS, the number

of load/store instructions to the scratchpad memory (red)

generally outweighs the number of DMA instructions to

the per-bank DRAM (yellow). This is an artifact of the

scratchpad-centric programming model of UPMEM-PIM,

i.e., the register data operands can only be loaded from

the scratchpad and the programmer must manually initiate

DRAM→scratchpad copies to move the working set closer

to the processor. Consequently, to make sure the scratchpad

accesses do not cause a performance bottleneck, the DPU

microarchitecture is designed to guarantee a short, single cycle

latency in handling load/store instructions. Second, although

the compute utilization of certain workloads like HST-L and

TRNS seemingly look decent (Figure 5), a significant portion

of its runtime is effectively wasted as it is busy waiting to

acquire locks (e.g., mutex_lock). This is illustrated by the

Figure 10 shows the latency breakdown when each PrIM

benchmark is parallelized across 1, 16, and 64 DPUs using

strong-scaling, i.e., benchmark’s working set remains identical,

so perfect strong-scaling will reduce latency proportional to

the number of DPUs. In general, the majority of PrIM’s per-

formance scales well when parallelized across multiple DPUs

because the communication size per DPU gets proportionally

reduced as a function of the DPUs concurrently executing.

BFS, BS, and NW, however, exhibit sub-linear scaling because

the communication size gets larger as the number of DPUs is

increased. It is also worth noting that for some benchmarks

like SCAN-RSS, SCAN-SSA, SEL, UNI, and VA, transferring

input (CPU→DPU) and output (DPU→CPU) data dominates

the total execution time. For these benchmarks, the latencies

to copy the input/output data over CPU↔DPU channel are

not being effectively hidden by overlapping it with DPU’s

kernel execution time. Future versions of UPMEM SDK

which provide programming semantics that facilitate flexible

kernel partitioning and task scheduling (e.g., CUDA stream,

CUDA dynamic parallelism [54]) will likely enable further

performance improvements.

V. P ATHFINDING F UTURE PIM A RCHITECTURES

In this section, we demonstrate PIMulator’s availability

and extensibility through a series of case studies, showing

how PIMulator can be utilized to explore architectural sup-

port required for future PIM architectures to become more

performant, robust, and secure.

A. Case Study #1: UPMEM-PIM with SIMT Processing

The baseline UPMEM-PIM employs a scalar processor

leveraging thread-level parallelism to maximize performance.

7(a)

0x4270

is only alleviated by the more aggressive design which scales

up MRAM bandwidth further with SIMT+AC+4x/16x.

Key takeaways: UPMEM-PIM’s SPMD programming

model makes its hardware architecture amenable for data-

parallel processing via SIMT vector execution. UPMEM-

PIM’s baseline memory system, however, is not sufficiently

provisioned to sustain the higher DRAM read bandwidth

requirements of vector execution, resulting in limited speedup

with a naively implemented SIMT PIM design. Properly

optimizing the PIM memory system to maximize bandwidth

utilization (e.g., memory address coalescing, higher DRAM

read bandwidth) will thus be crucial for future SIMT vector

based PIM designs to fully unlock the potential of SIMT.

1011 8

25% 15 12

50% …

Idle (Revolver)

75% …

Idle (Memory)

IPC

…

Issuable

Idle (RF)

100% 0x2450

DRAM

Bank

1100 0

Vector

Active

Mask

Fetch

Warp

Instruction

Memory

Decode

SIMT Thread

Scheduler

(b)

Fig. 11: (a) SIMT based DPU architecture modeled using PIMulator,

(b) performance (right axis) achieved for GEMV. The max IPC of

Base and all SIMT designs are 1 and 16, respectively.

Recent domain-specific PIMs [14], [19], on the other hand,

leverage data-level parallelism by employing vector processing

to boost its performance for key machine learning primitives

(e.g., matrix multiplication). We observe that UPMEM’s

SPMD programming model renders its hardware architecture

to similarly reap out performance benefits of data-parallel

execution by employing a SIMT (single-instruction-multiple-

thread) vector processor [54]. In this subsection, we augment

the baseline UPMEM-PIM architecture as follows to analyze

the performance benefits of employing SIMT vector processing.

First, the processor pipeline is augmented with a vector register

file which an N -way vector unit read (write) vector operands

from (to). Similar to the notion of “warps” in CUDA, we group

N consecutive tasklets as the (grouped) thread scheduling

granularity to the N -way vector unit which executes N scalar

instructions in lockstep for vector processing. Similar to SIMT

GPUs, a memory address coalescing operation [54] is applied

among the grouped N scalar load/store instructions which

helps maximize memory bandwidth utilization by minimizing

the effect of SIMT memory divergence [72]. SIMT control

divergence [73] is handled dynamically at runtime using

each individual thread’s program counter values to only

execute scalar threads executing the same instruction over

the vector lanes, masking out inactive threads from execution

as appropriate, similar to how recent NVIDIA GPUs (post

Volta GPU) handle SIMT control divergence [74].

Figure 11 shows the performance achieved for GEMV,

a key primitive in machine learning which recent domain-

specific PIMs are optimized for. The figure first shows baseline

UPMEM-PIM (Base), which is additively augmented with 1)

16-way SIMT vector unit without memory address coalescing

(SIMT) and 2) with address coalescing (SIMT+AC). Both

of these SIMT design points have MRAM read bandwidth

identical to Base. Finally, another design point that scales

up MRAM read bandwidth by increasing DRAM operating

frequency by 4×/16× (SIMT+AC+4x/16x) is explored to

evaluate the upperbound performance with SIMT. As depicted,

augmenting UPMEM-PIM with a 16-way vector unit (SIMT)

provides a mere 2.6× speedup as performance is heavily

bottlenecked on MRAM read bandwidth. Adding the memory

coalescer (SIMT+AC) helps better utilize memory bandwidth

and provides an additional 1.9× speedup vs. SIMT (4.6× vs.

Base). Even with memory address coalescing (AC), however,

the memory-boundedness of SIMT execution persists with

SIMT+AC, leaving performance left on the table, one which

B. Case Study #2: ILP-enhanced PIM Architectures

Today’s commercial PIM processors employ a simple, in-

order pipeline without any sophisticated microarchitectures

to extract ILP for higher performance (e.g., superscalar,

super-pipelining, . . .) [14], [15], [18], [53]. As discussed in

Figure 6, such wimpy PIM processor design point leaves

significant performance left on the table, as conventionally

memory-bound workloads now fall under the compute-bound

regime with PIM (Section IV-A). We believe such design

decision was inevitable because current generation of PIM

processors are fabricated on a density-optimized technology

node (e.g., ≥20 nm DRAM technology for HBM-PIM and

UPMEM-PIM [15], [46]) posing several design constraints that

prevent advanced microarchitecture designs. That being said,

future PIM architectures with more flexibility in area, power,

and thermal budget can certainly consider relatively more

aggressive, performance-oriented design points with higher

operating frequency and ILP-enhancing microarchitectures.

In this case study, we use PIMulator to see how much

performance can be unlocked in PrIM’s “compute-bound”

workload by augmenting UPMEM-PIM’s DPU with ILP

enhancing optimizations. Figure 12 summarizes our ablation

study on how much the baseline UPMEM-PIM’s performance

(denoted “Base”) can be improved by adding the following

features in an additive manner: (D) address the scheduling

constraint enforced with baseline revolver pipeline by enabling

data forwarding across instructions without data dependencies

within the same thread to execute, (R) merging the odd/even

RF into a single one but double the read bandwidth to eliminate

baseline RF’s structural hazard, (S) 2-way superscalar in-order

pipeline to better leverage parallelism, and finally (F) doubling

DPU’s operating frequency to 700 MHz. As depicted, the

addition of these microarchitecture techniques substantially

improve the performance of PrIM’s compute-bound workloads

(avg 2.7×, max 6.2× speedup) as it successfully addresses the

performance bottlenecks discussed in Figure 6. Interestingly,

with the addition of (D+R+S) features to baseline UPMEM-

PIM, several workloads become relatively more memory-

bound (i.e., larger fraction of Idle(Memory)) so the benefits

of higher operating frequency (F) is less pronounced for these

workloads (e.g., GEMV, VA). A fundamental reason why

baseline UPMEM-PIM cannot fully reap out the potential of

8HST-L

HST-S

MLP

RED

SEL

SpMV

TRNS

UNI

SCAN-SSA

SCAN-RSS

5.6

4.2

2.8

1.4

Speedup

Idle (RF)

GEMV

Idle (Revolver)

Idle (Memory)

BFS

Issuable

Base+(D+R)

Base+(D+R+S)

Base+(D+R+S+F)

100%

75%

50%

25%

Base

Base+(D+R+S+F)

C. Case Study #3: Multi-tenant Execution in PIM

Multi-tenancy is one of the most important features for

processors to provide for cloud deployment as it helps better

saturate the processor’s compute and memory resources, reduc-

ing its total cost of ownership. As such, current CPUs/GPUs

come with a variety of hardware/software features that

support multi-tenancy with performance isolation and security

guarantees (e.g., CPU cache partitioning [75], [76], NVIDIA’s

multi-instance GPU [77], etc. [78], [79]). Given UPMEM-

PIM’s abundant compute and memory throughput (e.g., an

aggregate compute and memory throughput of 0.896 TOPS

and 2.5 TB/sec of memory bandwidth in a 40 ranked UPMEM-

PIM system), having a proper multi-tenancy support will be

vital for future PIM architectures, especially when seeking for

industrial adoption by cloud vendors.

Unfortunately, our case study reveals that current com-

mercial PIM chips (whether it be UPMEM-PIM or domain-

specific PIMs [14], [15], [18], [19]) are not able to meet

the requirements of multi-tenancy, not just from a hard-

ware/software perspective, but also from its programming

model’s perspective. Due to space limitations, let us focus

our attention on two important conditions to be met for multi-

tenancy. First, co-located workloads should securely execute

without interfering with each other (i.e., “security” guarantees).

Second, co-located workloads must not be aware of the fact

that they are concurrently executing (i.e., “transparency” to

co-located applications). We discuss each of these challenges

below.

Security. One of the fundamental architectural support

that is needed for secure execution is isolating the mem-

ory address space of co-located applications using MMU’s

address translation capability. Practically all commercial PIM

systems [14], [15], [18], [53] are designed without an MMU,

a point we emphasized in Section II-C with UPMEM-PIM’s

physical addressing based memory model. Therefore, co-

locating multiple workloads with address space isolation is

fundamentally impossible in MMU-less PIM architectures.

Aside from such security benefits, PIM chips with an MMU

can greatly enhance programmer productivity by virtualizing

the memory address space, i.e., it can separate the physical

allocation of data in PIM DRAM against its logical allocation

within the virtual address space. As discussed in Section II-B,

copying data from CPU to UPMEM-PIM’s DRAM (MRAM)

requires the programmer to painstakingly derive the physical

Fig. 12: Ablation study to explore UPMEM-PIM’s possible performance improvements when baseline DPU with 16 threads is additively

enhanced with data forwarding logic (D), unified RF with 2× read bandwidth to remove hazards at RF (R), 2-way superscalar in-order

pipeline (S), and doubling the operating frequency to 700 MHz (F).

BFS

BS GEMV HST-L HST-S MLP

RED SCAN SCAN SEL SpMV TRNS

RSS SSA

UNI

Fig. 13: Speedup achieved when scaling up the MRAM-to-WRAM

bandwidth by four times (×1 to ×4). The evaluated design points

are 1) baseline UPMEM-as-is (Base) and 2) UPMEM with all the

ILP optimizations (Base+(D+R+S+F)) discussed in Figure 12.

these ILP optimizations is because of the large performance

gap between WRAM bandwidth (2, 800 MB/sec) vs. MRAM-

to-WRAM bandwidth (600-700 MB/sec). More concretely, for

those workloads exhibiting low data locality, the performance

becomes relatively MRAM access bound and renders any op-

timizations that resolve the compute-boundness of a workload

ineffective (e.g., all data points exhibiting high fraction of

Idle(Memory) in Figure 12). Note that the existing 600-700

MB/sec of MRAM-to-WRAM bandwidth is not a fundamental

constraint because the maximum memory bandwidth that can

be reaped out at the MRAM (DRAM) “bank” level is much

higher (up to several GB/sec of bandwidth), i.e., the limited

600-700 MB/sec of MRAM bandwidth is simply a design point

pursued by UPMEM-PIM architects for this particular PIM

design. Using PIMulator, we conduct a sensitivity study that

scales up the MRAM-to-WRAM read bandwidth and analyze

its performance implication for memory-bound workloads. As

shown in Figure 13, the speedup is more pronounced with

the ILP-enhanced UPMEM-PIM designs (red lines) because

it exhibits more memory-boundedness as shown in Figure 12.

Contrarily, benefit of MRAM bandwidth scaling is minimal for

workloads still exhibiting compute-boundedness even under

ILP optimizations (HST-L, HST-S, MLP, TRNS, TS). Same

principle holds for the baseline UPMEM as-is (blue lines)

where the only noticeable speedup with MRAM scaling is

observed only for BS and SpMV which are already heavily

memory-bound even without ILP optimizations, the other

remaining compute-bound workloads achieving little speedup.

Key takeaways: Many data-intensive workloads exhibit a

compute-bound behavior when executed with PIM. Enhancing

PIM’s computational throughput will therefore become much

more important in future PIM designs. Using PIMulator, we

demonstrate the efficacy of various ILP-enhancing microar-

chitectural techniques for future PIM architectures, improving

the performance of several compute-bound PIM workloads.

9Linker

BSS Section

MRAM Section

iram.bin

wram.bin

mram.bin

Instruction

Memory

(IRAM)

DATA Section

Hardware Performance Simulator

TEXT Section

location in DPU’s DRAM because UPMEM ISA is currently

based on physical addressing. Having a proper MMU support

will enable more flexible allocation of data across the physical

address space and can also provide “pointer-is-a-pointer”

semantics to enhance programmability [55], [80], [81], [82].

In this case study, we add a hardware MMU to UPMEM-

PIM, using our PIMulator, and study its performance im-

plications. Our MMU model employs a single page-table

walker (page size of 4 KB) backed with a single-level, 16-

entry fully-associative TLB. The page-tables are assumed

to be stored in DPU’s local DRAM bank and the access

latency to the TLB is assumed as a single DPU clock cycle.

We utilize such mechanism to translate all DRAM (MRAM)

accesses from its virtual address to its physical address and

measure its performance regression. Overall, PrIM experiences

an average performance loss of 0.8% (max 14.1%) by adding

address translations to DPU’s memory accesses. Such low

performance overhead is mainly attributed to UPMEM’s

scratchpad-centric memory model where data transfers across

DRAM↔scratchpad are orchestrated in coarse-grained chunks

(several KBs) over DMA instructions, exhibiting high spatial

locality and thus achieving high TLB hit rates. Furthermore,

DPU cores are clocked at 350 MHz frequency, rendering its

memory access latency to be in the range of several tens of

DPU clock cycles (unlike CPUs/GPUs which operate in the

GHz range and exhibit hundreds of CPU/GPU processor cycles

of memory latency), experiencing much less TLB miss penalty

than CPUs/GPUs. Overall, our case study demonstrates the

practicality of adding a functional MMU architecture to future

PIM designs.

Transparency. We believe that multi-tenant execution under

the current UPMEM programming model is not practical be-

cause of its scratchpad-centric programming model. Consider

a scenario where we seek to co-locate two PrIM benchmarks,

a memory-bound BS and a compute-bound TS, which exhibit

complementary resource requirements (as quantified in Fig-

ure 5) and are perhaps the perfect candidates for multi-tenant

execution. Unfortunately, the BS and TS each require using

the same scratchpad (WRAM)’s heap via a memory allocation

API call (mem_alloc() in UPMEM SDK, e.g., line 35-37

in Figure 2(b)), which leads to exceeding the total size of

scratchpad (WRAM)’s heap size. Consequently, co-locating

both of these workloads requires non-trivial amount of changes

to both co-located PrIM programs, arguably an unacceptable

requirement to enforce to end-user applications. More crucially,

it directly violates the transparency requirement we previously

discussed, rendering a scratchpad-centric PIM programming

model ill-suited for multi-tenant execution.

Consequently, our analysis reveals that future PIM should

also employ on-demand caches, rather than singlehandedly

relying on scratchpads, to reap data locality benefits. PIM

programs running on top of an on-demand cache will be able

to leverage data locality in an opportunistic manner while not

having to change the program itself. In the next subsection, our

next case study details the feasibility of supporting on-demand

cache architectures for future PIM designs.

Working

Memory

(WRAM)

DRAM Bank

(MRAM)

DDR4-2400

Configuration

(a) Scratchpad-centric UPMEM-PIM architecture

Linker

BSS Section

MRAM Section

wram.bin

DATA Section

Hardware Performance Simulator

TEXT Section

Instruction

Cache

Data

Cache

Working

Memory

(WRAM)

DDR4-2400

Configuration

(b) Cache-centric UPMEM-PIM architecture

Fig. 14: Modeling a (a) scratchpad-centric and (b) cache-centric

UPMEM-PIM architecture in PIMulator.

Key takeaways: Supporting multi-tenancy in PIM requires

security and transparency to the co-located workloads. To

enhance security in PIM architectures, we augment PIMulator

with an MMU to quantify the performance overheads of

address translations and observe an average 0.8% (max

14.1%) latency increase, demonstrating the practicality of an

MMU-enabled PIM design. Guaranteeing transparency to co-

located PIM workloads under UPMEM’s current, scratchpad-

centric programming model is a different story, however, as it

requires significant changes to the co-located programs, a non-

option to begin with for transparent multi-tenant execution.

Having an on-demand cache architecture supported in PIM

can bridge this gap, opening the door for multi-tenant PIM

architectures.

D. Case Study #4: On-demand Caches vs. Scratchpads

As discussed in Section II-B, UPMEM’s scratchpad-centric

programming model requires the programmer to explicitly

orchestrate the data movements across two distinct ad-

dress spaces, the DRAM space and scratchpad space (i.e.,

MRAM↔WRAM). This is because the DPU threads can only

load (store) data from (to) the scratchpad but cannot directly

address data mapped in the DRAM space. Using PIMulator’s

custom-designed linker, this subsection conducts the cache vs.

scratchpad case study based on the following methodology.

1) The open-source UPMEM compiler does not limit the

data size the programmer can allocate and copy into

WRAM (scratchpad) space. Concretely, compiling an

UPMEM-PIM program to an assembly-level code whose

scratchpad allocation size exceeds the physical WRAM

capacity in itself does not cause any compilation errors.

During the linking process, however, if the WRAM data

allocation size exceeds the physical WRAM capacity, the

UPMEM linker generates a linking error as the hardware

UPMEM-PIM chip cannot execute the compiled codes

properly (see Section III-A for discussion on UPMEM

linker’s key properties).

2) Because PIMulator’s linker is designed to flexibly relo-

cate and map a given address region to anywhere in the

physical address space, we take the following measure to

10BFS

GEMV HST-L HST-S MLP

RED SCAN SCAN SEL SpMV TRNS

RSS SSA

UNI

1.2

0.8

0.6

0.4

0.2

1 1

Fig. 15: Performance of scratchpad-centric vs. cache-centric UPMEM-

PIM (normalized to scratchpad-centric design).

2 2 4 4 8 8 16 16

Number of threads

(a)

1.2

0.8

0.6

0.4

0.2

1.2

0.8

0.6

0.4

0.2

1.2

0.8

0.6

0.4

0.2

Bytes read (cache-centric)

Latency (cache-centric)

Bytes read (scratchpad-centric)

Latency (scratchpad-centric)

Cache-centric

Scratchpad-centric

4.7 6.8

8 16

Number of threads

(b)

Fig. 16: Bytes read from DRAM (left axis, normalized) and end-to-

end execution time (right axis, normalized) for (a) BS and (b) UNI.

emulate an alternative, cache-centric UPMEM-PIM (a)

whose DPU threads can directly address data allocated

in DRAM without having to move data to the scratchpad

(i.e., there is no notion of scratchpad under this model),

while (b) also leveraging data locality by storing recently

accessed data within the cache.

3) PIMulator emulates cache-centric UPMEM-PIM as

follows. First, the input data is allocated directly in

the WRAM (scratchpad) address space, unlike the

baseline UPMEM model whose input data must be

copied from MRAM (per-bank DRAM) to WRAM

using DMA instructions. The WRAM-allocated input

data, which is directly addressable by the DPU threads

using load/store instructions (as compiled by the original

UPMEM compiler), is then relocated by PIMulator’s

linker to be mapped into a physical address region which

is backed by our cycle-level hardware performance

simulator, modeling it as a DDR4-2400 [51] compatible

DRAM device (Figure 14(b)). By adding a cycle-level

cache simulator in between the DPU processor and

the (DRAM-emulated) WRAM address space, the data

referenced by the load/store instructions will be stored

on-demand to this cache simulator, allowing us to

explore the cache vs. scratchpad design space.

Figure 15 compares the performance of scratchpad vs. cache

in UPMEM-PIM for PrIM. The cache-centric UPMEM-PIM

employs an instruction cache and a data cache, each configured

as an 8-way set-associative cache with LRU replacement policy

and 24 KB and 64 KB capacity, respectively, identical to

the instruction memory (IRAM) and scratchpad (WRAM)

space provisioned under the baseline UPMEM-PIM. For

certain workloads, scratchpad performs better than caches

(e.g., UNI) while the opposite also holds true for others (e.g.,

BS). To better understand the reason behind such results,

Figure 16 shows the number of bytes read from DRAM

during the course of BS and UNI’s execution. In general,

we can observe that the execution time is highly correlated

with the number of bytes read. For example, under the

memory-bound BS, the scratchpad based execution with 16

threads incurs 5.1× higher memory read traffic than using

caches. For BS, it is challenging to statically estimate the

right amount of data to upload into the scratchpad, which

results in a severe overfetching of useless data and causing

a performance bottleneck to this memory-bound workload.

Under such scenario, a purely on-demand caching strategy

performs much favorably in terms of fetching (relatively)

the right amount of data within the cache and maximizes

data locality. In contrast, workloads like UNI performs much

better with scratchpads where carefully orchestrating data

movements perform better than the opportunistic cache design.

Determining which design point is more favorable for PIM

architectures is not the purpose of this case study. Rather we

seek to demonstrate the practical benefits and feasibility of a

cache-centric PIM architecture, motivating future work in this

research space.

Key takeaways: Similar to conventional CPUs/GPUs, an on-

demand cache design can do a better job in leveraging locality

for PIM when its memory access pattern cannot be optimally

determined at compile time, a scenario where scratchpad

based design points can perform poorly.

VI. R ELATED W ORK

The initial concept of PIM dates back in the 1970s [83]

with numerous follow-on works [2], [3], [4], [5]. With the

proliferation of today’s domain-specific architectures, there

exists a variety of PIM or near-memory processing studies [14],

[16], [18], [20], [21], [22], [23], [24], [25], [26], [27], [28],

[29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39],

[40], [41], [42], [84], [85], [86], [87], [88], [89], [90], [91],

[92], [93], [94], [95], [96], [97]. There is also several prior

work on PIM exploring compiler issues [8], [10], [98], data

coherency [11], [12], [13], [99], synchronization [100], QoS

aware runtime and scheduling for PIM [101], among many

others [102], [103], [104], [105], [106], [107], [108]. This

paper focuses on characterizing the first real-world general-

purpose PIM via our PIMulator, pathfinding important research

directions for future PIMs. Below we summarize other relevant

works on characterizing real-world PIM and PIM simulators.

Analysis on real-world PIM devices. There has been

several recent works that characterizes commercial PIM tech-

nologies [14], [46], [47], [50], [71], [109], [110], [111], [112],

[113], [114]. Gómez-Luna et al. [46], [71] developed the PrIM

benchmark suite and conducted a workload characterization

on UPMEM-PIM. There is also several work exploring the

applicability of UPMEM-PIM for accelerating dense/sparse

linear algebra, databases, data analytics, graph processing,

bioinformatics, image processing, compression, simulation,

encryption, and etc [46], [47], [50], [71], [109], [110], [113],

[114], with more recent work exploring UPMEM-PIM’s

applicability for accelerating machine learning [112]. Lee

et al. [14] discusses the hardware/software architecture for

Samsung’s HBM-PIM architecture. There is also a recent

11TABLE III: Comparison of PIMulator vs. other PIM simulators.

ISA

Implementation

Frontend

(Trace vs. Execution)

ISA & Linker

Customization

Validation Against

Real PIM Hardware

Multi-threaded

Simulation

Lines of Code

(LoC)

Simulation Rate

(KIPS)

PIMSim

[115]

x86, ARM,

SPARC

Ramulator

-PIM [116] MultiPIM

[120] MPU-Sim

[119] PIMulator

x86 x86 PTX UPMEM

In-house In-house

Execution Execution

In-house Zsim

+ Ramulator Trace Trace

+ Execution Zsim

+ Ramulator

+ BookSim

Trace

+ Execution X X X X O

X X X X O

X O O X X

30 K 35 K 92 K 12 K 52 K

N/A N/A N/A N/A 3

[2]

[3]

[4]

[5]

[6]

work by Liu et al. [16] which explores the applicability of

Samsung’s near-memory processor AxDIMM for accelerating

recommendation models.

Simulators for PIM. PIMSim [115] supports a configurable

PIM logic modeling, memory organization, interconnection,

and provides co-simulation with other simulation frameworks.

Ramulator-PIM [66], [116], [117] integrates ZSim [118] and

Ramulator [66] to simulate PIM-enabled memory. MPU-

Sim [119] models a near-bank processing architecture which

supports NVIDIA CUDA’s SIMT programming model [54].

MultiPIM [120] enables the simulation of PIM systems based

on 3D stacked memory with features to explore multi-stack

interconnects with virtual memory support. Our PIMulator

framework is unique in that it targets a real-world commercial

PIM system based on UPMEM-PIM. Table III summarizes

key differences between PIMulator and other PIM simulators.

[7]

[8]

[9]

[10]

[11]

[12]

VII. C ONCLUSION

In this work, we present a novel simulation framework

named PIMulator which targets UPMEM’s commercial

general-purpose PIM architecture. Using PIMulator, we present

our detailed characterization on wide range of real PIM

programs and showcase PIMulator’s applicability for com-

puter architecture research. Furthermore, we conduct several

case studies on extending our PIMulator to pathfind critical

research areas that require further investigation from computer

architects, which we believe will be vital for future PIM

architectures to evolve into a first class computing citizen.

[13]

[14]

[15]

R EFERENCES

[1] M. Anderson, B. Chen, S. Chen, S. Deng, J. Fix, M. Gschwind,

A. Kalaiah, C. Kim, J. Lee, J. Liang, H. Liu, Y. Lu, J. Montgomery,

A. Moorthy, S. Nadathur, S. Naghshineh, A. Nayak, J. Park, C. Petersen,

M. Schatz, N. Sundaram, B. Tang, P. Tang, A. Yang, J. Yu, H. Yuen,

Y. Zhang, A. Anbudurai, V. Balan, H. Bojja, J. Boyd, M. Breitbach,

C. Caldato, A. Calvo, G. Catron, S. Chandwani, P. Christeas, B. Cottel,

B. Coutinho, A. Dalli, A. Dhanotia, O. Duncan, R. Dzhabarov, S. Elmir,

C. Fu, W. Fu, M. Fulthorp, A. Gangidi, N. Gibson, S. Gordon, B. P.

Hernandez, D. Ho, Y.-C. Huang, O. Johansson, S. Juluri, S. Kanaujia,

M. Kesarkar, J. Killinger, B. Kim, R. Kulkarni, M. Lele, H. Li, H. Li,

Y. Li, C. Liu, J. Liu, B. Maher, C. Mallipedi, S. Mangla, K. K.

Matam, J. Mehta, S. Mehta, C. Mitchell, B. Muthiah, N. Nagarkatte,

A. Narasimha, B. Nguyen, T. Ortiz, S. Padmanabha, D. Pan, A. Poojary,

Y. Qi, O. Raginel, D. Rajagopal, T. Rice, C. Ross, N. Rotem, S. Russ,

K. Shah, B. Shan, H. Shen, P. Shetty, K. Skandakumaran, K. Srinivasan,

R. Sumbaly, M. Tauberg, M. Tzur, S. Verma, H. Wang, M. Wang,

B. Wei, A. Xia, C. Xu, M. Yang, K. Zhang, R. Zhang, M. Zhao,

W. Zhao, R. Zhu, A. Mathews, L. Qiao, M. Smelyanskiy, B. Jia,

[16]

[17]

[18]

and V. Rao, “First-Generation Inference Accelerator Deployment at

Facebook,” in arxiv.org, 2021.

D. Elliott, M. Stumm, W. Snelgrove, C. Cojocaru, and R. Mckenzie,

“Computational RAM: Implementing Processors in Memory,” IEEE

Design & Test of Computers, vol. 16, no. 1, pp. 32–41, 1999.

D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton,

C. Kozyrakis, R. Thomas, and K. Yelick, “A Case for Intelligent

RAM,” IEEE Micro, vol. 17, no. 2, pp. 34–44, 1997.

M. Oskin, F. Chong, and T. Sherwood, “Active Pages: A Computation

Model for Intelligent Memory,” in Proceedings of the International

Symposium on Computer Architecture (ISCA), 1998.

J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss,

J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca,

“The Architecture of the DIVA Processing-In-Memory Chip,” in

Proceedings of the 16th International Conference on Supercomputing

(ICS), 2002.

N. S. Kim, “Practical Challenges in Supporting Function in Memory,”

in IEEE Asian Solid-State Circuits Conference (A-SSCC), 2018.

O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, “A

Modern Primer on Processing in Memory,” in arxiv.org, 2020.

K. Hsieh, E. Ebrahim, G. Kim, N. Chatterjee, M. O’Connor, N. Vi-

jaykumar, O. Mutlu, and S. W. Keckler, “Transparent Offloading

and Mapping (TOM): Enabling Programmer-Transparent Near-Data

Processing in GPU Systems,” in Proceedings of the International

Symposium on Computer Architecture (ISCA), 2016.

A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir,

O. Mutlu, and C. R. Das, “Scheduling Techniques for GPU Architec-

tures with Processing-In-Memory Capabilities,” in Proceedings of the

International Conference on Parallel Architectures and Compilation

Techniques (PACT), 2016.

A. Devic, S. B. Rai, A. Sivasubramaniam, A. Akel, S. Eilert, and J. Eno,

“To PIM or Not for Emerging General Purpose Processing in DDR

Memory Systems,” in Proceedings of the International Symposium on

Computer Architecture (ISCA), 2022.

J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions:

A Low-Overhead, Locality-Aware Processing-in-Memory Architec-

ture,” in Proceedings of the International Symposium on Computer

Architecture (ISCA), 2015.

A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia,

R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng,

and O. Mutlu, “CoNDA: Efficient Cache Coherence Support for Near-

data Accelerators,” in Proceedings of the International Symposium on

Computer Architecture (ISCA), 2019.

A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh,

K. T. Malladi, H. Zheng, and O. Mutlu, “LazyPIM: An Efficient Cache

Coherence Mechanism for Processing-in-Memory,” IEEE Computer

Architecture Letters, vol. 16, no. 1, pp. 46–50, 2017.

S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee,

K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn,

and N. S. Kim, “Hardware Architecture and Software Stack for PIM

Based on Commercial DRAM Technology : Industrial Product,” in

Proceedings of the International Symposium on Computer Architecture

(ISCA), 2021.

Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son,

O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi,

H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim,

S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song,

J. Youn, K. Sohn, and N. S. Kim, “25.4 A 20nm 6GB Function-In-

Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable

Computing Unit Using Bank-Level Parallelism, for Machine Learning

Applications,” in Proceedings of the International Solid State Circuits

Conference (ISSCC), 2021.

L. Ke, X. Zhang, J. So, J.-G. Lee, S.-H. Kang, S. Lee, S. Han, Y. Cho,

J. H. Kim, Y. Kwon, K. Kim, J. Jung, I. Yun, S. J. Park, H. Park,

J. Song, J. Cho, K. Sohn, N. S. Kim, and H.-H. S. Lee, “Near-Memory

Processing in Action: Accelerating Personalized Recommendation with

AxDIMM,” IEEE Micro, vol. 42, no. 1, pp. 116–127, 2022.

J. H. Kim, S.-H. Kang, S. Lee, H. Kim, Y. Ro, S. Lee, D. Wang,

J. Choi, J. So, Y. Cho, J. Song, J. Cho, K. Sohn, and N. S. Kim,

“Aquabolt-XL HBM2-PIM, LPDDR5-PIM with In-Memory Processing,

and AXDIMM with Acceleration Buffer,” IEEE Micro, vol. 42, no. 3,

pp. 20–30, 2022.

M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi,

and T. N. Vijaykumar, “Newton: A DRAM-maker’s Accelerator-in-[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

Memory (AiM) Architecture for Machine Learning,” in Proceedings

of the International Symposium on Microarchitecture (MICRO), 2020.

S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park,

K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin,

J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho,

I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and

J. Cho, “A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-

Memory Supporting 1TFLOPS MAC Operation and Various Activation

Functions for Deep-Learning Applications,” in Proceedings of the

International Solid State Circuits Conference (ISSCC), 2022.

M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis,

and X. Qian, “GraphP: Reducing Communication for PIM-based

Graph Processing with Efficient Data Partition,” in Proceedings of the

International Symposium on High-Performance Computer Architecture

(HPCA), 2018.

G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie,

and H. Yang, “GraphH: A Processing-in-Memory Architecture for

Large-Scale Graph Processing,” IEEE Transactions on Computer-Aided

Design of Integrated Circuits and Systems, vol. 38, no. 4, pp. 640–653,

2019.

Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, and X. Qian,

“GraphQ: Scalable PIM-Based Graph Processing,” in Proceedings of

the International Symposium on Microarchitecture (MICRO), 2019.

P. Yao, L. Zheng, Y. Huang, Q. Wang, C. Gui, Z. Zeng, X. Liao, H. Jin,

and J. Xue, “ScalaGraph: A Scalable Accelerator for Massively Parallel

Graph Processing,” in Proceedings of the International Symposium on

High-Performance Computer Architecture (HPCA), 2022.

J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable

Processing-in-Memory Accelerator for Parallel Graph Processing,” in

Proceedings of the International Symposium on Computer Architecture

(ISCA), 2015.

M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS:

Scalable and Efficient Neural Network Acceleration with 3D Memory,”

in Proceedings of the International Conference on Architectural Support

for Programming Languages and Operating Systems (ASPLOS), 2017.

D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay,

“Neurocube: A Programmable Digital Neuromorphic Architecture

with High-Density 3D Memory,” in Proceedings of the International

Symposium on Computer Architecture (ISCA), 2016.

Y. Kwon, Y. Lee, and M. Rhu, “TensorDIMM: A Practical Near-

Memory Processing Architecture for Embeddings and Tensor Opera-

tions in Deep Learning,” in Proceedings of the International Symposium

on Microarchitecture (MICRO), 2019.

L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril,

A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee, M. Li,

B. Maher, D. Mudigere, M. Naumov, M. Schatz, M. Smelyanskiy,

X. Wang, B. Reagen, C.-J. Wu, M. Hempstead, and X. Zhang,

“RecNMP: Accelerating Personalized Recommendation with Near-

Memory Processing,” in Proceedings of the International Symposium

on Computer Architecture (ISCA), 2020.

J. Park, B. Kim, S. Yun, E. Lee, M. Rhu, and J. H. Ahn, “TRiM:

Enhancing Processor-Memory Interfaces with Scalable Tensor Reduc-

tion in Memory,” in Proceedings of the International Symposium on

Microarchitecture (MICRO), 2021.

B. Asgari, R. Hadidi, J. Cao, D. E. Shim, S.-K. Lim, and H. Kim,

“FAFNIR: Accelerating Sparse Gathering by Using Efficient Near-

Memory Intelligent Reduction,” in Proceedings of the International

Symposium on High-Performance Computer Architecture (HPCA),

2021.

K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand,

S. Ghose, and O. Mutlu, “Accelerating Pointer Chasing in 3D-Stacked

Memory: Challenges, Mechanisms, Evaluation,” in Proceedings of the

International Conference on Computer Design (ICCD), 2016.

A. Boroumand, S. Ghose, G. F. Oliveira, and O. Mutlu, “Polynesia:

Enabling High-Performance and Energy-Efficient Hybrid Transac-

tional/Analytical Databases with Hardware/Software Co-Design,” in

Proceedings of the International Conference on Data Engineering

(ICDE), 2022.

Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, “DrAcc: A

DRAM Based Accelerator for Accurate CNN Inference,” in Design

Automation Conference (DAC), 2018.

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.

Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convo-

lutional Neural Network Accelerator with In-Situ Analog Arithmetic

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[54]

[55]

[56]

[57]

[58]

[59]

in Crossbars,” in Proceedings of the International Symposium on

Computer Architecture (ISCA), 2016.

G. Dai, Z. Zhu, T. Fu, C. Wei, B. Wang, X. Li, Y. Xie, H. Yang, and

Y. Wang, “DIMMining: Pruning-Efficient and Parallel Graph Mining

on Near-Memory-Computing,” in Proceedings of the International

Symposium on Computer Architecture (ISCA), 2022.

X. Xie, Z. Liang, P. Gu, A. Basak, L. Deng, L. Liang, X. Hu, and

Y. Xie, “SpaceA: Sparse Matrix Vector Multiplication on Processing-in-

Memory Accelerator,” in Proceedings of the International Symposium

on High-Performance Computer Architecture (HPCA), 2021.

M. Lenjani, A. Ahmed, M. Stan, and K. Skadron, “Gearbox: A Case for

Supporting Accumulation Dispatching and Hybrid Partitioning in PIM-

based Accelerators,” in Proceedings of the International Symposium

on Computer Architecture (ISCA), 2022.

I. Fernandez, R. Quislant, E. Gutiérrez, O. Plata, C. Giannoula,

M. Alser, J. Gómez-Luna, and O. Mutlu, “NATSA: A Near-Data

Processing Accelerator for Time Series Analysis,” in Proceedings of

the International Conference on Computer Design (ICCD), 2020.

S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie,

“DRISA: A DRAM-based Reconfigurable In-Situ Accelerator,” in

Proceedings of the International Symposium on Microarchitecture

(MICRO), 2017.

H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, “McDRAM:

Low Latency and Energy-Efficient Matrix Computations in DRAM,”

IEEE Transactions on Computer-Aided Design of Integrated Circuits

and Systems, vol. 37, no. 11, pp. 2613–2622, 2018.

S. Cho, H. Choi, E. Park, H. Shin, and S. Yoo, “McDRAM v2:

In-Dynamic Random Access Memory Systolic Array Accelerator to

Address the Large Model,” IEEE Access, vol. 8, pp. 135223–135243,

2020.

W. Huangfu, X. Li, S. Li, X. Hu, P. Gu, and Y. Xie, “MEDAL:

Scalable DIMM Based Near Data Processing Accelerator for DNA

Seeding Algorithm,” in Proceedings of the International Symposium

on Microarchitecture (MICRO), 2019.

UPMEM, 2022.

C. Lattner and V. Adve, “LLVM: A Compilation Framework for

Lifelong Program Analysis and Transformation,” in International

Symposium on Code Generation and Optimization (CGO), 2004.

UPMEM, “LLVM-based Compiler Stack Developed by UPMEM,”

2021.

J. Gómez-Luna, I. E. Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira,

and O. Mutlu, “Benchmarking a New Paradigm: An Experimental

Analysis of a Real Processing-in-Memory Architecture,” in arxiv.org,

2021.

J. Nider, C. Mustard, A. Zoltan, J. Ramsden, L. Liu, J. Grossbard,

M. Dashti, R. Jodin, A. Ghiti, J. Chauzi, and A. Fedorova, “A Case

Study of Processing-in-Memory in Off-the-Shelf Systems,” in USENIX

Annual Technical Conference (ATC), 2021.

C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y. Kim,

“Design and Analysis of a Processing-in-DIMM Join Algorithm: A

Case Study with UPMEM DIMMs,” in Proceedings of the International

Conference on Management of Data (SIGMOD), 2023.

L.-C. Chen, S.-Q. Yu, C.-C. Ho, Y.-H. Chang, D.-W. Chang, W.-

C. Wang, and Y.-M. Chang, “RNA-seq Quantification on Processing

in memory Architecture: Observation and Characterization NVM,”

in Proceedings of the IEEE Non-Volatile Memory Systems and

Applications Symposium (NVMSA), 2022.

J. Nider, J. Dagger, N. Gharavi, D. Ng, and A. Fedorova, “Bulk JPEG

Decoding on In-Memory Processors,” in Proceedings of the 15th ACM

International Conference on Systems and Storage (SYSTOR), 2022.

Samsung, “8Gb C-die DDR4 SDRAM x16,” 2017.

UPMEM, “Instruction Set Architecture,” 2021.

F. Devaux, “The True Processing In Memory Accelerator,” in Hot

Chips: A Symposium on High Performance Chips, 2019.

NVIDIA, “CUDA, release: 10.2.89,” 2020.

N. Sakharnykh, “Everything You Need to Know about Unified Memory,”

in NVIDIA GPU Technology Conference (GTC), 2018.

UPMEM, “UPMEM SDK,” 2021.

K. C. Knowlton, “A Fast Storage Allocator,” Communications of the

ACM, vol. 8, no. 10, pp. 623–624, 1965.

S. Ghemawat and P. Menage, “Tcmalloc: Thread-caching Malloc,”

2009.

T. Parr, The Definitive ANTLR 4 Reference. Pragmatic Bookshelf,

2012.[60] T. Parr, “ANTLR,” 2022.

[61] T. Parr, “ANTLR 4,” 2022.

[62] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers:

Principles, Techniques, and Tools (2nd Edition). Addison-Wesley

Longman Publishing Co., Inc., 2006.

[63] UPMEM, “UPMEM DPU ABI,” 2021.

[64] UPMEM, “UPMEM Processing-In-Memory (PIM) Ultra-Efficient

Acceleration for Data-Intensive Applications (White Paper),” 2022.

[65] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A Cycle

Accurate Memory System Simulator,” IEEE Computer Architecture

Letters, vol. 10, no. 1, pp. 16–19, 2011.

[66] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible

DRAM Simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1,

pp. 45–49, 2015.

[67] “USIMM: The Utah SImulated Memory Module,” 2012.

[68] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt,

“Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in

Proceedings of the International Symposium on Performance Analysis

of Systems Software (ISPASS), 2009.

[69] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, “Memory

Access Scheduling,” in Proceedings of the International Symposium

on Computer Architecture (ISCA), 2000.

[70] Intel, “Intel® 64 and IA-32 Architectures Software Developer’s

Manual”,” 2023.

[71] “PrIM (Processing-In-Memory Benchmarks),” 2021.

[72] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, and R. Balasub-

ramonia, “Managing DRAM Latency Divergence in Irregular GPGPU

Applications,” in Proceedings of the International Conference on High

Performance Computing, Networking, Storage and Analysis (SC), 2014.

[73] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, “Dynamic

Warp Formation and Scheduling for Efficient GPU Control Flow,”

in Proceedings of the International Symposium on Microarchitecture

(MICRO), 2007.

[74] NVIDIA, “NVIDIA Tesla V100 GPU Architecture (White Paper):

“Independent Thread Scheduling”,” 2017.

[75] ARM, “Arm DynamlQ Shared Unit-AE,” 2018.

[76] Intel, “Introduction to Memory Bandwidth Allocation,” 2019.

[77] NVIDIA, “Multi-Instance GPU User Guide,” 2020.

[78] Linux, “Cgroups,” 2004.

[79] K. Andi, “A NUMA API for Linux,” 2004.

[80] J. Power, M. D. Hill, and D. A. Wood, “Supporting x86-64 Address

Translation for 100s of GPU Lanes,” in Proceedings of the International

Symposium on High-Performance Computer Architecture (HPCA),

2014.

[81] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural Support for

Address Translation on GPUs: Designing Memory Management Units

for CPU/GPUs with Unified Address Spaces,” in Proceedings of the

International Conference on Architectural Support for Programming

Languages and Operating Systems (ASPLOS), 2014.

[82] B. Hyun, Y. Kwon, Y. Choi, J. Kim, and M. Rhu, “NeuMMU:

Architectural Support for Efficient Address Translations in Neural

Processing Units,” in Proceedings of the International Conference

on Architectural Support for Programming Languages and Operating

Systems (ASPLOS), 2020.

[83] H. S. Stone, “A Logic-in-Memory Computer,” IEEE Transactions on

Computers, vol. C-19, no. 1, pp. 73–78, 1970.

[84] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A Chiplet-Based,

Hybrid Sparse-Dense Accelerator for Personalized Recommendations,”

in Proceedings of the International Symposium on Computer Architec-

ture (ISCA), 2020.

[85] B. Kim, J. Park, E. Lee, M. Rhu, and J. H. Ahn, “TRiM: Tensor

Reduction in Memory,” IEEE Computer Architecture Letters, vol. 20,

no. 1, pp. 5–8, 2021.

[86] H. Kim, H. Park, T. Kim, K. Cho, E. Lee, S. Ryu, H.-J. Lee, K. Choi,

and J. Lee, “GradPIM: A Practical Processing-in-DRAM Architecture

for Gradient Descent,” in Proceedings of the International Symposium

on High-Performance Computer Architecture (HPCA), 2021.

[87] A. Augusta and S. Idreos, “JAFAR: Near-Data Processing for

Databases,” in Proceedings of the International Conference on Man-

agement of Data (SIGMOD), 2015.

[88] S. Angizi, Z. He, A. S. Rakin, and D. Fan, “CMP-PIM: An Energy-

Efficient Comparator-based Processing-In-Memory Neural Network

Accelerator,” in Design Automation Conference (DAC), 2018.

[89] B. Y. Cho, J. Jung, and M. Erez, “Accelerating Bandwidth-Bound Deep

Learning Inference with Main-Memory Accelerators,” in Proceedings

of the International Conference on High Performance Computing,

Networking, Storage and Analysis (SC), 2021.

[90] S. Kang, S. Lee, B. Kim, H. Kim, K. Sohn, N. S. Kim, and E. Lee,

“An FPGA-Based RNN-T Inference Accelerator with PIM-HBM,” in

Proceedings of the International Symposium on Field-Programmable

Gate Arrays (FPGA), 2022.

[91] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, and N. S. Kim,

“Chameleon: Versatile and Practical Near-DRAM Acceleration Archi-

tecture for Large Memory Systems,” in Proceedings of the International

Symposium on Microarchitecture (MICRO), 2016.

[92] H. Asghari-Moghaddam, A. Farmahini-Farahani, K. Morrow, J. H.

Ahn, and N. S. Kim, “Near-DRAM Acceleration with Single-ISA

Heterogeneous Processing in Standard Memory Modules,” IEEE Micro,

vol. 36, no. 1, pp. 24–34, 2016.

[93] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA:

Near-DRAM Acceleration Architecture Leveraging Commodity DRAM

Devices and Standard Memory Modules,” in Proceedings of the

International Symposium on High-Performance Computer Architecture

(HPCA), 2015.

[94] M. Alian, S. W. Min, H. Asgharimoghaddam, A. Dhar, D. K.

Wang, T. Roewer, A. McPadden, O. O’Halloran, D. Chen, J. Xiong,

D. Kim, W.-m. Hwu, and N. S. Kim, “Application-Transparent Near-

Memory Processing Architecture with Memory Channel Network,”

in Proceedings of the International Symposium on Microarchitecture

(MICRO), 2018.

[95] M. Alian and N. S. Kim, “NetDIMM: Low-Latency Near-Memory

Network Interface Architecture,” in Proceedings of the International

Symposium on Microarchitecture (MICRO), 2019.

[96] Y. Kwon, Y. Lee, and M. Rhu, “Tensor Casting: Co-Designing

Algorithm-Architecture for Personalized Recommendation Training,”

in Proceedings of the International Symposium on High-Performance

Computer Architecture (HPCA), 2021.

[97] Y. Lee, J. Chung, and M. Rhu, “SmartSAGE: Training Large-scale

Graph Neural Networks using In-Storage Processing Architectures,” in

Proceedings of the International Symposium on Computer Architecture

(ISCA), 2022.

[98] R. Hadidi, L. Nai, H. Kim, and H. Kim, “CAIRO: A Compiler-Assisted

Technique for Enabling Instruction-Level Offloading of Processing-In-

Memory,” ACM Transactions on Architecture and Code Optimization

(TACO), vol. 14, no. 4, pp. 1–25, 2017.

[99] B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, “Near Data Acceleration

with Concurrent Host Access,” in Proceedings of the International

Symposium on Computer Architecture (ISCA), 2020.

[100] C. Giannoula, N. Vijaykumar, N. Papadopoulou, V. Karakostas,

I. Fernandez, J. Gómez-Luna, L. Orosa, N. Koziris, G. Goumas, and

O. Mutlu, “SynCron: Efficient Synchronization Support for Near-

Data-Processing Architectures,” in Proceedings of the International

Symposium on High-Performance Computer Architecture (HPCA),

2021.

[101] S. Chen, Y. Jiang, C. Delimitrou, and J. F. Martı́nez, “PIMCloud:

QoS-Aware Resource Management of Latency-Critical Applications

in Clouds with Processing-in-Memory,” in Proceedings of the In-

ternational Symposium on High-Performance Computer Architecture

(HPCA), 2022.

[102] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim,

M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit:

In-Memory Accelerator for Bulk Bitwise Operations Using Commodity

DRAM Technology,” in Proceedings of the International Symposium

on Microarchitecture (MICRO), 2017.

[103] J. S. Kim, M. Patel, H. Hassan, L. Orosa, and O. Mutlu, “D-RaNGe:

Using Commodity DRAM Devices to Generate True Random Numbers

with Low Latency and High Throughput,” in Proceedings of the

International Symposium on High-Performance Computer Architecture

(HPCA), 2019.

[104] A. Olgun, M. Patel, A. G. Yağlıkçı, H. Luo, J. S. Kim, F. Nisa Bostancı,

N. Vijaykumar, O. Ergin, and O. Mutlu, “QUAC-TRNG: High-

Throughput True Random Number Generation Using Quadruple Row

Activation in Commodity DRAM Chips,” in Proceedings of the

International Symposium on Computer Architecture (ISCA), 2021.

[105] N. Hajinazar, G. F. Oliveira, S. Gregorio, J. a. D. Ferreira, N. M.

Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu,

“SIMDRAM: A Framework for Bit-Serial SIMD Processing Using

14[106]

[107]

[108]

[109]

[110]

[111]

DRAM,” in Proceedings of the International Conference on Architec-

tural Support for Programming Languages and Operating Systems

(ASPLOS), 2021.

X. Xin, Y. Zhang, and J. Yang, “ELP2IM: Efficient and Low Power

Bitwise Operation Processing in DRAM,” in Proceedings of the

International Symposium on High-Performance Computer Architecture

(HPCA), 2020.

V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhi-

menko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.

Mowry, “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data

Copy and Initialization,” in Proceedings of the International Symposium

on Microarchitecture (MICRO), 2013.

K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and

O. Mutlu, “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast

Inter-Subarray Data Movement in DRAM,” in Proceedings of the

International Symposium on High-Performance Computer Architecture

(HPCA), 2016.

J. Gómez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira,

and O. Mutlu, “Benchmarking Memory-Centric Computing Systems:

Analysis of Real Processing-in-Memory Hardware,” in International

Green and Sustainable Computing Conference (IGSC), 2021.

C. Giannoula, I. Fernandez, J. Gómez-Luna, N. Koziris, G. Goumas,

and O. Mutlu, “SparseP: Towards Efficient Sparse Matrix Vector Mul-

tiplication on Real Processing-In-Memory Architectures,” Proceedings

of the ACM on Measurement and Analysis of Computing Systems,

vol. 6, no. 1, pp. 1–49, 2022.

G. F. Oliveira, J. Gómez-Luna, S. Ghose, A. Boroumand, and O. Mutlu,

“Accelerating Neural Network Inference with Processing-in-DRAM:

From the Edge to the Cloud,” IEEE Micro, vol. 42, no. 6, pp. 25–38,

2022.

[112] J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F.

Oliveira, G. Singh, and O. Mutlu, “An Experimental Evaluation of

Machine Learning Training on a Real Processing-in-Memory System,”

in arxiv.org, 2022.

[113] D. Lavenier, R. Cimadomo, and R. Jodin, “Variant Calling Paralleliza-

tion on Processor-in-Memory Architecture,” in IEEE International

Conference on Bioinformatics and Biomedicine (BIBM), 2020.

[114] D. Lavenier, J.-F. Roy, and D. Furodet, “DNA Mapping Using

Processor-in-Memory Architecture,” in IEEE International Conference

on Bioinformatics and Biomedicine (BIBM), 2016.

[115] S. Xu, X. Chen, Y. Wang, Y. Han, X. Qian, and X. Li, “PIMSim:

A Flexible and Detailed Processing-in-Memory Simulator,” IEEE

Computer Architecture Letters, vol. 18, no. 1, pp. 6–9, 2018.

[116] “ZSim+Ramulator - A Processing-in-Memory Simulation Framework,”

2019.

[117] G. Singh, J. Gómez-Luna, G. Mariani, G. F. Oliveira, S. Corda, S. Stu-

ijk, O. Mutlu, and H. Corporaal, “NAPEL: Near-Memory Computing

Application Performance Prediction via Ensemble Learning,” in Design

Automation Conference (DAC), 2019.

[118] D. Sanchez and C. Kozyrakis, “ZSim: Fast and Accurate Microarchi-

tectural Simulation of Thousand-Core Systems,” in Proceedings of the

International Symposium on Computer Architecture (ISCA), 2013.

[119] X. Xie, P. Gu, J. Huang, Y. Ding, and Y. Xie, “MPU-Sim: A Simulator

for In-DRAM Near-Bank Processing Architectures,” IEEE Computer

Architecture Letters, vol. 21, no. 1, pp. 1–4, 2021.

[120] C. Yu, S. Liu, and S. Khan, “MultiPIM: A Detailed and Configurable

Multi-Stack Processing-In-Memory Simulator,” IEEE Computer Archi-

tecture Letters, vol. 20, no. 1, pp. 54–57, 2021.