Summary: DISTAL: Distributed Tensor Algebra Compiler (arxiv.org)
13,481-word PDF document
One Line
DISTAL is a high-performance distributed tensor algebra compiler that optimizes computation and data distribution in heterogeneous systems, generating competitive code for matrix multiplication and achieving near-peak utilization on GPUs.
Key Points
- DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems.
- It allows users to describe tensor and computation mapping independently.
- DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
- The compiler generates code for GPUs and supports scheduling transformations over loop variables.
- DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems.
Summaries
297 word summary
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. It decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. DISTAL generates optimized code for GPUs and manages non-uniform memory access costs. It includes a data distribution language, a compiler for tensor computations, and an extension of the TACO runtime system.

DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions include modeling modern machines, using dimension variables, tensor distribution, and data distribution. The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It achieves a speedup over existing systems on dense matrix-matrix multiplication.

The compiler enables efficient communication and optimization in tensor operations, supporting hierarchical machine models, data distribution, distributed reductions, and the owner-computes paradigm for distributed computations. It models a distributed machine as a grid of processors and helps with tensor distribution on distributed systems. It introduces the concepts of aggregation and rotation to improve performance and supports various algorithms for rectangular matrices with tiled distribution.

DISTAL is a high-performance system that optimizes tensor operations by implementing a tensor distribution notation and supporting various scheduling operations. It focuses on dense tensor algebra for heterogeneous machines, allowing users to create distributed implementations of tensor computations with various formats. DISTAL outperforms other systems and generates competitive code for matrix multiplication.
It incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. DISTAL aims to generate sophisticated distributed algorithms without extensive user input. It performs well on both CPUs and GPUs, achieving near-peak utilization on GPUs by keeping data in GPU framebuffer memory and communicating via NVLink.
584 word summary
DISTAL is a compiler that focuses on dense tensor algebra for heterogeneous machines. It allows users to create distributed implementations of tensor computations with various formats. DISTAL outperforms other systems and generates competitive code for matrix multiplication. It incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. DISTAL aims to generate sophisticated distributed algorithms without extensive user input. Compared to the Cyclops Tensor Framework (CTF), DISTAL offers similar generality, though some shortcomings remain to be addressed.

DISTAL performs well on both CPUs and GPUs, achieving near-peak utilization on GPUs. It keeps data in GPU framebuffer memory and communicates via NVLink. DISTAL optimizes tensor operations by implementing a tensor distribution notation and supporting various scheduling operations. It achieves high performance on different kernels, utilizing GPUs and CPUs for matrix multiplication tasks. The compiler optimizes task placement and partitioning through a mapper and a bounds analysis procedure.
DISTAL introduces the concepts of aggregation and rotation to improve performance and supports various algorithms for rectangular matrices with tiled distribution. It utilizes a 2D grid and scheduling techniques similar to SUMMA, Cannon's, and Johnson's algorithms. Examples, pseudocode, and visualizations are provided to illustrate these concepts.
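To make the rotation-based schedules concrete, here is a minimal single-process simulation of Cannon's algorithm on a virtual p x p processor grid. This is an illustrative sketch, not DISTAL code: the tiling helpers and the square, evenly divisible matrices are assumptions.

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Multiply A @ B by simulating Cannon's algorithm on a p x p grid."""
    n = A.shape[0]
    t = n // p  # tile size; assumes n is divisible by p
    tile = lambda M, i, j: M[i*t:(i+1)*t, j*t:(j+1)*t]
    # Initial skew: processor (i, j) holds A(i, (i+j) mod p), B((i+j) mod p, j).
    a = [[tile(A, i, (j + i) % p).copy() for j in range(p)] for i in range(p)]
    b = [[tile(B, (i + j) % p, j).copy() for j in range(p)] for i in range(p)]
    C = np.zeros_like(A)
    for _ in range(p):
        # Every processor multiplies its current tiles and accumulates.
        for i in range(p):
            for j in range(p):
                C[i*t:(i+1)*t, j*t:(j+1)*t] += a[i][j] @ b[i][j]
        # Rotate: A tiles shift left along rows, B tiles shift up along columns.
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return C
```

After p rotation steps every processor has seen each reduction tile exactly once, so the result matches a plain matrix multiplication.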
DISTAL models a distributed machine as a grid of processors, allowing users to express a virtual machine organization and expose locality in the model. It helps with tensor distribution on distributed systems by allowing users to map tensors onto machines using a tensor distribution notation statement. The tool supports replication of tensor tiles and communication between different parts of the machine hierarchy. Iteration spaces, distribute operation, and execution spaces are discussed to explain the mapping of computations. Examples of tensor algebra expressions and optimal computation methods are provided.
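The mapping function at the heart of tensor distribution notation can be sketched as follows. This is an illustrative blocked distribution in the spirit of a statement like A(x, y) -> Machine(x, y), not DISTAL's actual syntax; the function name and tiling scheme are assumptions.

```python
import math

def owner(coord, tensor_dims, machine_dims):
    """Return the machine-grid coordinate that owns a tensor coordinate.

    Each tensor dimension i is split into machine_dims[i] contiguous
    blocks, so processor (pi, pj, ...) holds one tile of the tensor."""
    proc = []
    for c, n, p in zip(coord, tensor_dims, machine_dims):
        block = math.ceil(n / p)  # tile extent along this dimension
        proc.append(c // block)
    return tuple(proc)

# A 6x6 matrix distributed over a 2x3 processor grid:
# element (4, 5) lives on the bottom-right processor (1, 2).
print(owner((4, 5), (6, 6), (2, 3)))
```

Broadcast or replicated dimensions would map one tensor tile to many machine coordinates instead of exactly one, which this sketch omits.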
The DISTAL Compiler enables efficient communication and optimization in tensor operations, supporting hierarchical machine models, data distribution, distributed reductions, and the owner-computes paradigm for distributed computations.
In summary, DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions include modeling modern machines, using dimension variables, tensor distribution, and data distribution. The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It achieves a speedup over existing systems on dense matrix-matrix multiplication. The document includes an overview of DISTAL's implementation and contributions. The DISTAL compiler generates optimized code for GPUs, facilitating communication between processors and splitting the k loop into chunks. It distributes the i and j tiles across all GPUs and maps the computation onto the target machine. DISTAL includes a data distribution language, a compiler for tensor computations, and an extension of the TACO runtime system. It uses Legion programs to interface with a mapper for data and computation placement, allowing users to specialize computation for target machines and optimize data movement.
DISTAL decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. It provides abstractions for defining data and computation distribution, enabling independent optimization and adaptability to the machine through loop transformation-based scheduling. DISTAL manages non-uniform memory access costs within a single compute node, targeting supercomputers.
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. Users can independently describe tensor and computation mapping. The generated code is competitive with optimized codes for matrix multiplication on multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
1148 word summary
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. It allows users to describe tensor and computation mapping independently. The generated code is competitive with optimized codes for matrix multiplication on multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
DISTAL decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. It provides abstractions for defining data and computation distribution, allowing for independent optimization. It can adapt data or computation distribution to the machine through loop transformation-based scheduling. DISTAL manages non-uniform memory access costs between multiple GPUs and CPU sockets within a single compute node. The output of DISTAL targets supercomputers.
The compiler generates code for GPUs, optimizing and scheduling variables. Communication occurs between processors, and the k loop is split into chunks. The i and j tiles are distributed over all GPUs, and the computation is mapped onto the target machine. The specific contributions of this work include a data distribution language, a compiler for tensor computations, and an implementation of DISTAL that extends the TACO runtime system. DISTAL uses Legion programs to interface with a mapper for data and computation placement. It allows users to specialize computation to target machines and offers optimization for data movement.
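The k-loop chunking described above can be sketched in plain Python as a sequential stand-in for the distributed schedule. Each chunk corresponds to one round of communication followed by a local multiply; the chunk size and function name are illustrative assumptions.

```python
import numpy as np

def chunked_matmul(A, B, chunk):
    """Compute A @ B with the reduction (k) loop split into chunks."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    # Summing the partial products over all chunks restores the full result,
    # which is why the split preserves correctness in the distributed setting.
    for k0 in range(0, k, chunk):
        C += A[:, k0:k0+chunk] @ B[k0:k0+chunk, :]
    return C
```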
The DISTAL compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It can generate a fused kernel for the entire tensor index notation. Computation is described using tensor index notation, with background information on each component of DISTAL provided. DISTAL achieves a speedup over existing systems on dense matrix-matrix multiplication. The document includes an overview of DISTAL's implementation and contributions.
In summary, DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions of DISTAL include modeling modern machines and using dimension variables, tensor distribution, and data distribution. These abstractions allow users to map data and computation onto distributed machines.

DISTAL models a distributed machine as a grid of processors, allowing users to express a virtual machine organization and expose locality in the model. It helps with tensor distribution on distributed systems by allowing users to map tensors onto machines using a tensor distribution notation statement. The tool supports replication of tensor tiles and communication between different parts of the machine hierarchy.

The document discusses the concept of iteration spaces and their mapping onto execution spaces, introduces the distribute operation to transform iterations, and explains execution spaces that model the computation process. It provides examples of tensor algebra expressions and optimal computation methods. The document also explains tensor distribution, including hierarchical data distributions, and provides examples of tensor distributions and their mappings.

The DISTAL compiler enables efficient communication and optimization in tensor operations, supports hierarchical machine models and data distribution, and provides commands for distributing variables and dimensions onto processors in the machine. It supports distributed reductions and the owner-computes paradigm for distributed computations, optimizes communication and memory usage in tensor operations, and introduces the concepts of aggregation and rotation to improve performance.
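The distribute operation over an iteration space can be sketched as below, under the assumption of a simple blocked mapping; the function name and tiling scheme are illustrative, not DISTAL's API.

```python
def distribute(I, J, pi, pj):
    """Map each point (i, j) of an I x J iteration space onto a pi x pj
    processor grid; returns {processor coordinate: [iteration points]}."""
    ti, tj = -(-I // pi), -(-J // pj)  # ceiling division gives tile extents
    placement = {}
    for i in range(I):
        for j in range(J):
            # Each processor executes one contiguous tile of iterations.
            placement.setdefault((i // ti, j // tj), []).append((i, j))
    return placement
```

For a 4 x 4 space over a 2 x 2 grid, each of the four processors receives a 2 x 2 tile of iterations, which is the kind of assignment an execution space then orders in time.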
The compiler supports various algorithms and can handle rectangular matrices with tiled distribution. It utilizes a 2D grid and scheduling techniques similar to SUMMA, Cannon's, and Johnson's algorithms. The document provides examples, pseudocode, and visualizations to illustrate these concepts. Solomonik's, PUMMA, Cannon's, and Johnson's algorithms are discussed in detail, with communication patterns and pseudocode provided for Cannon's and Johnson's algorithms.

DISTAL implements a tensor distribution notation and supports various scheduling operations. It uses GPUs and CPUs for matrix multiplication tasks and is compared against systems and algorithms such as COSMA, ScaLAPACK, CTF, and SUMMA. DISTAL achieves high absolute performance on different kernels, with experiments conducted on the Lassen supercomputer using NVIDIA Volta V100 GPUs and IBM Power9 CPUs. The system optimizes task placement and partitioning through a mapper and a bounds analysis procedure.

On matrix multiplication, DISTAL performs well on both CPUs and GPUs. On CPUs, it performs equally to or better than other systems, especially with 4 cores per node. On GPUs, DISTAL's performance is competitive, with all of its kernels achieving twice the performance of COSMA. The system keeps data in GPU framebuffer memory and communicates via NVLink, achieving near-peak utilization.

DISTAL is compared to the Cyclops Tensor Framework (CTF) in terms of generality and is found to offer similar capabilities, with some shortcomings to be addressed in the future. In some configurations, DISTAL's kernels perform worse than COSMA and show larger performance variations due to communication costs. Solomonik's 2.5D algorithm outperforms Johnson's algorithm on non-square node counts, and COSMA achieves high performance when the number of GPUs is a perfect cube but performs worse for non-cubes.
Cannon's algorithm performs better than SUMMA and PUMMA in terms of communication in the rectangular case. Overall, COSMA's performance varies depending on node counts, with 2D algorithms performing well at square node counts and achieving peak GFLOP/s on a single GPU.

DISTAL focuses on separating data and computation distributions. It utilizes the same strategy as CTF for single-node computation, and scheduling leaf kernels is important for achieving peak utilization. It outperforms CTF by generating a bespoke implementation for a target kernel rather than reshaping tensors. Related work has extended relational database engines to support distributed algebra algorithms; DISTAL instead implements specialized algorithms for tensor algebra operations. The GPU implementation of the inner-product kernel in DISTAL has similar performance characteristics to the CPU implementation, and DISTAL outperforms CTF in terms of bandwidth and execution time.

DISTAL allows for the expression of various distributed algorithms and tensor distributions, and it combines data distribution descriptions with computation scheduling. Unlike other systems, DISTAL aims to generate sophisticated distributed algorithms without requiring extensive user input. It is a compiler for dense tensor algebra that targets modern, heterogeneous machines, allowing for the independent specification of computation and data distribution so that users can create distributed implementations of any desired tensor computation with any set of tensor formats. DISTAL outperforms existing systems and generates competitive code for matrix multiplication.

Related work covers tensor algebra compilation and distributed numerical and machine learning computations, including tensor contractions on GPUs, automatic generation of communication code for distributed programs, and optimized parallel recursive rectangular matrix multiplication.
DISTAL incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. The paper cites various research papers and conference proceedings related to parallel computation, high-performance computing, and code generation for distributed memory machines. These references provide valuable insights and techniques in the field of distributed tensor algebra and high-performance computing.
2709 word summary
In this document, several references are cited that relate to distributed tensor algebra compilers and high-performance computing. The references include papers on a high-performance graph DSL, framework-agnostic high-performance machine learning, tensor comprehensions, domain-specific languages and high-level frameworks, massively parallel tensor contraction, communication-optimal sparse tensor algebra, distributed tensor computations, the Halide language and compiler, scalable linear algebra on a relational database system, High Performance Fortran, red-blue pebbling and its applications, and tensor decompositions. These references provide valuable insights and techniques in the field of distributed tensor algebra and high-performance computing.

DISTAL focuses on tensor algebra compilation and distributed numerical and machine learning computations. It includes tensor contractions on GPUs and generates communication code automatically for distributed programs. Related work addresses optimized parallel recursive rectangular matrix multiplication, parallel universal matrix multiplication algorithms (PUMMA), and scalable linear algebra libraries for distributed memory concurrent computers such as ScaLAPACK, as well as TVM, an optimizing compiler for deep learning. The paper cites various research papers and conference proceedings related to parallel computation, high-performance computing, and code generation for distributed memory machines. The work was supported by the Department of Energy and the National Nuclear Security Administration.
Rohan Yadav, Alex Aiken, and Fredrik Kjolstad have developed DISTAL, a compiler for dense tensor algebra that targets modern, heterogeneous machines. DISTAL allows for the independent specification of desired computation and data distribution, enabling users to create distributed implementations of any desired tensor computation with any set of tensor formats. It outperforms existing systems and generates code competitive with hand-optimized implementations of matrix multiplication.
Future work includes extending DISTAL with support for sparse tensors, exploring auto-scheduling and auto-formatting frameworks, and investigating its potential applications in training and evaluating distributed deep learning models. The static-dynamic approach of DISTAL allows for the expression of complex communication patterns and data distributions statically, while discharging lower level data movement operations to a runtime system. This design decision provides flexibility in expressing different algorithms and adaptability when integrating with existing codes.

DISTAL is a distributed tensor algebra compiler that focuses on separating data and computation distributions. It allows for the expression of various distributed algorithms and tensor distributions. DISTAL is the first system of its kind and differs from previous work by Bondhugula and Amarasinghe et al. It utilizes static analysis to determine communication partners and generates runtime calls to complement the distribution of processors. It starts from a higher level representation that enables the expression of communication information for computation distribution. DISTAL combines data distribution descriptions with computation scheduling and supports different distributed layouts.

Unlike DISTAL, other systems such as Tiramisu do not have a data distribution language or a rotate command. DISTAL aims to generate sophisticated distributed algorithms without requiring extensive user input. Other systems that support targeting distributed machines include Distributed Halide and Tiramisu, and DSL compilers have been developed for single-node linear algebra and distributed tensor computations. Related work extends relational database engines to support distributed algebra algorithms. Distributed algorithms for tensor algebra can improve upon the interpreted approach used by the Cyclops Tensor Framework (CTF).
The Legion team plans to address the performance shortcomings of their system when dealing with replicated regions and inter-node communication. DISTAL implements specialized algorithms for tensor algebra operations and achieves high efficiency on both CPUs and GPUs. The GPU implementation of the inner-product kernel in DISTAL has similar performance characteristics to the CPU implementation. DISTAL also outperforms CTF in terms of bandwidth and execution time, and the DISTAL schedule using matrix multiplications improves performance for the TTV, innerprod, and MTTKRP operations.

DISTAL utilizes the same strategy as CTF for single-node computation. Leaf kernel performance heavily impacts the overall performance of a distributed computation, so scheduling leaf kernels is important for achieving peak utilization on a single node. The same algorithms (such as MTTKRP) and distribution schedules were used for the CPU and GPU kernels, and input tensors were distributed in a row-major layout to minimize inter-node communication. CTF's GPU backend was not built, so only CPU results are reported for it; its performance was best at 40 ranks per node. Weak-scaling experiments showed that DISTAL outperforms CTF on each higher-order tensor expression.

DISTAL's scheduling primitives provide a mechanism for future work to target when automatically scheduling computations for distribution. While CTF fully automates the distribution process, DISTAL users must provide a schedule for their computations. This lets users develop a schedule that implements an optimal strategy for each kernel, and DISTAL outperforms CTF by generating a bespoke implementation for a target kernel rather than reshaping tensors. CTF decomposes arbitrary tensor operations into calls to distributed matrix multiplication; operations such as TTM and MTTKRP have real-world applications in tensor decompositions. The paper evaluates the generality of DISTAL over such tensor expressions.
Compared to the Cyclops Tensor Framework (CTF), DISTAL offers similar generality, and the Legion team plans to address some shortcomings in the future. DISTAL's kernels perform worse than COSMA in some configurations and experience larger performance variations due to communication costs. Solomonik's 2.5D algorithm achieves better performance than Johnson's algorithm on non-square node counts. The COSMA implementation achieves high performance when the number of GPUs is a perfect cube, but for non-cubes it achieves worse performance. Cannon's algorithm outperforms SUMMA and PUMMA in terms of communication in the rectangular case. Overall, the performance of COSMA varies depending on node counts; the 2D algorithms perform well at square node counts and achieve peak GFLOP/s on a single GPU.

DISTAL is compared to other systems in terms of performance on matrix multiplication tasks, with experiments conducted on both CPUs and GPUs. DISTAL's kernels keep data in GPU framebuffer memory and communicate via NVLink, achieving near-peak utilization. On CPUs, DISTAL performs equally to or better than other systems, with the best performance achieved with 4 cores per node. On GPUs, DISTAL's performance is also competitive, with all of its kernels achieving twice the performance of COSMA. The results show that DISTAL is a promising compiler for distributed tensor algebra computations.

DISTAL utilizes GPUs and CPUs to perform matrix multiplication and is compared against algorithms and systems including COSMA, ScaLAPACK, CTF, and SUMMA. The evaluation shows that DISTAL achieves high absolute performance on various kernels. The experiments were conducted on the Lassen supercomputer using NVIDIA Volta V100 GPUs and IBM Power9 CPUs, with code compiled with GCC 8.3.1 -O3 and CUDA 11.1. The system utilizes a mapper and a bounds analysis procedure to optimize task placement and partitioning.
Lowering to Legion uses GASNet-EX for inter-node communication. Legion handles the movement of data through specialized channels and manages the allocation of memory regions. Communication in Legion is implicit, and the desired data and memory placements are described through Legion's mapping interface. Tasks are the unit of computation in Legion, and regions are used to represent distributed data structures. GPU lowering follows a similar process to TACO's. Legion performs dynamic analysis for communication and supports features necessary for high performance on modern machines. The implementation involves steps such as constructing a concrete index notation statement and translating it into Legion's API.

The text also discusses the implementation of the tensor distribution notation and the scheduling operations supported in DISTAL: the placement of tensors into a distribution, the lowering of tensor distribution notation, and the communication and rotation operations. It describes the divide and distribute transformations, as well as the concrete index notation used in the compiler, and notes that DISTAL implements the distribution layer of COSMA.

Solomonik's algorithm interpolates between the 2D and 3D algorithms and operates on a processor cube; it uses a broadcast to communicate matrices and applies systolic patterns. The PUMMA algorithm generalizes 2D matrix multiplication to rectangular matrices and block-cyclic distributions. Johnson's algorithm distributes matrices to different faces of the processor cube and partitions them, targeting a 3D processor grid. Cannon's algorithm rotates each processor's iteration space, which changes the communication pattern. Figure 12 illustrates the communication pattern in Cannon's algorithm, and the pseudocode for Cannon's and Johnson's algorithms is shown in Figures 11 and 13, respectively.
The DISTAL compiler allows for efficient computation of tensor operations. It utilizes a schedule similar to SUMMA's algorithm, in which processors broadcast chunks of data to the other processors in their rows and columns; under Cannon's algorithm, processors instead shift tiles along their rows and columns. The communication pattern and target machine organization are organized as a 2D grid. The compute statement is scheduled using SUMMA, with each processor performing local matrix multiplications on chunks of data. A dimension of the iteration space is distributed, and communication is scheduled accordingly. A whole family of matrix-multiplication algorithms can be implemented using DISTAL's scheduling commands, and the document includes pseudocode examples for them.

DISTAL supports algorithms like SUMMA, Cannon's, and Johnson's. The algorithms are implemented using a 2D grid, and communication patterns are optimized to reduce communication. The compiler can handle rectangular matrices and supports tiled distribution of input matrices. The techniques used in DISTAL are not specific to matrix multiplication and can be applied to other tensor algebra operations as well.

DISTAL introduces the concept of rotate, which allows the same iteration of a loop to occur at different times on different processors. The execution space is rotated in time so that processors execute different iterations at the same point in time, avoiding contention for the same pieces of data. The rotate operation is used to optimize the communication pattern in distributed algorithms and can improve performance. The communicate command is not necessary for correctness but can be used to further optimize performance; the scheduling language used in DISTAL only affects performance, not correctness.
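The broadcast-based SUMMA schedule described above can be sketched as a single-process simulation. This is illustrative only: the square grid, evenly divisible matrices, and helper names are assumptions, and the "broadcast" is simulated by direct slicing.

```python
import numpy as np

def summa_matmul(A, B, p):
    """Multiply A @ B by simulating SUMMA on a p x p processor grid."""
    n = A.shape[0]
    t = n // p  # tile size; assumes n is divisible by p
    C = np.zeros_like(A)
    for k in range(p):
        # Step k: the k-th block column of A is broadcast along processor
        # rows, and the k-th block row of B along processor columns.
        Acol = A[:, k*t:(k+1)*t]
        Brow = B[k*t:(k+1)*t, :]
        # Every processor (i, j) then does a local multiply-accumulate.
        for i in range(p):
            for j in range(p):
                C[i*t:(i+1)*t, j*t:(j+1)*t] += (
                    Acol[i*t:(i+1)*t, :] @ Brow[:, j*t:(j+1)*t])
    return C
```

Contrast with Cannon's algorithm: SUMMA re-broadcasts fresh blocks each step, while Cannon's shifts already-held tiles systolically, which is the contention the rotate command is designed to express.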
The document provides examples and visualizations to illustrate the concepts of rotate and systolic communication patterns. The DISTAL Distributed Tensor Algebra Compiler allows for efficient communication and optimization in tensor operations. The communicate(T, i) command aggregates communication for each tensor into a single message, allowing for optimization of memory usage and tradeoff between memory usage and communication frequency. The choice of how much communication to aggregate affects the execution space and can be visualized in Figure 7b. Communication operations can be made more efficient by aggregating them into larger operations that fetch the data from memory. The communicate command is automatically inserted at each iteration space point where data needs to be communicated.
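The aggregation tradeoff behind communicate(T, i) can be illustrated with simple message arithmetic: aggregating the inner loop's transfers into one message per outer iteration sends far fewer, larger messages at the cost of buffering more data at once. The function below is a hypothetical model, not part of DISTAL.

```python
def message_counts(n_outer, n_inner):
    """Compare message counts with and without inner-loop aggregation."""
    per_element = n_outer * n_inner  # one small message per inner iteration
    per_iteration = n_outer          # one large message per outer iteration
    return per_element, per_iteration

# 64 outer iterations over a 1024-element inner loop:
# 65,536 small messages without aggregation vs. 64 large ones with it.
print(message_counts(64, 1024))
```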
The DISTAL compiler also supports hierarchical machine models and data distribution. The distribute command can be applied hierarchically to match machine models and distribute data into the output tensor, trading space usage for increased parallelism. The compiler also supports distributed reductions and owner-computes paradigm for distributed computations.
The document provides code examples and commands for distributing variables and dimensions onto processors in the machine. The reorder and divide commands are used to map iteration space dimensions onto each processor. The compound command demonstrates the use of distribute, divide, and reorder commands to tile the iteration space dimensions onto a machine.
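The compound divide + reorder tiling can be sketched as a plain loop-nest transformation: divide splits loops i and j into outer/inner pairs, and reorder hoists the outer loops, which distribute can then map onto a processor grid. The function and its loop structure are an illustrative sketch, not DISTAL's scheduling language.

```python
def tiled_iterations(I, J, ti, tj):
    """Visit an I x J iteration space in ti x tj tiles."""
    visited = []
    for io in range(0, I, ti):            # divide(i) -> (io, ii)
        for jo in range(0, J, tj):        # divide(j) -> (jo, ji), reordered out
            for ii in range(io, min(io + ti, I)):
                for ji in range(jo, min(jo + tj, J)):
                    visited.append((ii, ji))
    return visited

# The tiled nest covers exactly the same points as the untiled double
# loop, just in a different order -- which is why such scheduling
# transformations affect performance but not correctness.
assert sorted(tiled_iterations(5, 7, 2, 3)) == \
    [(i, j) for i in range(5) for j in range(7)]
```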
Overall, the DISTAL compiler enables efficient communication and optimization in tensor operations, supports hierarchical machine models and data distribution, and provides commands for distributing variables and dimensions onto processors in the machine.

The document explains the concept of iteration spaces and how they are mapped onto execution spaces. The distribute operation is introduced as a way to transform the execution of iterations, and execution spaces model the computation process. An example of a tensor algebra expression is provided, along with a discussion of the optimal way to compute it. The concept of tensor distribution is explained, including hierarchical data distributions, and examples of tensor distributions and their corresponding mappings are given.

DISTAL allows tensors to be mapped onto machines in different ways using a tensor distribution notation statement. The statement consists of a tensor, a mapping function, and a machine. The mapping function assigns coordinates of the tensor to processors in the machine; the machine can have multiple dimensions, and the mapping function can specify fixed dimensions, broadcasted dimensions, and colored dimensions. DISTAL also supports replication of tensor tiles and communication between different parts of the machine hierarchy.

In this way, users can express how tensors are distributed across machines: the tensor distribution notation describes the mapping of tensor dimensions to machine dimensions, including statements that partition tensor dimensions across machine dimensions and can fix the partition or broadcast it.
The syntax for tensor distribution notation is described in Figure 4.
DISTAL models a distributed machine as a multidimensional grid of abstract processors. Each processor has associated local memory and can communicate with all other processors. This grid abstraction allows users to express a virtual machine organization and expose locality in the model.
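The grid abstraction can be sketched as a mapping from linear processor ranks to coordinates of a user-chosen virtual grid, so the same set of processors can be viewed as 8, 2 x 4, or 2 x 2 x 2. This is an illustrative model, not DISTAL's machine API.

```python
import itertools
import math

def as_grid(n_procs, shape):
    """View n_procs processors as a virtual multidimensional grid,
    returning {grid coordinate: linear processor rank}."""
    assert math.prod(shape) == n_procs, "grid must cover all processors"
    return {coord: rank
            for rank, coord in enumerate(
                itertools.product(*map(range, shape)))}

# The same 8 processors under two virtual organizations:
print(as_grid(8, (2, 4))[(1, 3)])        # rank 7
print(as_grid(8, (2, 2, 2))[(1, 1, 1)])  # rank 7
```

Choosing the grid shape is how a user exposes locality: neighboring grid coordinates can be mapped to processors that share fast interconnect.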
The core abstractions of DISTAL include the modeling of modern machines and the use of dimension variables, tensor distribution, and data distribution. These abstractions allow users to map data and computation onto a distributed machine.
Overall, DISTAL provides a concise and flexible way to describe how tensors are distributed across machines, allowing for efficient computation on modern high-performance systems.

The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It can generate a single fused kernel for the entire tensor index notation. Computation is described using tensor index notation, and the document provides background information on each of the three components of DISTAL. DISTAL's performance is compared to other systems, and it achieves a speedup over existing systems on dense matrix-matrix multiplication. The document also includes an overview of DISTAL's implementation and contributions.

The compiler generates code for GPUs and allows for optimization and scheduling of loop variables. Communication occurs between processors, and the k loop is split into chunks. Each i and j tile is distributed over all GPUs, and the computation is mapped onto a target machine. The specific contributions of this work include a data distribution language, a compiler for tensor computations, and an implementation of DISTAL that extends the TACO runtime system. The compiler generates Legion programs that interface with a mapper to place data and computation onto memories and processors. DISTAL lets users specialize computation to target machines and offers optimization for data movement; libraries like ScaLAPACK offer similar functionality for distributed machines.

DISTAL allows users to generate bespoke implementations of tensor algebra expressions and can decompose tensor algebra expressions into distributed matrix multiplication and transposition operations.
DISTAL provides abstractions for defining the distribution of both data and computation, allowing for independent optimization. It also allows for adapting the data or computation distribution to the machine through loop transformation-based scheduling. DISTAL can be used to create implementations of any dense tensor algebra expression for modern heterogeneous systems. It manages the non-uniform memory access costs between multiple GPUs and CPU sockets within a single compute node, and its output is code targeting supercomputers.

DISTAL aims to optimize the computation and data distribution of tensor algebra kernels in distributed systems. It allows users to independently describe how tensors and computation map onto target machines. The code generated by DISTAL is competitive with optimized codes for matrix multiplication on distributed systems with multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language to target modern distributed and heterogeneous systems.