Summary: DISTAL: Distributed Tensor Algebra Compiler (arxiv.org)
13,481-word PDF document
One Line
DISTAL is a high-performance distributed tensor algebra compiler that optimizes computation and data distribution in heterogeneous systems, generating competitive code for matrix multiplication and achieving near-peak utilization on GPUs.
Key Points
- DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems.
- It allows users to describe tensor and computation mapping independently.
- DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
- The compiler generates code for GPUs and supports scheduling transformations over loop variables.
- DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems.
Summaries
297 word summary
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. It decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. DISTAL generates optimized code for GPUs and manages non-uniform memory access costs. It includes a data distribution language, a compiler for tensor computations, and an extension of the TACO runtime system.

DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions include modeling modern machines, using dimension variables, tensor distribution, and data distribution. The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It achieves a speedup over existing systems on dense matrix-matrix multiplication.

The compiler enables efficient communication and optimization in tensor operations, supporting hierarchical machine models, data distribution, distributed reductions, and the owner-computes paradigm for distributed computations. It models a distributed machine as a grid of processors and helps with tensor distribution on distributed systems. It introduces the concepts of aggregation and rotation to improve performance and supports various algorithms for rectangular matrices with tiled distribution.

DISTAL is a high-performance system that optimizes tensor operations by implementing a tensor distribution notation and supporting various scheduling operations. It focuses on dense tensor algebra for heterogeneous machines, allowing users to create distributed implementations of tensor computations with various formats. DISTAL outperforms other systems and generates competitive code for matrix multiplication.
It incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. DISTAL aims to generate sophisticated distributed algorithms without extensive user input. It performs well on both CPUs and GPUs, achieving near-peak utilization on GPUs by keeping data in GPU framebuffer memory and communicating via NVLink.
584 word summary
DISTAL is a compiler that focuses on dense tensor algebra for heterogeneous machines. It allows users to create distributed implementations of tensor computations with various formats. DISTAL outperforms other systems and generates competitive code for matrix multiplication. It incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. DISTAL aims to generate sophisticated distributed algorithms without extensive user input. Compared to the Cyclops Tensor Framework (CTF), DISTAL offers similar generality, though some shortcomings remain to be addressed.

DISTAL performs well on both CPUs and GPUs, achieving near-peak utilization on GPUs. It keeps data in GPU framebuffer memory and communicates via NVLink. DISTAL optimizes tensor operations by implementing a tensor distribution notation and supporting various scheduling operations. It achieves high performance on different kernels, utilizing GPUs and CPUs for matrix multiplication tasks. The compiler optimizes task placement and partitioning through a mapper and a bounds analysis procedure.
DISTAL introduces the concepts of aggregation and rotation to improve performance and supports various algorithms for rectangular matrices with tiled distribution. It utilizes a 2D grid and scheduling techniques similar to SUMMA, Cannon's, and Johnson's algorithms. Examples, pseudocode, and visualizations are provided to illustrate these concepts.
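To make the rotation-based schedules concrete, here is a minimal single-process simulation of Cannon's algorithm on a virtual p x p processor grid. This is an illustrative sketch, not DISTAL code: the tiling helpers and the square, evenly divisible matrices are assumptions.

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Multiply A @ B by simulating Cannon's algorithm on a p x p grid."""
    n = A.shape[0]
    t = n // p  # tile size; assumes n is divisible by p
    tile = lambda M, i, j: M[i*t:(i+1)*t, j*t:(j+1)*t]
    # Initial skew: processor (i, j) holds A(i, (i+j) mod p), B((i+j) mod p, j).
    a = [[tile(A, i, (j + i) % p).copy() for j in range(p)] for i in range(p)]
    b = [[tile(B, (i + j) % p, j).copy() for j in range(p)] for i in range(p)]
    C = np.zeros_like(A)
    for _ in range(p):
        # Every processor multiplies its current tiles and accumulates.
        for i in range(p):
            for j in range(p):
                C[i*t:(i+1)*t, j*t:(j+1)*t] += a[i][j] @ b[i][j]
        # Rotate: A tiles shift left along rows, B tiles shift up along columns.
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return C
```

After p rotation steps every processor has seen each reduction tile exactly once, so the result matches a plain matrix multiplication.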
DISTAL models a distributed machine as a grid of processors, allowing users to express a virtual machine organization and expose locality in the model. It helps with tensor distribution on distributed systems by allowing users to map tensors onto machines using a tensor distribution notation statement. The tool supports replication of tensor tiles and communication between different parts of the machine hierarchy. Iteration spaces, distribute operation, and execution spaces are discussed to explain the mapping of computations. Examples of tensor algebra expressions and optimal computation methods are provided.
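The mapping function at the heart of tensor distribution notation can be sketched as follows. This is an illustrative blocked distribution in the spirit of a statement like A(x, y) -> Machine(x, y), not DISTAL's actual syntax; the function name and tiling scheme are assumptions.

```python
import math

def owner(coord, tensor_dims, machine_dims):
    """Return the machine-grid coordinate that owns a tensor coordinate.

    Each tensor dimension i is split into machine_dims[i] contiguous
    blocks, so processor (pi, pj, ...) holds one tile of the tensor."""
    proc = []
    for c, n, p in zip(coord, tensor_dims, machine_dims):
        block = math.ceil(n / p)  # tile extent along this dimension
        proc.append(c // block)
    return tuple(proc)

# A 6x6 matrix distributed over a 2x3 processor grid:
# element (4, 5) lives on the bottom-right processor (1, 2).
print(owner((4, 5), (6, 6), (2, 3)))
```

Broadcast or replicated dimensions would map one tensor tile to many machine coordinates instead of exactly one, which this sketch omits.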
The DISTAL Compiler enables efficient communication and optimization in tensor operations, supporting hierarchical machine models, data distribution, distributed reductions, and the owner-computes paradigm for distributed computations.
In summary, DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions include modeling modern machines, using dimension variables, tensor distribution, and data distribution. The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It achieves a speedup over existing systems on dense matrix-matrix multiplication. The document includes an overview of DISTAL's implementation and contributions. The DISTAL compiler generates optimized code for GPUs, facilitating communication between processors and splitting the k loop into chunks. It distributes the i and j tiles across all GPUs and maps the computation onto the target machine. DISTAL includes a data distribution language, a compiler for tensor computations, and an extension of the TACO runtime system. It uses Legion programs to interface with a mapper for data and computation placement, allowing users to specialize computation for target machines and optimize data movement.
DISTAL decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. It provides abstractions for defining data and computation distribution, enabling independent optimization and adaptability to the machine through loop transformation-based scheduling. DISTAL manages non-uniform memory access costs within a single compute node, targeting supercomputers.
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. Users can independently describe tensor and computation mapping. The generated code is competitive with optimized codes for matrix multiplication on multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
1148 word summary
DISTAL is a distributed tensor algebra compiler that optimizes computation and data distribution in distributed systems. It allows users to describe tensor and computation mapping independently. The generated code is competitive with optimized codes for matrix multiplication on multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language for modern distributed and heterogeneous systems.
DISTAL decomposes tensor algebra expressions into distributed matrix multiplication and transposition operations. It provides abstractions for defining data and computation distribution, allowing for independent optimization. It can adapt data or computation distribution to the machine through loop transformation-based scheduling. DISTAL manages non-uniform memory access costs between multiple GPUs and CPU sockets within a single compute node. The output of DISTAL targets supercomputers.
The compiler generates code for GPUs, optimizing and scheduling variables. Communication occurs between processors, and the k loop is split into chunks. The i and j tiles are distributed over all GPUs, and the computation is mapped onto the target machine. The specific contributions of this work include a data distribution language, a compiler for tensor computations, and an implementation of DISTAL that extends the TACO runtime system. DISTAL uses Legion programs to interface with a mapper for data and computation placement. It allows users to specialize computation to target machines and offers optimization for data movement.
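The k-loop chunking described above can be sketched in plain Python as a sequential stand-in for the distributed schedule. Each chunk corresponds to one round of communication followed by a local multiply; the chunk size and function name are illustrative assumptions.

```python
import numpy as np

def chunked_matmul(A, B, chunk):
    """Compute A @ B with the reduction (k) loop split into chunks."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    # Summing the partial products over all chunks restores the full result,
    # which is why the split preserves correctness in the distributed setting.
    for k0 in range(0, k, chunk):
        C += A[:, k0:k0+chunk] @ B[k0:k0+chunk, :]
    return C
```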
The DISTAL compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It can generate a fused kernel for the entire tensor index notation. Computation is described using tensor index notation, with background information on each component of DISTAL provided. DISTAL achieves a speedup over existing systems on dense matrix-matrix multiplication. The document includes an overview of DISTAL's implementation and contributions.
In summary, DISTAL provides a concise and flexible way to describe tensor distribution across machines, enabling efficient computation on modern high-performance systems. The core abstractions of DISTAL include modeling modern machines and using dimension variables, tensor distribution, and data distribution. These abstractions allow users to map data and computation onto distributed machines.

DISTAL models a distributed machine as a grid of processors, allowing users to express a virtual machine organization and expose locality in the model. It helps with tensor distribution on distributed systems by allowing users to map tensors onto machines using a tensor distribution notation statement. The tool supports replication of tensor tiles and communication between different parts of the machine hierarchy.

The document discusses the concept of iteration spaces and their mapping onto execution spaces, introduces the distribute operation to transform iterations, and explains execution spaces that model the computation process. It provides examples of tensor algebra expressions and optimal computation methods. The document also explains tensor distribution, including hierarchical data distributions, and provides examples of tensor distributions and their mappings.

The DISTAL compiler enables efficient communication and optimization in tensor operations, supports hierarchical machine models and data distribution, and provides commands for distributing variables and dimensions onto processors in the machine. It supports distributed reductions and the owner-computes paradigm for distributed computations, optimizes communication and memory usage in tensor operations, and introduces the concepts of aggregation and rotation to improve performance.
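The distribute operation over an iteration space can be sketched as below, under the assumption of a simple blocked mapping; the function name and tiling scheme are illustrative, not DISTAL's API.

```python
def distribute(I, J, pi, pj):
    """Map each point (i, j) of an I x J iteration space onto a pi x pj
    processor grid; returns {processor coordinate: [iteration points]}."""
    ti, tj = -(-I // pi), -(-J // pj)  # ceiling division gives tile extents
    placement = {}
    for i in range(I):
        for j in range(J):
            # Each processor executes one contiguous tile of iterations.
            placement.setdefault((i // ti, j // tj), []).append((i, j))
    return placement
```

For a 4 x 4 space over a 2 x 2 grid, each of the four processors receives a 2 x 2 tile of iterations, which is the kind of assignment an execution space then orders in time.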
The compiler supports various algorithms and can handle rectangular matrices with tiled distribution. It utilizes a 2D grid and scheduling techniques similar to SUMMA, Cannon's, and Johnson's algorithms. The document provides examples, pseudocode, and visualizations to illustrate these concepts. Solomonik's, PUMMA, Cannon's, and Johnson's algorithms are discussed in detail, with communication patterns and pseudocode provided for Cannon's and Johnson's algorithms.

DISTAL implements a tensor distribution notation and supports various scheduling operations. It uses GPUs and CPUs for matrix multiplication tasks and is compared against systems and algorithms such as COSMA, ScaLAPACK, CTF, and SUMMA. DISTAL achieves high absolute performance on different kernels, with experiments conducted on the Lassen supercomputer using NVIDIA Volta V100 GPUs and IBM Power9 CPUs. The system optimizes task placement and partitioning through a mapper and a bounds analysis procedure.

On matrix multiplication, DISTAL performs well on both CPUs and GPUs. On CPUs, it performs equally to or better than other systems, especially with 4 cores per node. On GPUs, DISTAL's performance is competitive, with all of its kernels achieving twice the performance of COSMA. The system keeps data in GPU framebuffer memory and communicates via NVLink, achieving near-peak utilization.

DISTAL is compared to the Cyclops Tensor Framework (CTF) in terms of generality and is found to offer similar capabilities, with some shortcomings to be addressed in the future. In some configurations, DISTAL's kernels perform worse than COSMA and show larger performance variations due to communication costs. Solomonik's 2.5D algorithm outperforms Johnson's algorithm on non-square node counts, and COSMA achieves high performance when the number of GPUs is a perfect cube but performs worse for non-cubes.
Cannon's algorithm performs better than SUMMA and PUMMA in terms of communication in the rectangular case. Overall, COSMA's performance varies depending on node counts, with 2D algorithms performing well at square node counts and achieving peak GFLOP/s on a single GPU.

DISTAL focuses on separating data and computation distributions. It utilizes the same strategy as CTF for single-node computation, and scheduling leaf kernels is important for achieving peak utilization. It outperforms CTF by generating a bespoke implementation for a target kernel rather than reshaping tensors. Related work has extended relational database engines to support distributed algebra algorithms; DISTAL instead implements specialized algorithms for tensor algebra operations. The GPU implementation of the inner-product kernel in DISTAL has similar performance characteristics to the CPU implementation, and DISTAL outperforms CTF in terms of bandwidth and execution time.

DISTAL allows for the expression of various distributed algorithms and tensor distributions, and it combines data distribution descriptions with computation scheduling. Unlike other systems, DISTAL aims to generate sophisticated distributed algorithms without requiring extensive user input. It is a compiler for dense tensor algebra that targets modern, heterogeneous machines, allowing for the independent specification of computation and data distribution so that users can create distributed implementations of any desired tensor computation with any set of tensor formats. DISTAL outperforms existing systems and generates competitive code for matrix multiplication.

Related work covers tensor algebra compilation and distributed numerical and machine learning computations, including tensor contractions on GPUs, automatic generation of communication code for distributed programs, and optimized parallel recursive rectangular matrix multiplication.
DISTAL incorporates parallel universal matrix multiplication algorithms and is compatible with distributed memory concurrent computers. The paper cites various research papers and conference proceedings related to parallel computation, high-performance computing, and code generation for distributed memory machines. These references provide valuable insights and techniques in the field of distributed tensor algebra and high-performance computing.
2709 word summary
In this document, several references are cited that relate to distributed tensor algebra compilers and high-performance computing. The references include papers on a high-performance graph DSL, framework-agnostic high-performance machine learning, tensor comprehensions, domain-specific languages and high-level frameworks, massively parallel tensor contraction, communication-optimal sparse tensor algebra, distributed tensor computations, the Halide language and compiler, scalable linear algebra on a relational database system, High Performance Fortran, red-blue pebbling and its applications, and tensor decompositions. These references provide valuable insights and techniques in the field of distributed tensor algebra and high-performance computing.

DISTAL focuses on tensor algebra compilation and distributed numerical and machine learning computations. It includes tensor contractions on GPUs and generates communication code automatically for distributed programs. Related work addresses optimized parallel recursive rectangular matrix multiplication, parallel universal matrix multiplication algorithms (PUMMA), and scalable linear algebra libraries for distributed memory concurrent computers such as ScaLAPACK, as well as TVM, an optimizing compiler for deep learning. The paper cites various research papers and conference proceedings related to parallel computation, high-performance computing, and code generation for distributed memory machines. The work was supported by the Department of Energy and the National Nuclear Security Administration.
Rohan Yadav, Alex Aiken, and Fredrik Kjolstad have developed DISTAL, a compiler for dense tensor algebra that targets modern, heterogeneous machines. DISTAL allows for the independent specification of desired computation and data distribution, enabling users to create distributed implementations of any desired tensor computation with any set of tensor formats. It outperforms existing systems and generates code competitive with hand-optimized implementations of matrix multiplication.
Future work includes extending DISTAL with support for sparse tensors, exploring auto-scheduling and auto-formatting frameworks, and investigating its potential applications in training and evaluating distributed deep learning models. The static-dynamic approach of DISTAL allows for the expression of complex communication patterns and data distributions statically, while discharging lower level data movement operations to a runtime system. This design decision provides flexibility in expressing different algorithms and adaptability when integrating with existing codes.

DISTAL is a distributed tensor algebra compiler that focuses on separating data and computation distributions. It allows for the expression of various distributed algorithms and tensor distributions. DISTAL is the first system of its kind and differs from previous work by Bondhugula and Amarasinghe et al. It utilizes static analysis to determine communication partners and generates runtime calls to complement the distribution of processors. It starts from a higher level representation that enables the expression of communication information for computation distribution. DISTAL combines data distribution descriptions with computation scheduling and supports different distributed layouts.

Unlike DISTAL, other systems such as Tiramisu do not have a data distribution language or a rotate command. DISTAL aims to generate sophisticated distributed algorithms without requiring extensive user input. Other systems that support targeting distributed machines include Distributed Halide and Tiramisu, and DSL compilers have been developed for single-node linear algebra and distributed tensor computations. Related work extends relational database engines to support distributed algebra algorithms. Distributed algorithms for tensor algebra can improve upon the interpreted approach used by the Cyclops Tensor Framework (CTF).
The Legion team plans to address the performance shortcomings of their system when dealing with replicated regions and inter-node communication. DISTAL implements specialized algorithms for tensor algebra operations and achieves high efficiency on both CPUs and GPUs. The GPU implementation of the inner-product kernel in DISTAL has similar performance characteristics to the CPU implementation. DISTAL also outperforms CTF in terms of bandwidth and execution time, and the DISTAL schedule using matrix multiplications improves performance for the TTV, innerprod, and MTTKRP operations.

DISTAL utilizes the same strategy as CTF for single-node computation. Leaf kernel performance heavily impacts the overall performance of a distributed computation, so scheduling leaf kernels is important for achieving peak utilization on a single node. The same algorithms (such as MTTKRP) and distribution schedules were used for the CPU and GPU kernels, and input tensors were distributed in a row-major layout to minimize inter-node communication. CTF's GPU backend was not built, so only CPU results are reported for it; its performance was best at 40 ranks per node. Weak-scaling experiments showed that DISTAL outperforms CTF on each higher-order tensor expression.

DISTAL's scheduling primitives provide a mechanism for future work to target when automatically scheduling computations for distribution. While CTF fully automates the distribution process, DISTAL users must provide a schedule for their computations. This lets users develop a schedule that implements an optimal strategy for each kernel, and DISTAL outperforms CTF by generating a bespoke implementation for a target kernel rather than reshaping tensors. CTF decomposes arbitrary tensor operations into calls to distributed matrix multiplication; operations such as TTM and MTTKRP have real-world applications in tensor decompositions. The paper evaluates the generality of DISTAL over such tensor expressions.
Compared to the Cyclops Tensor Framework (CTF), DISTAL offers similar generality, and the Legion team plans to address some shortcomings in the future. DISTAL's kernels perform worse than COSMA in some configurations and experience larger performance variations due to communication costs. Solomonik's 2.5D algorithm achieves better performance than Johnson's algorithm on non-square node counts. The COSMA implementation achieves high performance when the number of GPUs is a perfect cube, but for non-cubes it achieves worse performance. Cannon's algorithm outperforms SUMMA and PUMMA in terms of communication in the rectangular case. Overall, the performance of COSMA varies depending on node counts; the 2D algorithms perform well at square node counts and achieve peak GFLOP/s on a single GPU.

DISTAL is compared to other systems in terms of performance on matrix multiplication tasks, with experiments conducted on both CPUs and GPUs. DISTAL's kernels keep data in GPU framebuffer memory and communicate via NVLink, achieving near-peak utilization. On CPUs, DISTAL performs equally to or better than other systems, with the best performance achieved with 4 cores per node. On GPUs, DISTAL's performance is also competitive, with all of its kernels achieving twice the performance of COSMA. The results show that DISTAL is a promising compiler for distributed tensor algebra computations.

DISTAL utilizes GPUs and CPUs to perform matrix multiplication and is compared against algorithms and systems including COSMA, ScaLAPACK, CTF, and SUMMA. The evaluation shows that DISTAL achieves high absolute performance on various kernels. The experiments were conducted on the Lassen supercomputer using NVIDIA Volta V100 GPUs and IBM Power9 CPUs, with code compiled with GCC 8.3.1 -O3 and CUDA 11.1. The system utilizes a mapper and a bounds analysis procedure to optimize task placement and partitioning.
Lowering to Legion uses GASNet-EX for inter-node communication. Legion handles the movement of data through specialized channels and manages the allocation of memory regions. Communication in Legion is implicit, and the desired data and memory placements are described through Legion's mapping interface. Tasks are the unit of computation in Legion, and regions are used to represent distributed data structures. GPU lowering follows a similar process to TACO's. Legion performs dynamic analysis for communication and supports features necessary for high performance on modern machines. The implementation involves steps such as constructing a concrete index notation statement and translating it into Legion's API.

The text also discusses the implementation of the tensor distribution notation and the scheduling operations supported in DISTAL: the placement of tensors into a distribution, the lowering of tensor distribution notation, and the communication and rotation operations. It describes the divide and distribute transformations, as well as the concrete index notation used in the compiler, and notes that DISTAL implements the distribution layer of COSMA.

Solomonik's algorithm interpolates between the 2D and 3D algorithms and operates on a processor cube; it uses a broadcast to communicate matrices and applies systolic patterns. The PUMMA algorithm generalizes 2D matrix multiplication to rectangular matrices and block-cyclic distributions. Johnson's algorithm distributes matrices to different faces of the processor cube and partitions them, targeting a 3D processor grid. Cannon's algorithm rotates each processor's iteration space, which changes the communication pattern. Figure 12 illustrates the communication pattern in Cannon's algorithm, and the pseudocode for Cannon's and Johnson's algorithms is shown in Figures 11 and 13, respectively.
The DISTAL compiler allows for efficient computation of tensor operations. It utilizes a schedule similar to SUMMA's algorithm, in which processors broadcast chunks of data to the other processors in their rows and columns; under Cannon's algorithm, processors instead shift tiles along their rows and columns. The communication pattern and target machine organization are organized as a 2D grid. The compute statement is scheduled using SUMMA, with each processor performing local matrix multiplications on chunks of data. A dimension of the iteration space is distributed, and communication is scheduled accordingly. A whole family of matrix-multiplication algorithms can be implemented using DISTAL's scheduling commands, and the document includes pseudocode examples for them.

DISTAL supports algorithms like SUMMA, Cannon's, and Johnson's. The algorithms are implemented using a 2D grid, and communication patterns are optimized to reduce communication. The compiler can handle rectangular matrices and supports tiled distribution of input matrices. The techniques used in DISTAL are not specific to matrix multiplication and can be applied to other tensor algebra operations as well.

DISTAL introduces the concept of rotate, which allows the same iteration of a loop to occur at different times on different processors. The execution space is rotated in time so that processors execute different iterations at the same point in time, avoiding contention for the same pieces of data. The rotate operation is used to optimize the communication pattern in distributed algorithms and can improve performance. The communicate command is not necessary for correctness but can be used to further optimize performance; the scheduling language used in DISTAL only affects performance, not correctness.
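The broadcast-based SUMMA schedule described above can be sketched as a single-process simulation. This is illustrative only: the square grid, evenly divisible matrices, and helper names are assumptions, and the "broadcast" is simulated by direct slicing.

```python
import numpy as np

def summa_matmul(A, B, p):
    """Multiply A @ B by simulating SUMMA on a p x p processor grid."""
    n = A.shape[0]
    t = n // p  # tile size; assumes n is divisible by p
    C = np.zeros_like(A)
    for k in range(p):
        # Step k: the k-th block column of A is broadcast along processor
        # rows, and the k-th block row of B along processor columns.
        Acol = A[:, k*t:(k+1)*t]
        Brow = B[k*t:(k+1)*t, :]
        # Every processor (i, j) then does a local multiply-accumulate.
        for i in range(p):
            for j in range(p):
                C[i*t:(i+1)*t, j*t:(j+1)*t] += (
                    Acol[i*t:(i+1)*t, :] @ Brow[:, j*t:(j+1)*t])
    return C
```

Contrast with Cannon's algorithm: SUMMA re-broadcasts fresh blocks each step, while Cannon's shifts already-held tiles systolically, which is the contention the rotate command is designed to express.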
The document provides examples and visualizations to illustrate the concepts of rotate and systolic communication patterns. The DISTAL Distributed Tensor Algebra Compiler allows for efficient communication and optimization in tensor operations. The communicate(T, i) command aggregates communication for each tensor into a single message, allowing for optimization of memory usage and tradeoff between memory usage and communication frequency. The choice of how much communication to aggregate affects the execution space and can be visualized in Figure 7b. Communication operations can be made more efficient by aggregating them into larger operations that fetch the data from memory. The communicate command is automatically inserted at each iteration space point where data needs to be communicated.
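The aggregation tradeoff behind communicate(T, i) can be illustrated with simple message arithmetic: aggregating the inner loop's transfers into one message per outer iteration sends far fewer, larger messages at the cost of buffering more data at once. The function below is a hypothetical model, not part of DISTAL.

```python
def message_counts(n_outer, n_inner):
    """Compare message counts with and without inner-loop aggregation."""
    per_element = n_outer * n_inner  # one small message per inner iteration
    per_iteration = n_outer          # one large message per outer iteration
    return per_element, per_iteration

# 64 outer iterations over a 1024-element inner loop:
# 65,536 small messages without aggregation vs. 64 large ones with it.
print(message_counts(64, 1024))
```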
The DISTAL compiler also supports hierarchical machine models and data distribution. The distribute command can be applied hierarchically to match machine models and distribute data into the output tensor, trading space usage for increased parallelism. The compiler also supports distributed reductions and owner-computes paradigm for distributed computations.
The document provides code examples and commands for distributing variables and dimensions onto processors in the machine. The reorder and divide commands are used to map iteration space dimensions onto each processor. The compound command demonstrates the use of distribute, divide, and reorder commands to tile the iteration space dimensions onto a machine.
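The compound divide + reorder tiling can be sketched as a plain loop-nest transformation: divide splits loops i and j into outer/inner pairs, and reorder hoists the outer loops, which distribute can then map onto a processor grid. The function and its loop structure are an illustrative sketch, not DISTAL's scheduling language.

```python
def tiled_iterations(I, J, ti, tj):
    """Visit an I x J iteration space in ti x tj tiles."""
    visited = []
    for io in range(0, I, ti):            # divide(i) -> (io, ii)
        for jo in range(0, J, tj):        # divide(j) -> (jo, ji), reordered out
            for ii in range(io, min(io + ti, I)):
                for ji in range(jo, min(jo + tj, J)):
                    visited.append((ii, ji))
    return visited

# The tiled nest covers exactly the same points as the untiled double
# loop, just in a different order -- which is why such scheduling
# transformations affect performance but not correctness.
assert sorted(tiled_iterations(5, 7, 2, 3)) == \
    [(i, j) for i in range(5) for j in range(7)]
```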
Overall, the DISTAL compiler enables efficient communication and optimization in tensor operations, supports hierarchical machine models and data distribution, and provides commands for distributing variables and dimensions onto processors in the machine.

The document explains the concept of iteration spaces and how they are mapped onto execution spaces. The distribute operation is introduced as a way to transform the execution of iterations, and execution spaces model the computation process. An example of a tensor algebra expression is provided, along with a discussion of the optimal way to compute it. The concept of tensor distribution is explained, including hierarchical data distributions, and examples of tensor distributions and their corresponding mappings are given.

DISTAL allows tensors to be mapped onto machines in different ways using a tensor distribution notation statement. The statement consists of a tensor, a mapping function, and a machine. The mapping function assigns coordinates of the tensor to processors in the machine; the machine can have multiple dimensions, and the mapping function can specify fixed dimensions, broadcasted dimensions, and colored dimensions. DISTAL also supports replication of tensor tiles and communication between different parts of the machine hierarchy.

In this way, users can express how tensors are distributed across machines: the tensor distribution notation describes the mapping of tensor dimensions to machine dimensions, including statements that partition tensor dimensions across machine dimensions and can fix the partition or broadcast it.
The syntax for tensor distribution notation is described in Figure 4.
DISTAL models a distributed machine as a multidimensional grid of abstract processors. Each processor has associated local memory and can communicate with all other processors. This grid abstraction allows users to express a virtual machine organization and expose locality in the model.
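The grid abstraction can be sketched as a mapping from linear processor ranks to coordinates of a user-chosen virtual grid, so the same set of processors can be viewed as 8, 2 x 4, or 2 x 2 x 2. This is an illustrative model, not DISTAL's machine API.

```python
import itertools
import math

def as_grid(n_procs, shape):
    """View n_procs processors as a virtual multidimensional grid,
    returning {grid coordinate: linear processor rank}."""
    assert math.prod(shape) == n_procs, "grid must cover all processors"
    return {coord: rank
            for rank, coord in enumerate(
                itertools.product(*map(range, shape)))}

# The same 8 processors under two virtual organizations:
print(as_grid(8, (2, 4))[(1, 3)])        # rank 7
print(as_grid(8, (2, 2, 2))[(1, 1, 1)])  # rank 7
```

Choosing the grid shape is how a user exposes locality: neighboring grid coordinates can be mapped to processors that share fast interconnect.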
The core abstractions of DISTAL include the modeling of modern machines and the use of dimension variables, tensor distribution, and data distribution. These abstractions allow users to map data and computation onto a distributed machine.
Overall, DISTAL provides a concise and flexible way to describe how tensors are distributed across machines, allowing for efficient computation on modern high-performance systems.

The compiler allows users to specify the sparse format of tensors and introduces three new scheduling commands for computations. It can generate a single fused kernel for the entire tensor index notation. Computation is described using tensor index notation, and the document provides background information on each of the three components of DISTAL. DISTAL's performance is compared to other systems, and it achieves a speedup over existing systems on dense matrix-matrix multiplication. The document also includes an overview of DISTAL's implementation and contributions.

The compiler generates code for GPUs and allows for optimization and scheduling of loop variables. Communication occurs between processors, and the k loop is split into chunks. Each i and j tile is distributed over all GPUs, and the computation is mapped onto a target machine. The specific contributions of this work include a data distribution language, a compiler for tensor computations, and an implementation of DISTAL that extends the TACO runtime system. The compiler generates Legion programs that interface with a mapper to place data and computation onto memories and processors. DISTAL lets users specialize computation to target machines and offers optimization for data movement; libraries like ScaLAPACK offer similar functionality for distributed machines.

DISTAL allows users to generate bespoke implementations of tensor algebra expressions and can decompose tensor algebra expressions into distributed matrix multiplication and transposition operations.
DISTAL provides abstractions for defining the distribution of both data and computation, allowing for independent optimization. It also allows for adapting the data or computation distribution to the machine through loop transformation-based scheduling. DISTAL can be used to create implementations of any dense tensor algebra expression for modern heterogeneous systems. It manages the non-uniform memory access costs between multiple GPUs and CPU sockets within a single compute node, and its output is code targeting supercomputers.

DISTAL aims to optimize the computation and data distribution of tensor algebra kernels in distributed systems. It allows users to independently describe how tensors and computation map onto target machines. The code generated by DISTAL is competitive with optimized codes for matrix multiplication on distributed systems with multi-core CPUs and multiple GPUs. DISTAL supports a distributed task-based runtime system and can compile a tensor algebra domain-specific language to target modern distributed and heterogeneous systems.