AI engine

From Wikipedia, the free encyclopedia

AI engine is a computing architecture created by AMD (originally by Xilinx, which AMD acquired in 2022[1]). It is commonly used to accelerate linear algebra operations[2] such as matrix multiplications, artificial intelligence[3][4] workloads, digital signal processing[5], and, more generally, high-performance computing[6][7]. The first products containing AI engines were the Versal adaptive compute acceleration platforms[8], which combine scalar, adaptable and intelligent engines, all connected through a network on chip (NoC)[9].

Image: Hardware chip example in modern architectures.

AI engines have evolved significantly over the years, adapting to the continuous evolution of modern workloads and finding wide use in AI applications. Across these generations, the basic architecture of a single AI engine combines a vector processor and a scalar processor, offering single instruction, multiple data (SIMD)[10][11] capabilities. In terms of products, AI engines are today integrated alongside other architectures such as FPGAs, CPUs, and GPUs, forming a range of heterogeneous high-performance computing platforms employed in many different domains[12][13][14].

"AI" does not stand for artificial intelligence or adaptable intelligent. Indeed, as specifically asserted by the company support, they do not mean any acronym for the AI word[15].

History

The AMD AI engines were originally released by Xilinx, an American company known for its FPGA advances over the preceding decades. Their initial goal was to accelerate signal processing and, more generally, applications where data parallelism could offer significant improvements. AI engines were first shipped alongside an FPGA layer in the new Versal platforms[8]. The earliest systems, the VCK190 and VCK5000, were built around the VC1902 device and contained 400 AI engines in their AI engine layer. For connectivity, this architecture class relied on an innovative network on chip, a high-performance interconnect intended to become the core connectivity of modern FPGA fabrics[9].

In 2022, the AI engine project evolved when Xilinx was officially acquired by AMD[1], another American company well established in the computing architecture market. The AI engines were integrated with other computing systems to target a wider range of applications, with particular benefits for AI workloads. Although the Versal architecture proved powerful, it was complicated and unfamiliar to a large segment of the academic and industrial community[12], and its steep learning curve slowed adoption. For this reason, AMD, along with third-party developers, began releasing improved toolsets and software stacks aimed at simplifying the programming challenges posed by the platform, targeting productivity and programmability[16][17][18][19].

Aware of the needs of AI workloads, in 2023 AMD announced the AI engine ML (AIE-ML)[20], the second generation of the architecture. It added support for AI-specific data types such as bfloat16[21], a common data type in deep learning applications. This version retained the vector processing capabilities of its predecessor but enlarged the local memory to support more intermediate computations[22]. From this generation onward, AMD integrates AI engines with other processing units such as CPUs and GPUs in its Ryzen AI processors. In such systems, AI engines are usually referred to as Compute Tiles and are combined with other tile types[16][23], namely the Memory Tile and the Shim Tile. The layer containing the three interconnected kinds of tiles is named XDNA[24], and its first generation, XDNA 1, shipped in Ryzen AI "Phoenix" PCs. Alongside this release, AMD continued its research on programmability, releasing Riallto as an open-source tool[25].

Along the same path, in late 2023 and early 2024, AMD announced XDNA 2 together with the "Strix" series of Ryzen AI architectures[26][27]. Compared with the first generation of XDNA, the second provides more compute units to target the heavier workloads of modern ML systems. Continuing its programmability efforts, AMD released the open-source Ryzen AI SW toolchain, which includes the tools and runtime libraries for optimizing and deploying AI inference on Ryzen AI PCs[24].

Lastly, as neural processing and deep learning applications spread across different domains, researchers and industry increasingly refer to XDNA architectures as neural processing units (NPUs). The term, however, covers any architecture designed specifically for deep learning workloads[28], and several companies, such as Huawei[29] and Tesla[30], offer their own alternatives.

Hardware architecture

AI engine tile

Image: First generation of AI engine single tile scheme, offering vector processing capability and a 32 KB data memory.

A single AI engine is a 7-way VLIW[11][31] processor that offers vector and scalar capabilities, enabling parallel execution of multiple operations per clock cycle. The architecture includes a 128-bit wide vector unit capable of SIMD (single instruction, multiple data) execution, a scalar unit for control and sequential logic, and a set of load/store units for memory access. The maximum vector register size is 1024 bits, so the number of lanes depends on the vector data type[31]: a 1024-bit register holds, for example, 32 int32 elements or 128 int8 elements.

In the first generation, each AI engine tile has 32 KB of data memory for holding partial computations and 16 KB of program memory[31].

AI engines are statically scheduled architectures. As widely studied in the literature, static scheduling can suffer from code explosion, so writing AI engine kernels often requires manual code optimizations to mitigate this side effect[19][11].

The main programming language for a single AI engine is C++, used both to declare the connections among multiple engines and to express the kernel logic executed by a specific AI engine tile[32]. However, different toolchains can offer support for other programming languages, targeting specific applications or offering automation[19].
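
As an illustration of this programming model, the following is a minimal, hedged sketch of what a single-tile kernel can look like when written with AMD's AIE C++ APIs. The kernel name, buffer length (256 int32 elements) and vector width (8 lanes) are assumptions made for the example, not values prescribed by the platform.

```cpp
// Minimal sketch of a single AI engine kernel using the AIE C++ APIs
// shipped with Vitis. Kernel name, buffer length and vector width are
// illustrative assumptions.
#include <adf.h>
#include <aie_api/aie.hpp>

void vadd(adf::input_buffer<int32>& in_a,
          adf::input_buffer<int32>& in_b,
          adf::output_buffer<int32>& out)
{
    auto pa = aie::begin_vector<8>(in_a);   // vector iterators over the I/O buffers
    auto pb = aie::begin_vector<8>(in_b);
    auto po = aie::begin_vector<8>(out);

    for (unsigned i = 0; i < 256 / 8; ++i) {
        aie::vector<int32, 8> va = *pa++;   // SIMD load from local memory
        aie::vector<int32, 8> vb = *pb++;
        *po++ = aie::add(va, vb);           // SIMD add on the vector unit
    }
}
```

Because the engine is statically scheduled, loop trip counts known at compile time, as in this sketch, generally help the compiler produce an efficient schedule.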

First generation - the AI engine layer

Image: Multiple first-generation AI engines connected together in a single AIE layer.

In the first generation of Versal systems, each AI engine is connected to multiple other engines through three main interfaces, namely the cascade, memory and stream interfaces. Each represents a different mechanism by which an AI engine can communicate with the others[6].

The AI engine layer of the first Versal systems combined 400 AI engines. Each AI engine has 32 KB of local memory, which can be extended up to 128 KB by borrowing the memory of neighbouring engines; this reduces the number of engines available for computation but provides larger data memory[8][19].

Each AI engine can execute an independent function, or multiple functions through time multiplexing. The programming structure used to describe the instantiation, placement and connection of AI engines is called the AIE graph. The official programming model from AMD requires writing this graph in C++. However, various programming toolchains, from both companies and researchers, support alternatives that improve programmability and/or performance[19][23].
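
A hedged sketch of such an AIE graph follows, reusing the `vadd` kernel from the previous example. The PLIO names, bit widths and test-data files are invented for illustration and are not taken from AMD documentation.

```cpp
// Illustrative AIE graph (sketch). PLIO names, bit widths and data files
// are assumptions made for the example.
#include <adf.h>

void vadd(adf::input_buffer<int32>& in_a,
          adf::input_buffer<int32>& in_b,
          adf::output_buffer<int32>& out);   // kernel defined in vadd.cc

class simple_graph : public adf::graph {
public:
    adf::kernel      k;
    adf::input_plio  in_a, in_b;   // connections towards the PL / NoC side
    adf::output_plio out;

    simple_graph() {
        k    = adf::kernel::create(vadd);
        in_a = adf::input_plio::create("in_a", adf::plio_32_bits, "data/in_a.txt");
        in_b = adf::input_plio::create("in_b", adf::plio_32_bits, "data/in_b.txt");
        out  = adf::output_plio::create("out", adf::plio_32_bits, "data/out.txt");

        adf::connect(in_a.out[0], k.in[0]);  // route data into the kernel
        adf::connect(in_b.out[0], k.in[1]);
        adf::connect(k.out[0], out.in[0]);

        adf::source(k) = "vadd.cc";             // file containing the kernel body
        adf::runtime<adf::ratio>(k) = 0.9;      // share of the tile reserved for this kernel
    }
};
```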

To compile the application, the original toolchain relies on a closed-source AI engine compiler that automatically performs placement and routing, although custom placement constraints can be specified when writing the AIE graph[33].
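
For instance, assuming the `simple_graph` sketch above, a constraint of the following kind can be added inside the graph constructor to pin a kernel to a specific tile; the coordinates are arbitrary and would need to match the target device's array dimensions.

```cpp
// Hypothetical placement constraint, added inside the simple_graph
// constructor: pin kernel k to the AI engine tile in column 10, row 0.
adf::location<adf::kernel>(k) = adf::tile(10, 0);
```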

Because AI engines were initially integrated only in Versal systems, combining them with FPGA fabric and network-on-chip connectivity, this architectural layer also offers a limited number of direct connections to both. Such connections need to be specified both in the AIE graph, to ensure correct placement of the AI engines, and during system-level design[19][7].

Second generation - the AI engine ML

The second generation of AMD's AI engines, the AI engine ML (AIE-ML), introduces several architectural changes with respect to the first generation, focusing on performance and efficiency for machine-learning workloads[22].

AIE-ML offers almost twice the compute density per tile, improved memory bandwidth, and native support for data types optimized for AI inference workloads, such as INT8 and bfloat16. These optimizations allow the second-generation engine to deliver up to three times more TOPS per watt than the original AI engine, which was primarily built for DSP-heavy workloads and required explicit SIMD programming and hand-coded data partitioning[3].
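
As a rough illustration of the new data-type support, the following hedged sketch shows an element-wise bfloat16 multiplication as it might be written with the AIE C++ vector APIs on AIE-ML. The 16-lane width and 256-element buffer length are assumptions for the example.

```cpp
// Sketch of an element-wise bfloat16 multiply kernel targeting AIE-ML,
// which natively supports bfloat16 vectors. Buffer length and lane count
// are illustrative assumptions.
#include <adf.h>
#include <aie_api/aie.hpp>

void bf16_mul(adf::input_buffer<bfloat16>& a,
              adf::input_buffer<bfloat16>& b,
              adf::output_buffer<bfloat16>& out)
{
    auto pa = aie::begin_vector<16>(a);
    auto pb = aie::begin_vector<16>(b);
    auto po = aie::begin_vector<16>(out);

    for (unsigned i = 0; i < 256 / 16; ++i) {
        aie::vector<bfloat16, 16> va = *pa++;
        aie::vector<bfloat16, 16> vb = *pb++;
        auto acc = aie::mul(va, vb);          // products accumulate in wider float precision
        *po++ = acc.to_vector<bfloat16>();    // narrow back to bfloat16 for output
    }
}
```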

Recent publications from researchers and institutions[34] confirm that AIE-ML offers greater scalability, more on-chip memory, and more computational power[3], making it better suited for modern edge ML inference workloads. These advances collectively address the limitations of the first generation[22].

According to the company's official documentation, the two architectures share key similarities but also differ in several respects[22].

XDNA 1

Image: Simplified diagram of an AMD XDNA NPU as found in Ryzen 7040 processors[23]

The XDNA is the hardware layer combining three types of tiles[23][24]:

  • The Compute Tile (an AI engine ML) executes the vector and scalar operations.
  • The Memory Tile provides 512 KB of local memory and performs pattern-specific data movements to serve Compute Tile fetch requests.
  • The Shim Tile handles interaction with host memory and controls the data exchanges between the Memory and Compute Tiles.

The XDNA architecture is combined with other architectural layers, such as CPUs and GPUs, in the Ryzen AI "Phoenix" architectures, forming AMD's product line for energy-efficient inference and AI workloads[23].

XDNA 2

The second generation of the XDNA layer is integrated in the Ryzen AI "Strix" architecture; official documents from the producer describe it as specifically tailored for LLM inference workloads[24].

Tools and programming model

The main programming environment for the AI engine officially supported by AMD is the Vitis flow, which uses the Vitis toolchain to program the hardware accelerator[32][35][7].

Vitis offers a unified development environment for both hardware and software developers, including high-level synthesis, RTL-based flows, and domain-specific libraries[36]. It enables applications to be deployed onto heterogeneous platforms combining AI engines, FPGAs, and scalar processors[36].
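
On the host side, applications typically use the Xilinx Runtime (XRT) to load the compiled design and control the AI engine graph. The snippet below is a hedged sketch of that flow; the binary name design.xclbin and the graph name simple_graph are assumptions, and the exact header locations can differ between XRT versions.

```cpp
// Hedged sketch of host-side control through the XRT C++ API. The .xclbin
// name and graph name are assumptions for the example.
#include <xrt/xrt_device.h>
#include <experimental/xrt_graph.h>   // header path may differ across XRT releases

int main() {
    xrt::device device{0};                            // open the first accelerator device
    auto uuid = device.load_xclbin("design.xclbin");  // load the compiled design

    xrt::graph graph{device, uuid, "simple_graph"};   // handle to the AIE graph instance
    graph.run(16);                                    // launch 16 graph iterations
    graph.wait();                                     // block until the graph finishes
    return 0;
}
```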

Newer designs are moving towards an approach that uses Vitis for hardware and IP design while relying on Vivado for system integration and hardware setup. Vivado[37], also part of the AMD toolchain ecosystem, is primarily used for RTL design and IP integration, and offers a GUI-based environment to create block designs and manage synthesis, implementation, and bitstream generation[37].

Within the AI engine layer itself, C++ remains the main programming language, used for both the graph declaration and the kernel logic, as described above[32].

Research toolchains

In parallel with the company's efforts on programming models, design flows and tools, researchers have also proposed their own toolchains targeting programmability, performance, or simplified development for specific classes of applications[19][38][23][18].

Some of the main research toolchains are briefly described below[39][19][38][18].

  • IRON is an open-source toolchain developed by AMD in collaboration with several researchers. The IRON toolchain uses MLIR as its intermediate representation[39]. At the user level, IRON provides a Python API for placing and orchestrating multiple AI engines. This Python code is then translated into MLIR and compiled using one of two backends: a Vitis-based backend and an open-source backend based on the Peano compiler[23]. IRON still relies on C++ for kernel development, supporting all the APIs of the standard AI engine kernel development flow[23].
  • ARIES (An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines) presents a high-level, tile-based programming model and a shared MLIR intermediate representation encompassing both the AI engines and the FPGA fabric. It expresses task-level, tile-level, and instruction-level parallelism in MLIR and accommodates global and local optimization passes. ARIES generates compact C++ code for AI engine kernels and data-movement logic, allowing kernels to be specified in Python. In contrast to IRON, ARIES also supports FPGA programming, and PL integration is performed automatically[19].
  • EA4RCA targets a specialized subclass of algorithms, regular communication-avoiding algorithms. It introduces a design environment optimized for the heterogeneity of Versal devices, emphasizing AI engine performance and high-speed data-streaming abstractions. EA4RCA focuses on algorithms with regular communication patterns in order to exploit the parallelism and memory hierarchy of the Versal platform[38].
  • CHARM is a framework for composing multiple diverse matrix-multiplication accelerators that work concurrently on different layers of a single application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling[18].

See also

References

Further reading
