Intel AMX (Advanced Matrix Extensions) Explained

Intel AMX accelerator on the horizon.


The recent launch of AI-based end-user applications, like ChatGPT, captured the public’s attention and sent ripples throughout the business community. For many SMEs, once elusive concepts like AI, machine learning, and big data suddenly became tangible business opportunities. To support this AI-driven transformation, organizations need significantly more processing power.

For this reason, Intel has equipped its 4th Gen Xeon Scalable processors with the AMX extension to speed up high-performance computing workloads.

Learn how AMX works and how to use it to optimize your AI development pipelines.


What Is Intel AMX (Advanced Matrix Extensions)?

Intel Advanced Matrix Extensions (AMX) is an instruction set extension integrated into 4th Gen Intel Xeon Scalable CPUs. The AMX extension is designed to accelerate matrix-oriented operations, which are primarily used in training deep neural networks (DNNs) and AI inference.

By introducing new data types and instructions, AMX aims to streamline matrix multiplication and accumulation operations and reduce power consumption.

Deep learning frameworks like PyTorch and TensorFlow can leverage AMX data types and instructions. This allows developers to run and optimize AI inferencing routines without handling hardware specifics.
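
Before relying on a framework to pick up AMX, it can help to confirm that the CPU and operating system actually expose it. The following is a minimal sketch, assuming a Linux host where the kernel lists the feature flags amx_tile, amx_bf16, and amx_int8 in /proc/cpuinfo. The helper names (amx_flags, read_cpuinfo) are illustrative and not part of any Intel or framework API.

```python
from pathlib import Path

# Feature flag names as reported by recent Linux kernels in /proc/cpuinfo.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_flags(cpuinfo_text: str) -> dict:
    """Return which AMX feature flags appear in the given cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return {name: name in flags for name in AMX_FLAGS}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read cpuinfo; return an empty string if the file is unavailable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    for name, present in amx_flags(read_cpuinfo()).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If all three flags are missing on a 4th Gen Xeon host, the cause is often an older kernel or a hypervisor that does not pass the feature through, rather than the CPU itself.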

The relationship between machine learning, deep learning, and AI.

AMX Architecture

AMX uses an innovative tile structure to increase the density of matrix operations. By facilitating parallel computations, it opens the way for significant performance enhancements in AI and machine learning tasks.

The most important components of Intel AMX architecture include:

1. Tile-Based Architecture: Data is stored in eight two-dimensional (2D) tile registers, each up to 1 kilobyte in size. Tiles keep large chunks of data near the execution units, which improves data reuse and reduces memory bandwidth requirements for matrix operations.

2. Tile Matrix Multiplication (TMUL): TMUL is an accelerator engine attached to the tiles that performs matrix-multiply computations on the data they hold. It targets dense linear algebra workloads, which are essential for AI training and inference.

Intel AMX high level architecture.

The tile-based architecture allows Intel AMX to store more data in each core and compute larger matrices in a single operation.
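
To make the tile idea concrete, here is a simplified pure-Python sketch of a blocked matrix multiply. It assumes the architectural tile limits of 16 rows by 64 bytes (1 KB), so one tile row holds 64 INT8 or 32 BF16 elements, and it accumulates into 16 x 16 output blocks. It does not model the real instruction set or the memory layout AMX expects for the second operand; the function names are illustrative only.

```python
TILE_ROWS = 16        # maximum rows per tile
TILE_ROW_BYTES = 64   # maximum bytes per tile row (16 * 64 = 1 KB)


def tile_cols(element_bytes: int) -> int:
    """Number of elements that fit in one tile row for a given element size."""
    return TILE_ROW_BYTES // element_bytes


def tiled_matmul(a, b, element_bytes=1):
    """Multiply two list-of-lists matrices block by block.

    Each block of A is at most TILE_ROWS x tile_cols(element_bytes), and each
    output block is at most TILE_ROWS x TILE_ROWS. Partial products from every
    block of the shared dimension are accumulated into the output.
    """
    m, k, n = len(a), len(b), len(b[0])
    kb = tile_cols(element_bytes)
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, TILE_ROWS):          # block of output rows
        for j0 in range(0, n, TILE_ROWS):      # block of output columns
            for k0 in range(0, k, kb):         # block of the shared dimension
                for i in range(i0, min(i0 + TILE_ROWS, m)):
                    for j in range(j0, min(j0 + TILE_ROWS, n)):
                        acc = 0
                        for p in range(k0, min(k0 + kb, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


if __name__ == "__main__":
    print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The three outer loops are where the hardware benefit comes from: each input block is loaded once and reused for a whole block of outputs instead of being fetched from memory for every element.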

AMX Data Types

The FP32 floating-point format is used in AI workloads for its precision. It is ideal for higher accuracy but requires more computing resources and longer computation times, which may not be practical for all applications.

AMX supports lower precision INT8 and BF16 data types:

  • INT8 for Inferencing. AMX provides enhanced support for INT8 operations, which are critical for inferencing workloads. The INT8 data type sacrifices precision to process multiple operations in each compute cycle. It requires fewer computing resources, which makes it ideal for deployment in real-time applications and matrix multiplication tasks where speed and efficiency take precedence.
  • Bfloat16 (BF16) Support. AMX provides native support for the BF16 floating-point format. This data type occupies 16 bits of computer memory. BF16 delivers intermediate accuracy for most AI training workloads and can also deliver higher accuracy for inferencing if needed. It is particularly useful in ML because it allows models to be trained with almost the same accuracy as when using 32-bit floating-point numbers but at a fraction of the computational cost.
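
The trade-off between precision and cost can be illustrated with a small sketch of both conversions. It assumes BF16 is produced by keeping the upper 16 bits of an FP32 value with round-to-nearest-even, and it uses simple symmetric per-tensor scaling for INT8. Frameworks use more elaborate calibration schemes, NaN handling is omitted, and all function names here are illustrative.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round a finite FP32 value to BF16 and return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def quantize_int8(values):
    """Symmetric per-tensor quantization; returns (int8 values, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def dequantize_int8(quantized, scale):
    """Map INT8 values back to approximate real values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))
    print(quantize_int8([1.0, -1.0, 0.0]))
```

BF16 keeps the full 8-bit exponent of FP32 and shortens only the mantissa, which is why it preserves range while losing fine detail. INT8 keeps neither and depends on a well-chosen scale.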

Note: Open-source frameworks such as TensorFlow and PyTorch are optimized for INT8 and BF16 operations by default.

The tiled architecture and native support for the BF16 data type give Intel CPUs with integrated AMX acceleration a significant performance advantage over their predecessors.

The table below compares operations per cycle by data type on 3rd Gen Intel Xeon (Intel AVX-512 VNNI) and 4th Gen Intel Xeon (Intel AMX) processors.

| Metric | 3rd Gen Intel Xeon (Intel AVX-512 VNNI) | 4th Gen Intel Xeon (Intel AMX) | Speed Increase |
| --- | --- | --- | --- |
| Operations per Cycle (Data Type) | 64 (FP32) | 1024 (BF16) | AMX is 16x faster |
| Operations per Cycle (Data Type) | 256 (INT8) | 2048 (INT8) | AMX is 8x faster |
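
The speed increase figures follow directly from the operations-per-cycle values. A short sketch of the arithmetic, using only the numbers in the table:

```python
# Operations per cycle from the table: (3rd Gen AVX-512 VNNI, 4th Gen AMX).
OPS_PER_CYCLE = {
    "FP32 vs. BF16": (64, 1024),
    "INT8 vs. INT8": (256, 2048),
}


def speedup(old_ops: int, new_ops: int) -> float:
    """Ratio of new to old operations per cycle."""
    return new_ops / old_ops


if __name__ == "__main__":
    for label, (old, new) in OPS_PER_CYCLE.items():
        print(f"{label}: {speedup(old, new):.0f}x")
```

Note that the first row compares different data types (FP32 on the older CPU, BF16 on the newer one), so part of the 16x gain comes from moving to a 16-bit format.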

Note: 4th Gen Intel Xeon processors can transition between Intel AMX and Intel AVX-512, selecting the most efficient instruction set based on workload requirements.

AMX Performance

Relative Throughput

The semiconductor industry has consistently doubled computing power roughly every two years.

The following table shows that throughput gains have outpaced the growth in core count across Xeon processor generations. Although the number of cores has only doubled since the first Intel Xeon Scalable processor, the relative throughput has increased 11 times.

Performance test parameters include:

  • Reference Point: 1st Gen Intel Xeon Scalable processor (Intel DL Boost Instruction Set).
  • Deep Learning Model: ResNet-50 v1.5 (Batch Inferencing).
  • Framework: TensorFlow.
  • Data Type: INT8.

| Processor Generation | Instruction Set Extension | Cores | Relative Throughput |
| --- | --- | --- | --- |
| 1st Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | Baseline |
| 2nd Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | 2x |
| 3rd Gen Intel Xeon Scalable CPU | Intel DL Boost | 40 cores | 4x |
| 4th Gen Intel Xeon Scalable CPU | Intel AMX | 56 cores | 11x |
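
One way to separate the effect of the instruction set from the effect of extra cores is to divide the throughput gain by the core-count gain. The sketch below does this with the table's figures. The per-core number is a derived estimate, not a value reported in the benchmark, and it ignores clock speed and memory differences between generations.

```python
# (generation, cores, throughput relative to 1st Gen) from the table.
GENERATIONS = [
    ("1st Gen", 28, 1.0),
    ("2nd Gen", 28, 2.0),
    ("3rd Gen", 40, 4.0),
    ("4th Gen", 56, 11.0),
]


def per_core_gain(cores: int, throughput: float,
                  base_cores: int = 28, base_throughput: float = 1.0) -> float:
    """Throughput gain divided by core-count gain, relative to the baseline."""
    return (throughput / base_throughput) / (cores / base_cores)


if __name__ == "__main__":
    for name, cores, throughput in GENERATIONS:
        print(f"{name}: {per_core_gain(cores, throughput):.1f}x per core")
```

For the 4th Gen row, doubling the cores accounts for 2x of the 11x gain, which leaves a 5.5x improvement per core.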

AI Training Performance Boost

This table illustrates the acceleration in PyTorch training performance when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).

Performance test details are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework Used: PyTorch.

| Task/Model | Category | Performance Increase |
| --- | --- | --- |
| ResNet-50 v1.5 | Image classification | 3x |
| BERT-large | Natural Language Processing (NLP) | 4x |
| DLRM | Recommendation system | 4x |
| Mask R-CNN | Image segmentation | 4.5x |
| SSD-ResNet-34 | Object detection | 5.4x |
| RNN-T | Speech recognition | 10.3x |
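
Because speedups are ratios, a geometric mean is a common way to summarize them as a single figure. The sketch below applies it to the training results in the table. The summary is an illustration derived from the listed values and is not part of the published benchmark.

```python
from statistics import geometric_mean

# Training speedups from the table (4th Gen AMX BF16 vs. 3rd Gen FP32).
TRAINING_SPEEDUPS = {
    "ResNet-50 v1.5": 3.0,
    "BERT-large": 4.0,
    "DLRM": 4.0,
    "Mask R-CNN": 4.5,
    "SSD-ResNet-34": 5.4,
    "RNN-T": 10.3,
}


def geomean_speedup(speedups: dict) -> float:
    """Geometric mean of the per-model speedup ratios."""
    return geometric_mean(list(speedups.values()))


if __name__ == "__main__":
    print(f"Geometric mean speedup: {geomean_speedup(TRAINING_SPEEDUPS):.2f}x")
```

For these six models the geometric mean comes out at roughly 4.8x, which is lower than the arithmetic mean because it is less sensitive to the single large RNN-T result.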

Real-Time Inference Performance Boost

The table below illustrates the generation-to-generation performance increase in PyTorch real-time inference when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).

Performance test parameters are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework: PyTorch.

| Task/Model | Category | Performance Increase |
| --- | --- | --- |
| ResNeXt101 32x16d | Image classification | 5.70x |
| ResNet-50 v1.5 | Image classification | 6.19x |
| BERT-large | Natural Language Processing (NLP) | 6.25x |
| Mask R-CNN | Image segmentation | 6.24x |
| RNN-T | Speech recognition | 8.61x |
| SSD-ResNet-34 | Object detection | 10.01x |

Intel AMX on phoenixNAP BMC Platform

Owning and maintaining AI infrastructure is not a viable option for many companies due to the high costs and lack of flexibility.

Transitioning to an AI-oriented environment with OpEx-based access to infrastructure has substantial financial and strategic benefits.

Note: Cloud services and software as a service (SaaS) are prime examples of OpEx-modeled access.

phoenixNAP’s Bare Metal Cloud (BMC) is an OpEx-modeled platform that allows quick provisioning and scaling of dedicated servers via API, CLI, or Web UI.

BMC offers pre-configured server instances powered by 4th Gen Intel Xeon Scalable CPUs with built-in AMX accelerators. By combining the capabilities of Intel AMX and the BMC platform, users can:

  • Deploy enterprise-ready environments optimized for extracting value out of large datasets in minutes.
  • Leverage tools like Terraform and Ansible to automate deployments and scale AI infrastructure as needed.
  • Accelerate matrix operations to speed up AI training and inference.
  • Reduce time-to-insight with one-click access to CPUs and workload acceleration engines.

4th Gen Intel Xeon Scalable processors on Bare Metal Cloud deliver immediate value for the following use cases:

| Application Category | Use Cases |
| --- | --- |
| Artificial Intelligence | Recommendation systems, natural language processing, image recognition, object detection, machine learning applications, video analytics |
| Data Analytics | Relational database management systems, in-memory databases, big data analytics, data warehousing |
| Networking | Hardware cryptography, packet processing, content delivery network, security gateway |
| Storage Deployment | Distributed and virtual storage |
| High-Performance Computing (HPC) | Computational fluid dynamics, molecular dynamics, weather simulation, heavy-duty AI training and inference, drug discovery |
| Data Security | Confidential computing, regulatory or compliance workloads, federated learning systems |
| Ecommerce | Reduce transaction time, manage peak demands, UX and behavior analysis, automated customer support |


AI-driven solutions will become the norm for most end-users. As the cost of performing matrix computations on large datasets continues to rise, companies must explore solutions that will keep them competitive without breaking the bank.

Use the phoenixNAP BMC platform and Intel AMX to deploy and manage a flexible and scalable AI-focused infrastructure. This combination not only supports varied matrix sizes today but is also adaptable to potentially new matrix types down the line.
