NVIDIA HGX and EGX Custom Training Systems

Bespoke Server Solutions for AI

High performance training solutions

Using a custom training system for deep learning and AI workloads gives you the ultimate control: you can choose the ideal specification for your projects and build in flexibility as required. A system can be configured so that no resources are underutilised, or a larger chassis can be partially populated at purchase, leaving space for scaling at a later date. The choice is yours.

Every 3XS custom training system is almost infinitely configurable, from accelerator cards to CPU, memory to storage, right through to connectivity, power, cooling and software - all from the market-leading component brands listed below.

NVIDIA GPU Accelerators

The NVIDIA datacentre family of GPU accelerator cards represents the cutting edge in performance for all AI workloads, offering unprecedented compute density, performance and flexibility. The high-end NVIDIA H100 and A100 accelerators are available in either standard PCIe or high-density SXM formats featuring high-bandwidth memory (HBM), with the mid-range A30 and L40 accelerators available as PCIe cards. All of these NVIDIA datacentre GPUs can be installed in a wide variety of both air and liquid cooled server chassis.

NVIDIA HGX Server: 8x H100 SXM5, 4x H100 SXM5, 8x A100 SXM4, 4x A100 SXM4

NVIDIA EGX Server: H100 PCIe 5, A100 PCIe 4, L40/L40S PCIe 4, A40 PCIe 4, A30 PCIe 4, A16 PCIe 4

[Charts: AI Training and AI Inference performance]

The Transformer Engine uses a combination of software and specially designed hardware to accelerate training and inferencing of transformer models, such as the BERT and GPT-3 language models. It intelligently manages and dynamically switches between FP8 and FP16 calculations, automatically handling re-casting and scaling between the two levels of precision, speeding up large language models compared to the previous generation Ampere architecture.

The H100 Tensor Core GPUs in the DGX H100 feature fourth-generation NVLink which provides 900GB/s bidirectional bandwidth between GPUs, over 7x the bandwidth of PCIe 5.0.
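As a rough sanity check of the "over 7x" figure, the sketch below compares the quoted 900GB/s NVLink bandwidth with the nominal peak of a PCIe 5.0 link. It assumes a x16 link running at 32 GT/s per lane with 128b/130b encoding; these are nominal specification values rather than measured throughput, and the function names are illustrative only.

```python
# Nominal peak figures only; real-world throughput will be lower.
NVLINK4_BIDIRECTIONAL_GBPS = 900  # GB/s per GPU, as quoted in the text

# Assumptions: PCIe 5.0 x16 link, 32 GT/s per lane, 128b/130b line encoding.
PCIE5_GT_PER_LANE = 32
PCIE5_LANES = 16
ENCODING_EFFICIENCY = 128 / 130


def pcie5_x16_bidirectional_gbps() -> float:
    """Return the nominal bidirectional bandwidth of a PCIe 5.0 x16 link in GB/s."""
    # Gigatransfers per second across all lanes, less encoding overhead,
    # divided by 8 to convert bits to bytes.
    per_direction = PCIE5_GT_PER_LANE * PCIE5_LANES * ENCODING_EFFICIENCY / 8
    return 2 * per_direction


def nvlink_to_pcie_ratio() -> float:
    """Return how many times faster the quoted NVLink figure is than PCIe 5.0 x16."""
    return NVLINK4_BIDIRECTIONAL_GBPS / pcie5_x16_bidirectional_gbps()


print(f"PCIe 5.0 x16: {pcie5_x16_bidirectional_gbps():.0f} GB/s bidirectional")
print(f"NVLink advantage: {nvlink_to_pcie_ratio():.1f}x")
```

Under these assumptions PCIe 5.0 x16 comes out at roughly 126GB/s bidirectional, which is consistent with the claim of just over 7x.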

Previous-generation GPU accelerators did not support confidential computing, with data only being encrypted when at rest in storage or in transit across the LAN. Hopper is the first GPU architecture to include support for confidential computing, securing data from unauthorised access as it passes through the DGX H100. NVIDIA confidential computing provides hardware-based isolation for multiple instances sharing an H100 GPU using MIG, for single-user H100 GPUs and between multiple H100 GPUs.

Alternative Accelerators

In addition to enterprise-class GPUs there are numerous other acceleration devices that can aid deep learning and AI training workloads. These cards may be for specific tasks, allow programmability or meet a tighter budget requirement.

NVIDIA RTX Accelerators

NVIDIA offers a very wide range of RTX GPU accelerators capable of giving great performance for single, double or half precision workloads. Costing less than the Ampere enterprise-class cards, they can deliver a cost-optimised training solution when absolute top-of-the-range performance isn’t required.

Xilinx Accelerators

The Xilinx Alveo range of accelerator cards delivers compute, networking and storage acceleration in an efficient small form factor, and is available with 100GbE networking, PCIe 4 and HBM2 memory. Designed to deploy in any server, the cards offer a flexible solution for increasing performance across a wide range of datacentre workloads.

Intel Accelerators

Intel FPGA-based accelerator cards provide hardware programmability on production qualified platforms, so data scientists can design and deploy models quickly, while allowing flexibility in a rapidly changing environment. They come complete with a robust collection of software, firmware and tools designed to make it easier to develop and deploy FPGA accelerators for workload optimisation in datacentre servers.

Micron Accelerators

Micron's Deep Learning Accelerator platform is a solution comprising a modular FPGA-based architecture, powered by Micron memory, running FWDNXT’s high performance engine tuned for a variety of neural networks. Featuring broad deep learning framework support combined with an easy to use toolset and software programmability, these accelerators can run multiple neural networks simultaneously.

Host CPUs

You also have a wide choice of host CPUs: AMD EPYC, Intel Xeon and Arm-based Ampere. Our system architects can recommend which of these is best for your requirements and whether one or two CPUs are required.

System Memory

Depending on the type of workload, a large amount of system memory may be more or less important than GPU memory, but with a custom training server memory capacity can be tailored to your needs. Additionally, a bespoke server allows for simple future memory expansion if required. NVIDIA recommends at least twice as much system RAM as GPU RAM, so high-end systems may scale into the TBs. Intel Xeon based servers can also make use of a combination of traditional DIMMs and Intel Optane Persistent Memory DIMMs, allowing a flexible solution addressing performance, fast caching and extra storage capacity.
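The 2:1 guideline above is easy to turn into a quick sizing calculation. The following is a minimal sketch only: the function name is hypothetical, and the 8x 80GB configuration is an example value rather than a recommendation for any particular system.

```python
def recommended_system_ram_gb(gpu_count: int, gpu_memory_gb: float, ratio: float = 2.0) -> float:
    """Return the minimum system RAM in GB for a given GPU configuration.

    ratio is the system-RAM-to-GPU-RAM multiplier; 2.0 reflects the
    "at least double" guideline quoted in the text.
    """
    if gpu_count < 0 or gpu_memory_gb < 0 or ratio <= 0:
        raise ValueError("gpu_count and gpu_memory_gb must be non-negative, ratio positive")
    total_gpu_memory_gb = gpu_count * gpu_memory_gb
    return total_gpu_memory_gb * ratio


# Example (assumed configuration): eight GPUs with 80GB of memory each.
print(recommended_system_ram_gb(8, 80))  # 640GB of GPU memory -> 1280.0GB system RAM
```

In this example the system would need at least 1.28TB of RAM, which illustrates how quickly high-end configurations scale into the TBs.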

Internal Storage

Storage within a training server is also a very personal choice. For financial organisations, where even a large volume of files is still relatively small, a few TB of SSD capacity may be enough for datasets. Alternatively, image-based datasets may be vast, so there is never any real option of using internal storage and a separate fast flash storage array is the way to go. If this is the case, internal SSD cost can be minimised and the remaining budget used elsewhere. Flexibility and performance can also be gained by choosing M.2 formats, NVMe connectivity or Optane options as required.

Networking

Depending on whether connectivity is needed to a wider network or an external flash storage array, networking interfaces and speeds can be customised to suit. Ethernet or InfiniBand options are available at speeds of up to 400Gb/s, both providing powerful CPU offloading to maximise performance and minimise latency.

Additionally, advanced NVIDIA BlueField Data Processing Unit (DPU) NICs can be specified where the highest performance is required, as these cards not only include networking functionality but also accelerate software management, security and storage services by offloading these tasks from the CPU.

Chassis

From 2U compact servers up to 4U expandable systems, chassis choice depends on whether space saving is the key factor or scalability is required. As a custom server can be partially populated, a larger chassis can be chosen with a view to expandability in the future. Additionally, both air cooled and liquid cooled server systems are available.

GPU Virtualisation

Run:ai

It may be that over time, rather than a single bespoke training server, you end up with several systems as technologies advance and workloads increase. Although servers with different CPUs, GPUs and storage will communicate when using common networking interfaces, it may be that you aren’t getting the maximum utilisation from the various GPUs you have. In this case, Run:ai Atlas GPU virtualisation software may be able to help.

Run:ai Atlas combines differing GPU resources into a virtual pool and enables workloads to be scheduled by user or project across the available resource. By pooling resources and applying an advanced scheduling mechanism to data science workflows, Run:ai greatly increases the ability to fully utilise all available resources, essentially creating unlimited compute. Data scientists can increase the number of experiments they run, speed time to results and ultimately meet the business goals of their AI initiatives.
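To give a feel for what pooling and scheduling mean in practice, here is a deliberately simple first-fit sketch. It is not Run:ai's API or scheduling algorithm; the server names, project names and GPU counts are hypothetical, and each job is assumed to fit on a single server.

```python
from typing import Dict, List, Optional, Tuple


def schedule_first_fit(
    free_gpus: Dict[str, int],
    jobs: List[Tuple[str, int]],
) -> List[Tuple[str, Optional[str]]]:
    """Place each (project, gpus_needed) job on the first server with enough free GPUs.

    Returns (project, server) pairs; server is None when the job has to wait.
    """
    remaining = dict(free_gpus)  # copy so the caller's pool is left unchanged
    placements: List[Tuple[str, Optional[str]]] = []
    for project, gpus_needed in jobs:
        chosen = None
        for server, free in remaining.items():
            if free >= gpus_needed:
                chosen = server
                remaining[server] = free - gpus_needed
                break
        placements.append((project, chosen))
    return placements


# Hypothetical pool: two differently sized servers treated as one resource.
pool = {"server-a": 4, "server-b": 8}
jobs = [("vision", 6), ("nlp", 4), ("tabular", 2), ("speech", 1)]
placements = schedule_first_fit(pool, jobs)
for project, server in placements:
    print(project, "->", server or "queued")
```

Here all twelve GPUs across both servers end up allocated and the final job waits in the queue, whereas a team tied to a single server could leave GPUs idle elsewhere. A production scheduler would also handle priorities, quotas and pre-emption, which this sketch omits.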

Our intuitive online configurators provide complete peace of mind when building your training server. Alternatively, speak directly to one of our friendly system architects.