POD Reference Architectures for AI & HPC at scale

 

Scan AI, as a leading NVIDIA Elite Solution Provider, can deliver a variety of enterprise infrastructure architectures, known as PODs, with either NVIDIA EGX, HGX or DGX servers at their centre. These reference architectures combine industry-leading NVIDIA GPU compute with AI-optimised flash storage from a variety of leading manufacturers and low-latency NVIDIA Networking solutions, in order to provide a unified underlying infrastructure on which to accelerate AI training whilst eliminating the design challenges, lengthy deployment cycles and management complexity traditionally associated with scaling AI infrastructure.

Although there is a vast variety of ways in which each POD infrastructure solution can be configured, there are four main architecture families - a Scan POD, based on NVIDIA EGX and HGX server configurations; an NVIDIA DGX BasePOD made up of between two and 40 DGX H100 appliances; an NVIDIA DGX SuperPOD consisting of up to 140 DGX H100 appliances centrally controlled with NVIDIA Unified Fabric Manager; and the NVIDIA DGX GH200, designed exclusively for LLMs and generative AI. All these infrastructures then connect to a choice of enterprise storage options linked together by NVIDIA Networking switches. The sections below explore each solution further.

The Scan POD range of reference architectures is based around a flexible infrastructure kit list in order to deliver cost-effective yet cutting-edge AI training for any organisation. A Scan POD infrastructure consists of NVIDIA EGX or HGX servers - starting at just two nodes - connected via NVIDIA Networking switches to a choice of NVMe storage solutions. This can then be complemented by Run:ai Atlas software and supported by the NVIDIA GPU-optimised software stack available from the NVIDIA GPU Cloud (NGC).

Scan POD Servers

At the heart of a Scan POD architecture is either an NVIDIA-certified EGX or HGX GPU-accelerated server built by our in-house experts at 3XS Systems.

 

3XS EGX Servers


Up to 8x NVIDIA professional Ampere or Ada Lovelace PCIe GPUs
2x Intel 4th gen Xeon or AMD 4th gen EPYC CPUs with PCIe 5.0 support
Up to 2TB of DDR5 system memory
NVIDIA ConnectX Ethernet NICs / Infiniband HCAs
Up to 6x NVMe drives

 

3XS HGX Servers


Up to 8x NVIDIA A100 SXM4 GPUs
2x Intel 3rd gen Xeon or AMD 3rd gen EPYC CPUs with PCIe 4.0 support
Up to 2TB of DDR4 system memory
NVIDIA ConnectX Ethernet NICs / Infiniband HCAs
Up to 6x NVMe drives

Scan POD Management

The EGX and HGX systems are managed using Run:ai Atlas software to enable not only scheduling and orchestration of workloads, but also virtualisation of the POD's GPU resources. Run:ai Atlas automates resource management and consumption so that users can easily access GPU fractions, multiple GPUs or clusters of GPUs for workloads of every size and stage of the AI lifecycle. This ensures that all available compute can be utilised and GPUs never have to sit idle. Whenever extra compute resources are available, data scientists can exceed their assigned quota, speeding time to results and ultimately meeting business goals.

Centralise AI

Pool GPU compute resources so IT gains visibility and control over resource prioritisation and allocation

Maximise Utilisation

Automatic and dynamic provisioning of GPUs breaks the limits of static allocation to get the most out of existing resources

Deploy to Production

An end-to-end solution for the entire AI lifecycle, from developing to training and inferencing, all delivered in a single platform
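To illustrate how fractional allocation appears to a user, the sketch below uses the Kubernetes Python client to request half a GPU for a training pod scheduled by Run:ai. It is a minimal sketch only - the annotation key, scheduler name and project namespace shown are assumptions that should be checked against the Run:ai documentation for the deployed version.

```python
# Minimal sketch: requesting a fraction of a GPU through the Run:ai scheduler.
# Assumptions (verify against your Run:ai release): the scheduler is exposed as
# "runai-scheduler", fractions are requested with the "gpu-fraction" annotation,
# and projects map to namespaces named "runai-<project>".
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="resnet50-frac",
        namespace="runai-team-vision",            # assumed project namespace
        annotations={"gpu-fraction": "0.5"},      # assumed Run:ai fraction key
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",         # assumed scheduler name
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:23.10-py3",
                command=["python", "train.py"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="runai-team-vision", body=pod)
```

The same request could equally be made through the Run:ai CLI or web interface; the key point is that a workload asking for 0.5 of a GPU can be packed alongside others on the same physical card.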

Scan POD Networking

Scan POD architectures can be configured with a choice of network switches, each fulfilling a specific function within the design depending on whether InfiniBand, Ethernet or both are being utilised.

 

NVIDIA QM9700 Switch


NVIDIA QM9700 switches with NDR InfiniBand connectivity link to ConnectX-7 adapters. Each server system has dual connections to each QM9700 switch, providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA QM8700 Switch


NVIDIA QM8700 switches with HDR InfiniBand connectivity link to ConnectX-6 adapters. Each server system has dual connections to each QM8700 switch providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA SN4600 Switch


NVIDIA SN4600 switches offer 64 connections per switch to provide redundant connectivity for in-band management, at speeds of up to 200GbE. These switches are also used for storage appliances connected over Ethernet.

 

NVIDIA SN2201 Switch


NVIDIA SN2201 switches offer 48 ports to provide connectivity for out-of-band management. Out-of-band management provides consolidated management connectivity for all components in the Scan POD.

The Scan POD topology is flexible when it comes to configuration and scalability. Server nodes and storage appliances can be simply added to scale the POD architecture as demand requires.

Scan POD Storage

For the storage element of the Scan POD architecture, we have teamed up with PEAK:AIO to provide AI data servers that deliver the fastest AI-optimised data management around. PEAK:AIO’s success stems from understanding the real-life values of AI projects - making ambitious AI goals significantly more achievable within constrained budgets while delivering the performance of a parallel filesystem with the simplicity of a NAS, all within a single 2U server. Furthermore, PEAK:AIO starts as small as your project needs and scales as you grow, removing the traditional requirement to over-invest in storage at the outset. Additionally, a longstanding complication within high-performance storage has been the need for proprietary drivers, which can cause significant disruption within typical AI projects when the OS or GPU tools are updated. PEAK:AIO is fully compatible with modern Linux kernels, requiring no proprietary drivers.
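To make the 'no proprietary drivers' point concrete, the sketch below mounts an AI data server export using only the in-kernel Linux NFS client. The server name, export path and use of NFS over RDMA are illustrative assumptions - confirm the protocols and mount options for your particular configuration with PEAK:AIO.

```python
# Minimal sketch: mounting an AI data server export with only the stock Linux
# NFS client (no vendor kernel module). Hostname, export path and the RDMA
# transport are hypothetical; swap "rdma,port=20049" for "tcp" if RDMA is not
# configured on the storage network.
import subprocess

SERVER_EXPORT = "peakaio-01:/datasets"   # hypothetical server and export
MOUNT_POINT = "/mnt/datasets"

subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
subprocess.run(
    ["mount", "-t", "nfs", "-o", "vers=3,rdma,port=20049", SERVER_EXPORT, MOUNT_POINT],
    check=True,
)
print(f"{SERVER_EXPORT} mounted at {MOUNT_POINT}")
```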


Secure Hosting

Accommodating a Scan POD architecture may not be possible on every organisation's premises, so Scan AI has teamed up with a number of secure hosting partners with UK- and European-based datacentres. This means you can be safe in the knowledge that the location housing your infrastructure is ideally suited to running a Scan POD and accelerating your AI projects.

The DGX BasePOD is an NVIDIA reference architecture based around a specific infrastructure kit list in order to deliver cutting-edge AI training for the enterprise. A BasePOD infrastructure consists of NVIDIA DGX H100 appliances - ranging from two to 40 nodes - connected via NVIDIA Networking switches to a choice of enterprise storage solutions. This is then complemented by Base Command management software and the NVIDIA AI Enterprise software stack to form a complete solution.

BasePOD Servers

At the heart of a BasePOD architecture is the NVIDIA DGX H100 GPU-accelerated server appliance, which provides exceptional compute performance powered by eight H100 GPU accelerators.

 

NVIDIA DGX H100


8x NVIDIA H100 GPUs
80GB memory per GPU
4x NVIDIA NVSwitch chips
2x Intel 4th gen Xeon 56-core CPUs with PCIe 5.0 support
2TB of DDR5 system memory
4x OSFP ports serving 8x single-port NVIDIA ConnectX-7 NDR InfiniBand HCAs
3x dual-port NVIDIA ConnectX-7 NDR InfiniBand HCAs
2x 1.92TB M.2 NVMe drives for DGX OS
8x 3.84TB U.2 NVMe drives for storage/cache
11.3kW max power

BasePOD Management

The DGX systems are managed and controlled by NVIDIA Base Command software - using it, every organisation can tap the full potential of its DGX BasePOD investment with a platform that includes enterprise-grade orchestration and cluster management, libraries that accelerate compute, storage and networking infrastructure, and system software optimised for running AI workloads.

Trusted by NVIDIA

The same software that supports NVIDIA’s thousands of in-house developers, researchers, and AI practitioners underpins every BasePOD.

Scheduling and Orchestration

Base Command provides Kubernetes, Slurm, and Jupyter Notebook environments for DGX systems, delivering an easy-to-use scheduling and orchestration solution.

Comprehensive Cluster Management

Full-featured cluster management automates the end-to-end administration of DGX systems - whether it’s initial provisioning, operating system and firmware updates, or real-time monitoring.

Optimised for DGX

The Base Command software stack is optimised for DGX BasePOD environments that scale from two nodes to 40 node clusters, ensuring maximum performance and productivity.

Enhanced Management with Run:Ai

Run:ai Atlas integrates with NVIDIA Base Command, combining GPU resources into a virtual pool and enabling workloads to be scheduled by user or project across the available resources. By pooling resources and applying an advanced scheduling mechanism to data science workflows, Run:ai greatly increases the ability to fully utilise all available resources, essentially creating unlimited compute. Data scientists can increase the number of experiments they run, speed time to results and ultimately meet the business goals of their AI initiatives.
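As a simple illustration of project-based scheduling, the sketch below submits a multi-GPU job through the Run:ai CLI, wrapped in Python for consistency with the other examples. The project name and container image are placeholders, and the exact flag spellings should be verified against the CLI documentation for the installed Run:ai version.

```python
# Minimal sketch: submitting a 4-GPU job against a project's quota via the
# Run:ai CLI. The project name and container image are hypothetical, and flag
# spellings should be checked against your Run:ai CLI version.
import subprocess

subprocess.run(
    [
        "runai", "submit", "bert-finetune",
        "--project", "nlp-research",                    # hypothetical project
        "--image", "nvcr.io/nvidia/pytorch:23.10-py3",  # NGC container
        "--gpu", "4",                                   # drawn from the pooled quota
    ],
    check=True,
)
```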

BasePOD Networking

DGX BasePODs can be configured with four types of network switches, each having a specific function within the design - there is a choice of InfiniBand switches depending on the DGX appliance used, supported by Ethernet switches for management and storage connectivity.

 

NVIDIA QM9700 Switch


NVIDIA QM9700 switches with NDR InfiniBand connectivity link to ConnectX-7 adapters. Each server system has dual connections to each QM9700 switch, providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA QM8700 Switch


NVIDIA QM8700 switches with HDR InfiniBand connectivity link to ConnectX-6 adapters. Each server system has dual connections to each QM8700 switch providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA SN4600 Switch


NVIDIA SN4600 switches offer 64 connections per switch to provide redundant connectivity for in-band management, at speeds of up to 200GbE. These switches are also used for storage appliances connected over Ethernet.

 

NVIDIA SN2201 Switch


NVIDIA SN2201 switches offer 48 ports to provide connectivity for out-of-band management. Out-of-band management provides consolidated management connectivity for all components in the BasePOD.

The BasePOD topology is made up of three networks - an InfiniBand-based compute network, an Ethernet fabric for system management and storage, and an out-of-band Ethernet network. Also included in the reference architecture are five CPU-only servers for system management. Two of these systems are used as the head nodes for Base Command Manager software. The three additional systems provide the platform to house specific services for the deployment - these could be login nodes for a Slurm-based deployment or Kubernetes master nodes supporting an MLOps-based partner solution. The below diagram depicts a typical BasePOD using the DGX H100 and QM9700 switches as an example - for all options, topology diagrams and interconnect details please see the full NVIDIA DGX BasePOD Reference Architecture document.
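For a Slurm-based deployment, training jobs are typically submitted from one of these login nodes. The sketch below writes and submits a minimal batch script for a two-node, 16-GPU run - the partition name and training command are placeholders that will differ from site to site.

```python
# Minimal sketch: submitting a 2-node, 16-GPU training job from a Slurm login
# node. The partition name and training command are hypothetical placeholders.
import subprocess
from pathlib import Path

batch_script = """\
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --partition=dgx            # hypothetical partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8        # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

srun python train.py --config config.yaml
"""

Path("pretrain.sbatch").write_text(batch_script)
subprocess.run(["sbatch", "pretrain.sbatch"], check=True)
```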

BasePOD Storage

For the storage element of the BasePOD architecture, there are several options that Scan AI can provision. All are based on all-flash NVMe SSD hardware and software defined management platforms tried, tested and approved by NVIDIA.

 

NetApp BasePOD


Configured with NetApp AFF-A series storage appliances and the ONTAP AI platform.

 

DDN BasePOD


Configured with DDN A3I series storage appliances and software platform.

 

Dell-EMC BasePOD


Configured with Dell PowerScale or Isilon storage appliances and the OneFS platform.

Secure Hosting

Accommodating a DGX BasePOD architecture may not be possible on every organisation's premises, so Scan AI has teamed up with a number of secure hosting partners with UK- and European-based datacentres. This means you can be safe in the knowledge that the location housing your infrastructure is ideally suited to running a BasePOD and accelerating your AI projects.

The DGX SuperPOD is an NVIDIA reference architecture based around a specific infrastructure kit list, designed to deliver hyperscale AI training environments capable of solving the world's most challenging computational problems. A SuperPOD infrastructure consists of NVIDIA DGX H100 appliances - ranging from 20 to 140 nodes - connected via NVIDIA Networking switches to a choice of enterprise storage solutions. This is then complemented by Unified Fabric Manager software and the NVIDIA AI Enterprise software stack to form a complete solution.

SuperPOD Servers

 

NVIDIA DGX H100


8x NVIDIA H100 GPUs
80GB memory per GPU
4x NVIDIA NVSwitch chips
2x Intel 4th gen Xeon 56-core CPUs with PCIe 5.0 support
2TB of DDR5 system memory
4x OSFP ports serving 8x single-port NVIDIA ConnectX-7 NDR InfiniBand HCAs
3x dual-port NVIDIA ConnectX-7 NDR InfiniBand HCAs
2x 1.92TB M.2 NVMe drives for DGX OS
8x 3.84TB U.2 NVMe drives for storage/cache
11.3kW max power

SuperPOD Management

The DGX systems are managed and controlled by NVIDIA Base Command software, however the whole SuperPOD is overseen by NVIDIA Unified Fabric Manager (UFM) - this revolutionises datacentre networking management by combining enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to support scale-out InfiniBand clusters and is available in three versions.

UFM Telemetry: Real-Time Monitoring

The UFM Telemetry platform provides network validation tools to monitor network performance and conditions, capturing and streaming rich real-time network telemetry information, application workload usage and system configuration to an on-premises or cloud-based database for further analysis.

• Switches, adapters and cables telemetry
• System validation
• Network performance tests
• Streaming of telemetry information to an on-premises or cloud-based database

UFM Enterprise: Fabric Visibility and Control

The UFM Enterprise platform performs automated network discovery and provisioning, traffic monitoring and congestion discovery. It also enables job schedule provisioning and integrates with industry-leading job schedulers and cloud and cluster managers, including Slurm and Platform Load Sharing Facility.

• Includes UFM Telemetry features
• Secure cable management
• Congestion tracking
• Problem identification and resolution
• Advanced reporting

UFM Cyber-AI: Cyber Intelligence and Analytics

The UFM Cyber-AI platform enhances the benefits of UFM Telemetry and UFM Enterprise, providing preventive maintenance and cybersecurity for lowering supercomputing OPEX.

• Includes UFM Telemetry and UFM Enterprise features
• Detects performance degradations
• Detects abnormal cluster behaviour
• Uses AI to make correlations
• Alerts when preventive maintenance is required
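As an indication of how the fabric data described above can be consumed programmatically, the sketch below polls the UFM REST API for the switch and port inventory. The endpoint paths and authentication scheme are assumptions based on recent UFM releases - consult the UFM REST API documentation for the deployed version.

```python
# Minimal sketch: pulling fabric inventory from UFM's REST API.
# Assumptions: UFM is reachable at UFM_HOST, basic authentication is enabled,
# and the /ufmRest/resources/* endpoints are exposed as in recent UFM releases.
import requests

UFM_HOST = "https://ufm.example.local"      # hypothetical hostname
AUTH = ("admin", "password")                # replace with real credentials

session = requests.Session()
session.auth = AUTH
session.verify = False                      # UFM appliances often use self-signed certs

systems = session.get(f"{UFM_HOST}/ufmRest/resources/systems").json()
ports = session.get(f"{UFM_HOST}/ufmRest/resources/ports").json()

print(f"{len(systems)} systems and {len(ports)} ports discovered in the fabric")
```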

Enhanced Management with Run:Ai

Run:ai Atlas integrates with NVIDIA Base Command, combining GPU resources into a virtual pool and enabling workloads to be scheduled by user or project across the available resources. By pooling resources and applying an advanced scheduling mechanism to data science workflows, Run:ai greatly increases the ability to fully utilise all available resources, essentially creating unlimited compute. Data scientists can increase the number of experiments they run, speed time to results and ultimately meet the business goals of their AI initiatives.

SuperPOD Networking

DGX SuperPOD architectures can be configured with four types of network switches, each having a specific function within the design - there is a choice of InfiniBand switches depending on the DGX appliance used, supported by Ethernet switches for management and storage connectivity.

 

NVIDIA QM9700 Switch


NVIDIA QM9700 switches with NDR InfiniBand connectivity link to ConnectX-7 adapters. Each server system has dual connections to each QM9700 switch, providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA QM8700 Switch


NVIDIA QM8700 switches with HDR InfiniBand connectivity link to ConnectX-6 adapters. Each server system has dual connections to each QM8700 switch providing multiple high-bandwidth, low-latency paths between the systems.

 

NVIDIA SN4600 Switch


NVIDIA SN4600 switches offer 64 connections per switch to provide redundant connectivity for in-band management, at speeds of up to 200GbE. These switches are also used for storage appliances connected over Ethernet.

 

NVIDIA SN2201 Switch


NVIDIA SN2201 switches offer 48 ports to provide connectivity for out-of-band management. Out-of-band management provides consolidated management connectivity for all components in the SuperPOD.

The SuperPOD topology is made up of scalable units (SUs), where each SU consists of 20 DGX systems. This size optimises both performance and cost while still minimising system bottlenecks so that complex workloads are well supported, with a single SU capable of delivering 48 PFLOPS of performance. The DGX systems have eight HDR or NDR InfiniBand host channel adapters (HCAs) for compute traffic, with each pair of GPUs having a pair of associated HCAs. For the most efficient network there are eight network planes, one for each HCA of the DGX system, connecting to eight leaf switches, one per plane. The planes are interconnected at the second level of the network through spine switches. Each SU has full bisection bandwidth to ensure maximum application flexibility. Furthermore, each SU has a dedicated management rack where the leaf switches are centralised. Other equipment for the DGX SuperPOD, such as the second-level spine switches or management servers, can sit in the empty space of an SU management rack or in a separate rack, depending on the datacentre layout. The below diagram depicts a typical SU layout.

The below diagrams depict the compute and storage architectures for a 140-node solution comprised of seven SUs. For all the options, topology diagrams and interconnect details please see the full NVIDIA DGX SuperPOD Reference Architecture document.
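The scalable-unit arithmetic above can be sanity-checked with a short calculation. The sketch below assumes 64-port NDR leaf switches and a full-bisection design, so treat it as an illustration rather than a cabling plan.

```python
# Back-of-the-envelope compute-fabric arithmetic for one DGX SuperPOD scalable
# unit (SU), using the figures quoted above: 20 DGX systems per SU and 8
# InfiniBand HCAs per system, one per network plane. The 64-port leaf switch
# size is an assumption for illustration.
DGX_PER_SU = 20
PLANES = 8                    # one compute HCA per plane in every DGX
LEAF_PORTS = 64               # assumed ports per NDR leaf switch

leaf_switches_per_su = PLANES                 # one leaf switch per plane
downlinks_per_leaf = DGX_PER_SU               # each DGX lands one HCA on each leaf
uplinks_per_leaf = downlinks_per_leaf         # full bisection: uplinks match downlinks
compute_cables_per_su = DGX_PER_SU * PLANES   # host-to-leaf links only

print(f"Leaf switches per SU: {leaf_switches_per_su}")
print(f"Ports used per leaf:  {downlinks_per_leaf + uplinks_per_leaf} of {LEAF_PORTS}")
print(f"Host-to-leaf cables:  {compute_cables_per_su}")
print(f"SUs in a 140-node SuperPOD: {140 // DGX_PER_SU}")
```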

SuperPOD Storage

For the storage element of the SuperPOD architecture, there are several options that Scan AI can provision. All are based on all-flash NVMe SSD hardware and software defined management platforms tried, tested and approved by NVIDIA.

NetApp


Configured with NetApp AFF-A series storage appliances and the ONTAP AI platform.

DDN


Configured with DDN A3I series storage appliances and software platform.

Dell-EMC


Configured with Dell PowerScale or Isilon storage appliances and the OneFS platform.

Secure Hosting

Accommodating a DGX SuperPOD architecture may not be possible on every organisation's premises, so Scan AI has teamed up with a number of secure hosting partners with UK- and European-based datacentres. This means you can be safe in the knowledge that the location housing your infrastructure is ideally suited to running a SuperPOD and accelerating your AI projects.

The NVIDIA DGX GH200 is designed to handle terabyte-class models for massive recommender systems, generative AI and graph analytics, offering 144TB of shared memory with linear scalability for giant AI models. Unlike existing AI supercomputers that are designed to support workloads that fit within the memory of a single system, the NVIDIA DGX GH200 offers a vast shared memory space across 256 Grace Hopper Superchips. This provides developers with nearly 500 times more fast-access memory and 48 times more bandwidth than previous-generation AI supercomputers, enabling trillion-parameter AI models. The DGX GH200 is pre-installed with NVIDIA Base Command, which includes an OS optimised for AI workloads, a cluster manager, and libraries that accelerate compute, storage and network infrastructure. It also includes NVIDIA AI Enterprise, providing a suite of software and frameworks optimised to streamline AI development and deployment. This full-stack solution enables customers to focus on innovation and worry less about managing their IT infrastructure.

Grace Hopper Superchip

The Grace Hopper architecture brings together the groundbreaking performance of the NVIDIA Hopper GPU with the versatility of the NVIDIA Grace CPU in a single superchip, connected with NVIDIA NVLink Chip-2-Chip (C2C), a high-bandwidth, low-latency, memory-coherent interconnect. Each NVIDIA Grace Hopper Superchip in the NVIDIA DGX GH200 has 480GB of LPDDR5 CPU memory and 96GB of HBM3 GPU memory. The NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect the 256 Grace Hopper Superchips, so each GPU in a DGX GH200 can access the memory of all other GPUs and the extended GPU memory of all NVIDIA Grace CPUs at an astonishing 900GB/s.
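The headline 144TB of shared memory follows directly from these per-superchip figures, as the quick calculation below shows.

```python
# How the DGX GH200's 144TB of NVLink-addressable memory is reached from the
# per-superchip figures quoted above (1TB taken as 1,024GB).
SUPERCHIPS = 256
CPU_MEM_GB = 480   # LPDDR5 attached to each Grace CPU
GPU_MEM_GB = 96    # HBM3 attached to each Hopper GPU

total_gb = SUPERCHIPS * (CPU_MEM_GB + GPU_MEM_GB)
print(f"{total_gb:,} GB ≈ {total_gb / 1024:.0f} TB of shared memory")  # 147,456 GB ≈ 144 TB
```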

Compute baseboards hosting Grace Hopper Superchips are connected to the NVLink Switch System using a custom cable harness for the first layer of NVLink fabric, while LinkX cables extend the connectivity in the second layer of NVLink fabric.

Every Grace Hopper Superchip in a DGX GH200 system is paired with an NVIDIA ConnectX-7 network card and a BlueField-3 DPU. For scaling beyond 256 GPUs, ConnectX-7 adapters can interconnect multiple DGX GH200 systems to scale into an even larger solution.

DGX GH200 Specification

DGX GH200 Storage

NetApp


Configured with NetApp AFF-A series storage appliances and the ONTAP AI platform.

DDN


Configured with DDN A3I series storage appliances and software platform.

Dell-EMC


Configured with Dell PowerScale or Isilon storage appliances and the OneFS platform.

Secure Hosting

Accommodating a DGX GH200 system may not be possible on every organisation's premises, so Scan AI has teamed up with a number of secure hosting partners with UK- and European-based datacentres. This means you can be safe in the knowledge that the location housing your infrastructure is ideally suited to running a DGX GH200 and accelerating your AI projects.