Run:ai Atlas Platform

Access 100% of your compute power, no matter when you need it, to accelerate AI development

 

Getting the most out of your AI infrastructure

It is estimated that more than 80% of all AI models don't make it to production. One significant reason for this is that AI requires an entirely new infrastructure stack - including frameworks, software and hardware accelerators. These valuable resources are complex to manage and are often left sitting idle due to static allocation. When resources are allocated to teams or researchers who aren’t actively using them, it wastes compute that could otherwise be used for other tasks. The Run:ai Atlas platform breaks this paradigm by creating virtual pools of GPUs and automatically allocating the right amount of compute for every task, from huge distributed computing workloads to small inference jobs.


Run:ai Atlas automates resource management and consumption so that users can easily access GPU fractions, multiple GPUs or clusters of GPUs for workloads of every size and stage of the AI lifecycle. This ensures that all available compute can be utilised and GPUs never have to sit idle. Whenever extra compute resources are available, data scientists can exceed their assigned quota, speeding time to results and ultimately meeting the business goals of their AI initiatives.

Pool GPU Compute

Centralise AI

Pool GPU compute resources so IT gains visibility and control over resource prioritisation and allocation

Guaranteed Quotas

Maximise Utilisation

Automatic and dynamic provisioning of GPUs breaks the limits of static allocation to get the most out of existing resources

Elasticity

Deploy to Production

An end-to-end solution for the entire AI lifecycle, from developing to training and inferencing, all delivered in a single platform

How Run:ai Atlas Works

Run:ai’s Atlas scheduler is a simple plug-in for Kubernetes clusters that adds high-performance orchestration to your containerised AI workloads. Built on a cloud-native operating system with multi- and hybrid-cloud support, Atlas lets you run AI initiatives anywhere - on-premises, at the edge or in the cloud. The Run:ai dynamic, workload-aware scheduler requires no advanced setup and works with any Kubernetes distribution, including vanilla Kubernetes, Red Hat OpenShift and HPE Container Platform. The GPU abstraction layer also offers deep integration with AI accelerators, enabling efficient sharing and automated configuration of these resources across multiple workloads.
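
To make this concrete, the sketch below builds a minimal pod manifest that hands scheduling over to Run:ai and requests a fraction of a GPU. It is an assumption-laden illustration: the scheduler name (runai-scheduler), the project label and the gpu-fraction annotation key are typical of Run:ai deployments but should be checked against the documentation for your version.

```python
# Illustrative sketch only: the scheduler name, project label and
# gpu-fraction annotation key are assumptions, not verified API --
# check the Run:ai documentation for the exact keys in your release.
import yaml  # pip install pyyaml

def fractional_gpu_pod(name: str, image: str, project: str, gpu_fraction: float) -> dict:
    """Build a pod manifest that delegates scheduling to the Run:ai scheduler
    and requests a fraction of a single GPU instead of a whole device."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"project": project},                      # assumed project label
            "annotations": {"gpu-fraction": str(gpu_fraction)},  # assumed fraction key
        },
        "spec": {
            "schedulerName": "runai-scheduler",                  # assumed scheduler name
            "containers": [{"name": name, "image": image}],
        },
    }

if __name__ == "__main__":
    # Print the manifest so it can be piped to `kubectl apply -f -`
    print(yaml.safe_dump(
        fractional_gpu_pod("notebook", "jupyter/tensorflow-notebook", "team-vision", 0.5),
        sort_keys=False))
```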

Self-serve

For data scientists who prefer not to interact directly with code, Run:ai offers a native Researcher UI and support for a wide variety of popular ML tools, as well as a command-line interface. Researchers can launch jobs with their choice of tools, and Run:ai’s scheduler automatically and fairly allocates resources. There’s no need for IT to manually provision GPUs; IT simply sets custom rules and prioritisation based on the organisation’s goals.
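
For researchers who do work from the command line, submission is typically a single command; the Python wrapper below sketches how that might be scripted. The runai submit flag names shown are illustrative assumptions and may differ between CLI versions.

```python
# Illustrative only: flag names for the Run:ai CLI are assumptions and may
# vary between releases -- verify against `runai submit --help`.
import subprocess

def submit_training_job(name: str, image: str, gpus: float, project: str) -> None:
    """Submit a training workload through the (assumed) Run:ai CLI."""
    cmd = [
        "runai", "submit", name,
        "--image", image,        # container image to run
        "--gpu", str(gpus),      # whole GPUs, or a fraction such as 0.5
        "--project", project,    # project/queue the job is charged against
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit_training_job("resnet-train", "nvcr.io/nvidia/pytorch:23.10-py3", 1, "team-vision")
```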

Single Pane of Glass

Gain centralised, multi-tenant management of resources, utilisation, health and performance across every aspect of the AI pipeline, no matter where the workloads are run. Run:ai’s dashboard offers real-time and historical views of all resources managed by the platform, including jobs, deployments, projects, users, GPUs and clusters.

Simple Workload Scheduling

The Run:ai scheduler manages tasks in batches using multiple queues on top of Kubernetes, allowing system admins to define different rules, policies and requirements for each queue based on business priorities. Bridging the efficiency of high-performance computing and the simplicity of Kubernetes, the scheduler allows users to easily make use of fractional GPUs, integer GPUs, and multiple nodes of GPUs for distributed training. In this way, AI workloads run based on needs, not capacity.
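
As a loose mental model only (not Run:ai’s configuration schema), the sketch below shows the kind of per-queue policy an admin might express: a guaranteed GPU quota, a priority, and whether over-quota work may be reclaimed.

```python
# A toy model of per-queue scheduling policy -- illustrative only, not
# Run:ai's configuration schema.
from dataclasses import dataclass

@dataclass
class QueuePolicy:
    name: str              # e.g. a team or project queue
    guaranteed_gpus: int   # quota the queue can always claim
    priority: int          # higher values are scheduled first
    preemptible_over_quota: bool = True  # may over-quota jobs be reclaimed?

POLICIES = [
    QueuePolicy("production-inference", guaranteed_gpus=8,  priority=100,
                preemptible_over_quota=False),
    QueuePolicy("research-training",    guaranteed_gpus=16, priority=50),
    QueuePolicy("interactive-dev",      guaranteed_gpus=4,  priority=10),
]
```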

Batch Scheduling

Batch scheduling refers to grouping many processing jobs that can run to completion in parallel without user intervention. Each job frees up its resources as soon as it finishes, making the system much more efficient. Training jobs can be queued and then launched when resources become available. Workloads can also be stopped and restarted later if resources need to be reclaimed and allocated to more urgent jobs or to under-served users.
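
The toy simulation below is purely illustrative (it is not Run:ai code), but it captures the behaviour described above: jobs wait in a queue, start when enough GPUs are free, and over-quota work is preempted and requeued when a more urgent job arrives.

```python
# Toy batch scheduler -- illustrative of the queueing/preemption behaviour
# described above, not Run:ai's implementation.
from collections import deque
from dataclasses import dataclass

TOTAL_GPUS = 8

@dataclass
class Job:
    name: str
    gpus: int
    over_quota: bool = False  # running beyond its queue's guaranteed quota

queue: deque[Job] = deque()
running: list[Job] = []

def free_gpus() -> int:
    return TOTAL_GPUS - sum(j.gpus for j in running)

def launch_from_queue() -> None:
    """Start queued jobs in order while enough GPUs are free."""
    while queue and queue[0].gpus <= free_gpus():
        running.append(queue.popleft())

def submit(job: Job) -> None:
    """Queue a job; preempt over-quota work if needed to make room."""
    queue.append(job)
    while queue[0].gpus > free_gpus():
        victim = next((j for j in running if j.over_quota), None)
        if victim is None:
            break                    # nothing reclaimable; the job simply waits
        running.remove(victim)
        queue.append(victim)         # preempted work is requeued, not lost
    launch_from_queue()

if __name__ == "__main__":
    submit(Job("opportunistic-train", gpus=6, over_quota=True))
    submit(Job("urgent-finetune", gpus=4))   # reclaims GPUs from over-quota work
    print("running:", [j.name for j in running])
    print("queued:", [j.name for j in queue])
```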

Gang Scheduling

Distributed training often runs compute-intensive jobs across multiple GPU machines, where all of the GPUs need to be synchronised to communicate and share information. Gang scheduling is used when containers need to be launched together, start together, recover from failures together and end together. Networking and communication between machines can be automated by the cluster orchestrator.
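
The sketch below illustrates the all-or-nothing rule at the heart of gang scheduling (again, not Run:ai’s implementation): a distributed job is only admitted if every one of its workers can be placed at once; otherwise nothing is started and no GPUs are held waiting.

```python
# Toy illustration of gang scheduling: place all workers of a distributed
# job atomically, or place none of them. Not Run:ai's implementation.
from typing import Optional

def gang_place(workers_needed: int, gpus_per_worker: int,
               free_gpus_per_node: dict[str, int]) -> Optional[dict[str, int]]:
    """Return a {node: workers} placement covering the whole gang, or None."""
    placement: dict[str, int] = {}
    remaining = workers_needed
    for node, free in free_gpus_per_node.items():
        fit = min(remaining, free // gpus_per_worker)
        if fit:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement          # every worker fits: admit the whole gang
    return None                       # partial fit only: admit nothing

if __name__ == "__main__":
    cluster = {"node-a": 4, "node-b": 2, "node-c": 3}
    print(gang_place(workers_needed=4, gpus_per_worker=2, free_gpus_per_node=cluster))
    # -> {'node-a': 2, 'node-b': 1, 'node-c': 1}
    print(gang_place(workers_needed=6, gpus_per_worker=2, free_gpus_per_node=cluster))
    # -> None (only 4 workers fit, so none are started)
```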

Topology Awareness

Topology awareness addresses a common inconsistency: a researcher runs a container once and gets excellent performance, then runs it again on the same server and gets poor performance. The problem comes from the topology of GPUs, CPUs and the links between them; the same issue can occur for distributed workloads due to the topology of NICs and the links between GPU servers. The Run:ai scheduler ensures that the physical properties of AI infrastructure are taken into account when running AI workloads, for ideal and consistent performance.
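
A simplified sketch of the idea (not Run:ai’s algorithm): candidate GPU placements are scored by how well connected the chosen devices are, so a two-GPU job lands on an NVLink-paired set rather than on whichever GPUs happen to be free.

```python
# Toy topology-aware placement: prefer GPU sets with the best interconnect.
# Illustrative only -- not Run:ai's scheduler logic.
from itertools import combinations

# Assumed toy topology: which GPU pairs share an NVLink bridge.
NVLINK_PAIRS = {frozenset({0, 1}), frozenset({2, 3})}

def link_score(gpu_set: tuple[int, ...]) -> int:
    """Count NVLink-connected pairs inside a candidate GPU set."""
    return sum(1 for pair in combinations(gpu_set, 2)
               if frozenset(pair) in NVLINK_PAIRS)

def pick_gpus(free_gpus: list[int], needed: int) -> tuple[int, ...]:
    """Choose the candidate set of GPUs with the strongest interconnect."""
    return max(combinations(free_gpus, needed), key=link_score)

if __name__ == "__main__":
    # GPUs 1, 2 and 3 are free; a 2-GPU job should land on the NVLink pair (2, 3),
    # not on the weakly connected (1, 2) or (1, 3) combinations.
    print(pick_gpus(free_gpus=[1, 2, 3], needed=2))  # -> (2, 3)
```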

How-to Videos


Optimizing GPU Utilization with Run:ai

Learn about the difference between GPU utilization and GPU allocation and how Run:ai can be used to increase utilization.


Open Demo: Autoscaling Inference on AWS

Join Guy Salton, Director of Solution Engineering from Run:ai, as he discusses Auto-Scaling Inference on AWS.


Training (Batch) Jobs

Get a glimpse of Run:ai Batch (Training) capability as presented by our Solution Engineering Team.

King’s College London

The London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare, based at King’s College London (KCL), is using the latest virtualisation software from Run:ai to speed research projects. The software optimises and enhances existing computing resources - NVIDIA DGX-1 and DGX-2 supercomputers plus their associated infrastructure, installed and configured by the Scan AI team.

These NVIDIA DGX platforms are used to train algorithms that create AI-powered tools for faster diagnosis, personalised therapies and effective screening, using an enormous trove of de-identified public health data from the NHS. The training requires vast amounts of GPU-accelerated compute power, which the DGX appliances provide, but to improve resource allocation and scheduling Run:ai software was added to the KCL platform - this has since doubled GPU utilisation, supporting more than 300 experiments within a 40-day period.

Read case study

Guided Proof of Concepts with Scan AI and Run:ai

The Scan AI team is unique in its ability to offer a Proof of Concept (PoC) trial of the Run:ai software platform running on multiple NVIDIA GPUs. This allows you to understand how the scheduling and pooling software will improve your GPU utilisation. There are two options for how the PoC can be conducted:

PoC in a customer's on-premises environment

In this scenario, a prospect's data scientists can run an evaluation of Run:ai in their own environment, using their own workflows. They can choose one, multiple or all of their servers to run their PoC. A trial done this way allows researchers to continue running experiments with Run:ai in their production environment without having to transfer experiments to a test environment. The on-premises PoC also lets them compare Run:ai to their existing tools, making it easy to see benchmarks and measure the efficiency of the Run:ai system. Currently, prospective customers do not need to purchase a PoC licence, but this may change in 2021.

PoC in a Scan Lab Environment

In the second scenario, a prospective customer can avoid setting up a dedicated production cluster for the PoC and instead evaluate Run:ai in a pre-prepared Scan environment. Although this environment does not connect to their own data centre, prospects can still try out all of Run:ai's features - for example, pooling disparate resources and scaling distributed training across many nodes - even if their own use of ML and DL is not yet at that scale. The Scan environment is ready to use, with Kubernetes and Run:ai already installed, so customers avoid the potential inconvenience of installation.

Register for PoC

Ways to Purchase

By applying Run:ai software to an existing cluster of GPUs, no matter how disparate, you will see an immediate improvement in how your newly virtualised pool of GPU resources can be scheduled and shared out.

Run:ai software is licensed per GPU you want to virtualise, regardless of the age or specification of the GPU - making it a very easy way to improve productivity and to keep increasing your virtual GPU pool as you add GPUs to your infrastructure.

When choosing new hardware for your AI projects, including the added flexibility that Run:ai software provides couldn't be easier. For each system, simply match the number of Run:ai licences to the number of GPUs in either a 1-, 3- or 5-year subscription. Our 3XS build team will install the software, so you have scheduling control and the ability to maximise your GPU utilisation out of the box. Furthermore, if you add Run:ai licences to your system builds every time, your GPU pool will continue to grow seamlessly with each hardware addition - the software will simply ‘discover’ the new GPUs and add them to your resource pool.

The 3XS Systems team and Run:ai have developed a range of certified appliances - designed, tested and configured to get the most out of GPU virtualisation whilst remaining cost-effective. They each include a 1-year licence for Run:ai software and cover a range of specifications, from development workstations to server platforms.

 
| | Model Development Workstation | Training Server (6x NVIDIA A40) | Training Server (8x NVIDIA A30) |
|---|---|---|---|
| TF32 Performance | 116TF | 450TF | 656TF |
| FP64 Performance | 1.9TF | 7.2TF | 41.6TF |
| Cost | £18,999 ex VAT | £50,499 ex VAT | £62,499 ex VAT |
| GPUs | 4x watercooled NVIDIA GeForce RTX 3090 | 6x NVIDIA A40 | 8x NVIDIA A30 |
| CUDA Cores | 10,496 per GPU | 10,752 per GPU | 3,804 per GPU |
| GPU Memory | 24GB GDDR6X per GPU, 96GB total | 48GB GDDR6 per GPU, 288GB total | 24GB HBM per GPU, 192GB total |
| GPU Interconnects | GPUs paired with NVLink | GPUs paired with NVLink | GPUs paired with NVLink |
| CPU | AMD Threadripper PRO 3975WX, 32C/64T | 2x AMD EPYC 7513, combined 64C/128T | 2x AMD EPYC 7702, combined 128C/256T |
| System Memory | 256GB ECC Reg DDR4 | 512GB ECC Reg DDR4 | 1,024GB ECC Reg DDR4 |
| System Drives | 1TB SSD | 1TB SSD | 2x 1TB SSD |
| Storage Drives | 4TB HDD | 4x 3.84TB SSD | 4x 3.84TB SSD |
| Networking | 2x 10GbE | 2x 200GbE/IB | 2x 200GbE/IB |
| Operating System | Ubuntu Linux | Ubuntu Linux | Ubuntu Linux |
| Run:ai Licence | 1 Year Subscription | 1 Year Subscription | 1 Year Subscription |
| Power Requirement | 2,750W | 6,600W | 6,600W |
| Dimensions | 307 x 697 x 693mm Tower | 4U 19in Rackmount | 4U 19in Rackmount |