Cloud Buyers Guide

Cloud computing offers an alternative to having expensive computer systems at every desk in the work environment. Although at first glance it may appear to be simply a different way to deliver applications to your users, cloud technology brings many often-unappreciated advantages (and some disadvantages). Additionally, cloud services usually offer a large range of user profiles, so choosing the right one can be just as confusing as configuring a physical PC, workstation or laptop from a huge choice of components. Understanding which type of cloud service and which combination of profile specifications best fit a particular workload will help you select the right virtual instance for your needs, both now and over the longer term.

This guide will take you through all the major considerations when choosing a cloud service, from the initial choice of platform down to the individual elements of a cloud profile and how to get the most from them. It also covers some wider related topics, as these may all play a part in the choices you’ll make. Let’s begin.

What is Cloud Computing?

Before we get into the details of what to look out for in a cloud service, we’ll start by covering what cloud computing is and how it differs from using physical workstations or servers. Of course, whether you’re ‘in the cloud’ or not, you’re still ultimately accessing physical devices - the main difference is what is actually located on or under your desk. With physical infrastructure you’ll have a desktop PC or workstation that contains all the components and power you need to perform your work, whereas with a cloud or virtual infrastructure this is all centralised in a server room or datacentre. A workstation at your desk contains a CPU and GPU with a finite capability, regardless of whether you are utilising 10% or 100% of their power. With a cloud workstation, CPU and GPU resource is allocated to you at just the right level to carry out your work. If your work gets more demanding you’ll be allocated more resource - if it gets less demanding you’ll get less. Nothing is wasted and no resource sits idle. The below video explains a bit more about this.

[Embedded video]

Although this is the fundamental difference between physical and virtual machines, there are many other factors to consider, with advantages and disadvantages on both sides - we’ll cover these in more detail shortly, and then look at the parameters that make up a cloud profile and how to choose the best option to suit you.

Cloud TCO

Now that we’re familiar with what cloud computing is, it’s worth looking at the wider picture, as it offers some perhaps unexpected advantages alongside some disadvantages that should also be understood. The below table compares a cloud infrastructure with a traditional physical one, as many of these factors have a direct impact on the total cost of ownership (TCO) of the solution.

Feature | Cloud Infrastructure | Physical Infrastructure
GPU Hardware Location | Centralised - in the datacentre | Dispersed - at the desktop
Number of users per GPU device | Multiple | One
Hardware Flexibility | Dynamically assign multiple GPUs to one person, or multiple people to a single GPU, as workflows require | Single user using the GPU resource of a single machine
Scalability | Add extra users as required | Dedicated hardware required for each user
User Flexibility | High - users can work on almost any device from anywhere, and collaborate more effectively with some solutions | Low - users must be at their desk
Management | Admin required to create new users and assign GPU resource | PC set-up only
Security | High levels required as data is constantly transferred between servers and devices | Normal levels required as data is local
Work Offline | No - a connection to the GPU servers is required at all times | Yes - GPU hardware is local so no network connection is needed
Infrastructure Audit | Recommended - to ensure the network is fast enough and secure enough to support a virtual deployment | Not recommended - only adding PCs or workstations to the network
Cost | Potentially high initial outlay on GPU server hardware; low TCO per user | Low outlay as no server hardware needed; potentially high TCO per user (depending on PC or workstation spec)
GPU Utilisation | Potentially 100%, as resource can be sliced and segregated as needed | Rarely 100%, as hardware has to be provisioned for the most demanding task even if only run occasionally
Productivity | Very high - many users can work during the day and GPU resource can be pooled overnight for large tasks such as rendering | Lower - one user per physical workstation
Need for Upgrade | Low and infrequent - more demanding applications and datasets can be accommodated by simply assigning more resource to that user from the GPU pool | High and frequent - a workstation may need regular upgrades as applications outgrow its finite GPU hardware

The conclusion to draw from this comparison is that medium to large organisations may benefit from the flexibility and efficient scalability of a virtualised GPU solution. As user numbers vary, individual workloads rise and fall and working locations multiply, a centralised infrastructure gives ultimate control over how GPU resource is allocated for maximum productivity and flexibility across the workforce. There will be more initial cost, but productivity gains would be expected to easily offset this. It is also worth mentioning that a cloud infrastructure adds resilience to any organisation, as its inherent flexibility and adaptability allows a much more rapid response to a sudden requirement to work from home, or from varied locations in a disaster recovery scenario.

Although the above table compares cloud and physical infrastructure, the decision doesn’t have to be that clear-cut - you may choose to deliver a virtual desktop solution to all your employees, making onboarding and delivery of internal apps simple and centrally controlled, while keeping your content creators or data analysts on GPU-accelerated workstations at their desks, as they may be few in number and have very specific and predictable GPU requirements. This kind of hybrid scenario is very common in business.

Public or Private Cloud

So, for the purposes of this guide, let’s now assume you are definitely looking at some form of cloud infrastructure for your organisation. As shown in the video above, cloud computing is essentially a different way to deliver the hardware and software resource required to complete a task. However, although we now understand that the resource is centralised instead of sitting at each desk, the first question to address is where that centralised hardware should be located.

The first option is to locate the centralised server hardware on your own premises - this is termed a ‘private cloud’, as your organisation owns, maintains and controls all the infrastructure and connectivity. The second option is to simply rent hardware resource from a third party and let them take care of all the admin and maintenance. You simply pay for what you use, and this is termed ‘public cloud’, as other organisations will also be sharing that resource - albeit in a secure, segregated fashion. There is a third option where your hardware is hosted in a third-party secure datacentre and connects to your own infrastructure - the hardware is yours but you pay another organisation to handle the day-to-day maintenance, admin, connectivity and access rights - this is called ‘hybrid cloud’.


If you opt to build your own private cloud, then you’ll have to invest in all the datacentre hardware required to deliver the solution - GPU servers, storage and a wider networking infrastructure capable of delivering the applications to your teams, whether they are office-based or remotely located. Once you’ve ensured your hardware is up to scratch, you’ll then need the appropriate software platform to deliver the services you require, such as NVIDIA RTX Virtual Workstation (vWS), NVIDIA Virtual Compute Server (vCS) or NVIDIA Virtual PC (vPC).

Alternatively, a public cloud service such as Scan Cloud provides the infrastructure for your applications to run on, without the need for upfront hardware investment. Flexible subscriptions take care of the GPU systems, storage and networking assets that enable your workers to connect wherever they need to be, and there are no ongoing staff overheads to manage or maintain that infrastructure. The below table compares the two scenarios.

Feature | Public Cloud | Private Cloud
Hardware Location | Shared datacentre - securely hosted and segregated between multiple organisations | Dedicated to your organisation
Hardware Ownership | No | Yes
Upfront Hardware Investment | No | Yes
In-house Maintenance & Admin | No | Yes
You Pay For | Licences only | Hardware & licences
Typically Best Suited To | Smaller organisations | Larger organisations

Whether you choose public or private cloud, or a hybrid option, you’ll still need to understand what resource individual users or tasks will require. Any given user on any cloud platform will need a ‘profile’ that defines what resource they have access to - this can be static for a given user if they always undertake the same work or may change task by task if some projects are more demanding than others. We’ll next look at cloud profiles and then in more detail at each resource aspect.

Cloud Profiles

A profile is the term used to define the specification of the virtual service you are using - it can also be referred to as a cloud instance. It is made up of CPU cores, GPU memory, system memory and storage capacity. How these are provisioned and managed differs depending on whether you are dealing with a public or private cloud service.
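If you want to sanity-check what a given instance actually exposes once you are logged in, a quick script can list the visible resources. Below is a minimal Python sketch, assuming a Linux-based instance with the NVIDIA driver (and therefore nvidia-smi) installed - the GPU query would need adjusting or omitting on Windows profiles.

```python
# Minimal sketch: list the CPU, memory, storage and GPU resource visible to this instance.
# Assumes a Linux instance; the GPU query assumes the NVIDIA driver and nvidia-smi are present.
import os
import shutil
import subprocess

print("CPU cores:", os.cpu_count())

# System memory, read from /proc/meminfo (Linux only)
with open("/proc/meminfo") as f:
    mem_kb = int(next(line for line in f if line.startswith("MemTotal")).split()[1])
print(f"System memory: {mem_kb / 1024 / 1024:.0f} GB")

# Capacity of the OS / application volume
total, _, free = shutil.disk_usage("/")
print(f"Storage: {total / 1e9:.0f} GB total, {free / 1e9:.0f} GB free")

# GPU(s) allocated to the profile
gpus = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("GPU(s):", gpus.stdout.strip() or "none visible")
```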

Public cloud services such as Scan Cloud provide a best-fit service for a given range of applications. Certain profiles will be best suited to graphical applications, others to video or rendering, and the high-end ones will be designed around scientific compute workloads such as HPC, deep learning and AI - although, depending on the software package(s) you are using, there will always be overlap in suitability. Likewise, your project specifics and complexities will also impact your cloud profile choices. Some examples of the Scan Cloud profile range can be viewed below, or you can learn more by selecting a workload type.

Graphics Workloads

Video Workloads

Rendering Workloads

HPC & AI Workloads

For a private cloud platform there are also tailored services aimed at certain use cases, but as you simply buy a licence per user for any given service, you will need to purchase servers powerful enough to deliver the level and quality of service you want. For example, a server with two processors (CPUs) and eight graphics cards (GPUs) will give a very different experience shared between two users than between ten, as the resource is finite - a rough worked example of this split follows the table. Typical examples of servers and what they may deliver are shown below, or you can learn more by selecting a service type.

Component | Light Use (16-24 users per server) | Medium Use (9-18 users per server) | Heavy Use (6-12 users per server)
GPU | 4-8x NVIDIA T1000 | 4-8x NVIDIA RTX 5000 / A40 | 8x NVIDIA RTX 6000 Ada / L40 / H100
CPU | 1x AMD EPYC | 2x Intel Xeon / AMD EPYC | 2x Intel Xeon / AMD EPYC
RAM | 128-512GB | 512-768GB | 1TB+
Networking | 10GbE | 50GbE | 100GbE / 100Gb InfiniBand
Storage | Flash-based storage | Flash-based storage | Flash-based storage
Typical Usage | VDI / CAD | CAE / Video / Rendering | HPC / AI
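To make the ‘two users versus ten users’ point concrete, the rough arithmetic below divides a hypothetical 2-CPU, 8-GPU server evenly between different user counts. The per-server figures (48GB per card, 32 cores per CPU, 768GB RAM) are illustrative assumptions, not a quoted spec.

```python
# Illustrative only: even split of a hypothetical 2-CPU, 8-GPU server between users.
SERVER = {"gpu_memory_gb": 8 * 48, "cpu_cores": 2 * 32, "ram_gb": 768}

for users in (2, 10):
    share = {k: v / users for k, v in SERVER.items()}
    print(f"{users:>2} users -> {share['gpu_memory_gb']:.0f}GB GPU memory, "
          f"{share['cpu_cores']:.1f} CPU cores, {share['ram_gb']:.0f}GB RAM each")
```

In practice a hypervisor or scheduler will not split resources perfectly evenly, but the contrast shows why the number of users per server matters so much to the experience each of them gets.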

NVIDIA vPC

NVIDIA vWS

Cloud Profile Components

Over the next few sections we will look at the individual components of a cloud profile - GPU, CPU, RAM, storage capacity and operating system. We have assumed a public cloud scenario to show how you should choose between profiles; however, if you are considering private cloud, these sections still act as a guide to what you would need per user - the overall combined resource would then be consolidated into a server spec that could deliver your chosen private cloud services. Let’s start with the GPU.

Cloud GPU

Most cloud platforms have been designed as an alternative to GPU-accelerated workstations and servers, so perhaps unsurprisingly we consider the GPU specification the most important element of any cloud profile. It is the degree of graphical horsepower that will dictate whether you are limited to basic 2D modelling or whether you can render large, complex 3D animations with ease. The GPU accelerators used to deliver cloud services will always be professional-grade models rather than consumer versions, thanks to their higher-grade components designed to run 24/7, certified driver support, enhanced security and ECC memory technology.

The GPU power in a virtual profile is broken down into GPU memory, CUDA cores, Tensor cores and RT (ray-tracing) cores - very much the same parameters you would consider if buying a physical GPU. Essentially GPU memory, measured in GB, refers to the physical amount of memory you have to work with - if you have very large animations or complex renders then a large amount of GPU memory will be required. When it comes to cores, they are there to perform parallel compute functions - whether involved with content creation and how a 3D image appears, performing ray tracing tasks within CGI filmmaking or doing matrix multiplication calculations on a large dataset for machine learning.

CUDA cores perform the bulk of this work, whereas Tensor cores are employed to speed up the complex calculations seen in deep learning and AI applications. As the name suggests, RT cores are used to improve ray-tracing performance. Generally speaking, all three core types increase in number, along with GPU memory capacity, as you move up through the cloud profiles, because the tasks and applications these higher-end instances are aimed at increase in complexity and demand. They also reflect a ‘time to results’ aspect, in that more mission-critical workloads can be completed in less time.
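As a simple illustration of the CUDA core versus Tensor core distinction, the sketch below runs a half-precision matrix multiplication, the kind of operation that is typically dispatched to Tensor cores on recent NVIDIA GPUs. It assumes PyTorch with CUDA support is available on the instance and is an illustration rather than a benchmark.

```python
import torch

# Illustration: half-precision matrix multiplications of this kind are typically
# dispatched to Tensor cores on recent NVIDIA GPUs.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b
print(c.shape, c.dtype)
```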

Typical GPU Profiles | Entry Level | Mid Range | High End
GPU Memory | 1-4GB | 24-48GB | 4-8x 48GB
CUDA Cores | 896 | 10,752 | 4-8x 10,752
Tensor Cores | 28 | 336 | 4-8x 336
RT Cores | 7 | 84 | 4-8x 84
Recommended Usage | Office apps / CAD | 3D Modelling / Editing / Animation / VFX / AI Development | Rendering / AI Training / HPC

It is worth mentioning that, up to the 48GB of GPU memory in the mid-range profiles, the centralised server resource is using a single NVIDIA RTX GPU card, or a fraction of one. Above this, multiple GPUs are combined, so our profiles list the individual GPU specs multiplied by the number of GPUs allocated to the profile. Although cloud services often have a wide variety of profiles available - tailored to specific workloads in most cases - there is the flexibility to request a bespoke profile, as a cloud service is by definition about allocating compute resource that is just right for the requirement.
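As a hypothetical illustration of this fractional allocation, the arithmetic below shows how many concurrent profiles of a given GPU memory size could share one 48GB card; real vGPU profile sizes and scheduler overheads will vary.

```python
# Hypothetical illustration: how many profiles of a given size fit on one 48GB card.
CARD_MEMORY_GB = 48
for profile_gb in (4, 8, 12, 24, 48):
    print(f"{profile_gb:>2}GB profile -> up to {CARD_MEMORY_GB // profile_gb} users per card")
```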

Cloud CPU

Cloud services are run from server hardware, so the CPUs are always going to be either AMD EPYC or Intel Xeon Scalable processors, as these are designed for enterprise-grade workloads, deliver enhanced security and are built to run many virtual machines - the nature of a cloud service.

Typical CPU Profiles | Entry Level | Mid Range | High End
CPU Cores | 6 | 16-24 | 32-128
Recommended Usage | Office apps / CAD | 3D Modelling / Editing / Animation / VFX / AI Development | Rendering / AI Training / HPC

A greater number of cores will be required when multiple tasks are being carried out and the general principle is for CPU cores to increase in line with system memory and GPU capability, as the more demanding applications will usually require increasing resource in all areas. There are a few exceptions to this and if CPU cores or CPU frequency is the key to any project, bespoke profiles can be assigned to address this need and optimise performance.

Cloud Memory

As previously mentioned, cloud profile memory (RAM) allocation is closely tied to CPU cores - much like in a physical system, the two are interlinked: too little memory is a limiting factor for a powerful processor, and extra memory is wasted if the CPU is not powerful enough.

Typical RAM Profiles | Entry Level | Mid Range | High End
Memory | 16GB | 64-128GB | 256-1024GB
Recommended Usage | Office apps / CAD | 3D Modelling / Editing / Animation / VFX / AI Development | Rendering / AI Training / HPC

As the backbone of a cloud service is server technology, any associated memory will usually be of the Registered variety (RDIMM) and feature ECC (Error Correction Code) capability, which detects and corrects the most common kinds of memory data corruption before they can cause data errors or, worse still, system crashes.

Cloud Storage

Each cloud instance will be provided with high-performance SSD storage for the operating system and applications. This capacity usually increases in line with the other aspects of the profile - GPU, CPU and memory - as higher-end profiles are expected to run a greater number of more complex applications. Storage of project data, such as models, textures and video, is provided via dedicated cloud storage nodes that can be accessed by the PC or workstation profiles in your account - this capacity is usually tailored to your use, as every project’s size and demands will be different.

Typical Storage Profiles | Entry Level | Mid Range | High End
OS / Application Capacity | 500GB | 1-2TB | 7-15TB
Project Capacity | Bespoke | Bespoke | Bespoke
Recommended Usage | Office apps / CAD | 3D Modelling / Editing / Animation / VFX / AI Development | Rendering / AI Training / HPC

As you would expect, the SSD drives and arrays providing both types of data capacity are enterprise grade and fully RAID protected to ensure your data integrity is maintained.

Operating Systems & Software

Each cloud instance will need an operating system on which to install your applications, just as you would on any physical machine. For PC and workstation profiles this will be a fresh install of Windows 10 Pro, in line with what you would expect in any professional office environment. For AI and HPC instances, a Linux-based operating system is most common, with specific distributions available if your project demands it.

All you need to do then is install and license your applications - the majority of content creation, scientific and high performance computing applications are certified to work in virtualised environments, so using them on a cloud platform should be indistinguishable from using them on a physical machine. For any frameworks and libraries required, the usual method of downloading them from a service such as NVIDIA GPU Cloud (NGC) still applies. Popular supported applications are listed in the tabs below.
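Once the operating system and a framework are in place, it is worth confirming that the framework can actually see the GPU resource allocated to your profile. Below is a minimal Python check, assuming PyTorch (for example from an NGC container) is installed on the instance.

```python
import torch

# Confirm the framework can see the GPU resource assigned to this cloud profile.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.0f} GB memory")
else:
    print("No CUDA-capable GPU visible to this instance")
```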

Access, Security & Support

The nature of a cloud service - delivering flexibility and the ability to personalise profiles - means it must run on a highly available infrastructure. To this end, all cloud services will be hosted within a datacentre environment, as only such facilities can provide the admin, support and technical personnel required to ensure 24/7 availability with maximum uptime - typically 99.999% or greater. They are also designed around scalability, to cope with sudden or continuous increases in demand without performance suffering for any users. Additionally, these facilities allow for the best in physical security in the form of restricted access, as well as comprehensive redundancy in power supply and cooling, zero single-point-of-failure hardware configurations, robust data security and encrypted remote access for users.

When it comes to support for your cloud infrastructure, this really is a tale of two halves. For a private cloud solution, ownership of maintenance, admin and management falls on the organisation, as the infrastructure will often sit on its premises; alternatively it may be hosted within a datacentre where a third-party company carries out these tasks for a fee. For a public cloud, there is little or no support burden on the organisation as the cloud provider takes care of everything - however, hyperscale providers such as Amazon and Google will not offer technical support either, so you are often reliant on your own expertise. It is therefore worth mentioning at this point that, unlike many hyperscale services, the Scan Cloud service is supported by a full team of expert consultants, including application specialists, software engineers and hardware architects, to support and enhance your cloud experience.

Collaboration

As we’ve seen throughout this guide, flexibility is one of the key features of any cloud platform - whether private or public. The inherent nature of cloud is that multiple users can work on the same project without needing to be in the same location, and additionally they can continue to work if they change location as a laptop can be used at home, in the office or on any work site. Because the GPU hardware isn’t physically with them they can effectively access it anywhere, at any time.

The development of a platform called NVIDIA Omniverse Enterprise has taken this concept one step further. Even with flexible locations and adaptability, most projects are still bound by sharing a single version of the work and having to transfer the latest one between active users. They may also be limited by different users needing different software applications to finish a project, causing compatibility problems. NVIDIA Omniverse Enterprise changes this - not only does it connect previously incompatible software applications, it lets users collaborate on the same project at the same time - in real time.

[Embedded video]

Through a combination of apps and connectors all users have the ability to share, collaborate and increase their productivity regardless of their location, applications and hardware. The only requirement is an NVIDIA RTX GPU to provide the necessary power to access the platform - either deployed as a service on private cloud or an additional extra on a public cloud profile.

LEARN MORE ABOUT OMNIVERSE ENTERPRISE

Virtualisation

As previously mentioned, our AI and HPC public cloud profiles combine multiple GPUs and deliver enormous performance and workload throughput, but usually only for a single user per profile. GPU pooling consolidates these resources to give greater control and visibility, by deploying additional Run:ai Atlas software on top of your virtual instance. This platform decouples data science workloads from the GPU resources powering them, pooling the physical GPUs and applying an advanced scheduling mechanism to get the most out of often varied yet demanding workflows.

[Embedded video]

Run:ai Atlas greatly increases the ability to fully utilise all available resources, from pooled GPUs to fractional GPUs, to ensure that all your users can adequately receive the time and power they need to get their work completed whilst allowing for fast iteration across an entire project team.

LEARN MORE ABOUT RUN:AI ATLAS SOFTWARE

Start your Cloud Journey

We hope you’ve found our guide to cloud platforms and services useful, giving you the additional knowledge to make the right choices for your organisation and individual set of circumstances. You can start your cloud journey by submitting the below form to arrange a consultation with one of our experts, including a free proof-of-concept trial.

We look forward to hearing from you.

LEARN MORE