Choosing between off-premise Public Clouds and on-premise Private Clouds
Given the wide range of infrastructure cloud computing (IaaS) options soon to be available to staff, faculty and researchers at CU Boulder, we’d like to provide some initial guidance as to how to choose between these options. Many services are well suited for off-premise public cloud environments, but there are also some that would be much better served in an on-premise private cloud, whether the OIT private cloud or the Research Computing offerings.
As we consider this decision there are a few major factors to consider:
- Business Drivers: characteristics of the service that drive value back to the business
- Data Gravity: maintaining data “close” to the application
- Service characteristics: aspects of the service itself that “fit” a particular cloud
- Natural Affinity: characteristics of the service may naturally fit with a specific provider (e.g. Microsoft centric services may be a better fit on a Microsoft cloud)
- Security/Compliance: regulatory requirements may be a factor for a specific cloud
Business Drivers
- Agility: Off-premise public cloud can improve speed of delivery, ease adjustment to changing requirements, and provide a platform for quick innovation. Consider the application's user base and its willingness to adapt to change quickly, since that may limit the practical agility of the overall solution.
- Cost: monetary savings should not be the primary driver for public cloud adoption, since public cloud is not always less expensive. Implicit or opportunity costs, in terms of time saved by one option or another, may be a contributing factor.
- Expected Growth Pattern: applications which have the potential for fast growth in usage may be a good fit for public cloud since new resources can be added much more quickly
- Elasticity of load: if an application is highly used at certain times of the day or year relative to low baseline usage it may benefit from giving back unneeded resources during slow times (cost savings)
- Short time to live: if an application is only needed for a short period of time, public cloud may provide a platform to improve speed of deployment and teardown (agility).
- Identifying a clear service owner for the application/service is imperative, so that someone is empowered to make tradeoff decisions (e.g. cost vs. performance) and the flexibility of public cloud can be fully realized.
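To make the elasticity and cost points above concrete, here is a back-of-the-envelope comparison of an always-on server versus one that runs only during busy periods. All rates and usage patterns in this sketch are hypothetical, not actual provider pricing:

```python
# Hypothetical elasticity savings estimate. All rates and usage
# patterns are illustrative assumptions, not real pricing.

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_cost(rate_per_hour, hours_running=HOURS_PER_YEAR):
    """Cost of running one instance for the given number of hours."""
    return rate_per_hour * hours_running

# Assumed scenario: heavy use 12 hr/day during a 9-month academic
# year, idle otherwise.
busy_hours = 12 * 30 * 9  # ~3240 hours

always_on = annual_cost(0.50)              # on 24x7 at an assumed $0.50/hr
elastic   = annual_cost(0.50, busy_hours)  # shut down when idle

savings = always_on - elastic
print(f"always-on: ${always_on:,.0f}, elastic: ${elastic:,.0f}, "
      f"savings: ${savings:,.0f}")
```

The point is not the specific numbers but the shape of the tradeoff: the larger the gap between peak and baseline usage, the more an elastic deployment saves.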
Data Gravity
Generally, computing resources should be placed as close as possible to the bulk of the data they operate on. Often this creates a natural grouping of applications or services that should be deployed and run together, either in an off-premise public cloud or an on-premise private cloud.
Service Characteristics
- Applications which can scale horizontally (adding more servers) rather than vertically (larger servers) tend to benefit more from off-premise public cloud infrastructures
- Applications which do not support clustering or other high availability options are generally not well suited for off-premise public cloud given that a single server in off-premise public cloud tends not to have a clear SLA on availability
- Container friendly or cloud-native - some applications have been designed from the ground up to run in off-premise public cloud environments and likely will benefit significantly from being placed there. This is especially true of container based applications as well as cloud-native services.
- Legacy services that rely on older architectures should be examined closely. Many older architectures were not designed with cloud in mind and so may require a specific cloud to function.
- Latency sensitive applications - applications that are especially sensitive to network latency must be examined carefully to determine whether an off-premise public cloud service, where network latency is less predictable and low-latency networking may be less cost-effective than on-premise private cloud, is appropriate. Examples include:
- Internal sensitivity, such as cluster heartbeats or reliance on InfiniBand-type technologies (e.g. for message passing in HPC applications)
- External sensitivity such as database links across applications used in synchronous operations
- Identify acceptable performance levels or other key metrics for the current service (or define them for a new service) so that a move to off-premise public cloud can be measured and its success judged against those metrics.
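One lightweight way to establish such baseline metrics is to time a representative operation repeatedly and track percentile latencies before and after a migration. A sketch, where `operation` is a stand-in for a real service call:

```python
import statistics
import time

def measure_latencies(operation, runs=100):
    """Time repeated calls to an operation; return latencies in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def summarize(latencies):
    """Median and 95th-percentile latency, common baseline metrics."""
    cuts = statistics.quantiles(latencies, n=20)  # 19 cut points
    return {"p50_ms": statistics.median(latencies),
            "p95_ms": cuts[18]}  # last cut point = 95th percentile

# Stand-in for a real service call (e.g. an HTTP request to the app).
baseline = summarize(measure_latencies(lambda: time.sleep(0.001)))
print(baseline)
```

Capturing these numbers before a migration makes "success" a measurable claim rather than an impression.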
Natural Affinity
- Applications may have a natural affinity to a particular cloud environment; e.g. Microsoft services may run more smoothly on Microsoft Azure
- Certain services may only exist in a particular cloud or be much stronger in a certain cloud at a particular time
Security/Compliance
- Based on various prioritization decisions and constraints both within and outside OIT’s span of control, OIT may choose to support certain security or compliance requirements only in certain clouds. These will be documented as we move forward but may necessitate a specific location for certain applications.
- Data classification may drive a specific choice of environment. More guidance on this will be provided as we define off-premise public cloud security in concert with the security office.
Appendix A - Examples
Example 1: Earth Science Data Analysis
Problem: A research group uses a community-provided Docker image to re-grid Landsat and NEX imagery (approx. 1 TB per run) and to correlate the results with a precipitation map dataset. Each run can parallelize to four cores and requires about 16 GB RAM. Each run produces about 25 GB of output.
Possible solution: Because most of the necessary input data is available as Amazon Public Data sets, data gravity considerations suggest doing the computational work in EC2. Since the necessary analysis software stack is provided in an existing Docker image, EC2 is a container-friendly choice due to its natural Docker support. The required node configuration is available in EC2. Finally, any data egress charges will be modest because the size of the output results is not large.
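A quick sanity check on those egress charges. The per-GB rate below is an assumption for illustration; actual AWS pricing varies by region, tier, and date, so verify before budgeting:

```python
# Rough egress-cost sanity check for Example 1.
# ASSUMPTION: ~$0.09/GB egress rate -- illustrative, not authoritative.

EGRESS_RATE_PER_GB = 0.09  # assumed rate
output_gb_per_run = 25

def egress_cost(gb, rate=EGRESS_RATE_PER_GB):
    """Data-transfer-out cost for the given number of GB."""
    return gb * rate

per_run = egress_cost(output_gb_per_run)
print(f"~${per_run:.2f} per run, ~${egress_cost(25 * 100):.2f} for 100 runs")
```

Even a full season of runs stays in the low hundreds of dollars, which supports the "egress charges will be modest" conclusion above.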
Example 2: Earth Science Data Product Creation
Problem: A research group has contracted to produce a time series analysis of satellite imagery. The input data totals about 100 TB and must be downloaded from a NASA server. The analysis process is computationally intensive and can run efficiently on a 24-core node using about 100 GB RAM. The total estimated computational time required is about 8000 node-hours (i.e., about one node-year). The data product, consisting of about 150 TB, is due in 18 months.
Possible solution: In this case, cost is a significant consideration between on- and off-prem options, especially given the low need for agility and elasticity. Even if only 20 TB of the input and output data is stored in Amazon S3 at a time, at $275/TB/yr that adds up to $5,500, not including any egress fees. An EC2 "r3.4xlarge" on-demand VM instance costs $1.33/hr; after a year of running continuously, the bill for that would be over $11,500. By comparison, storing the same 20 TB in RC "PetaLibrary Active" storage at $45/TB/yr would cost $900/yr, and a prioritized-access 24-core node in the RC Blanca "condo" cluster costs under $7,000 for its 5-year lifetime. Thus, the expected total cost in Year 1 is about $17,000 from Amazon and about $7,900 from RC.
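The arithmetic behind this comparison can be checked directly from the rates quoted above (which were current at the time of writing; verify current pricing before budgeting):

```python
# Year-1 cost comparison for Example 2, using the rates quoted above.
# Egress fees are excluded in both cases, as in the text.

HOURS_PER_YEAR = 24 * 365  # 8760

# Off-prem (AWS) estimate.
s3_storage  = 20 * 275                  # 20 TB in S3 at $275/TB/yr
ec2_compute = 1.33 * HOURS_PER_YEAR     # r3.4xlarge on-demand, all year
aws_total   = s3_storage + ec2_compute

# On-prem (Research Computing) estimate.
petalibrary = 20 * 45                   # 20 TB PetaLibrary Active at $45/TB/yr
blanca_node = 7000                      # 24-core Blanca condo node, 5-yr life
rc_total    = petalibrary + blanca_node

print(f"AWS ~${aws_total:,.0f} in Year 1, RC ~${rc_total:,.0f} in Year 1")
```

Note that the Blanca figure is a one-time purchase amortized over 5 years, so the RC advantage grows in Years 2 through 5.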
Example 3: Analyzing Tweets Using a Database
Problem: A postdoc intermittently downloads data from the Twitter streaming API. She loads up to 300 GB at a time in a MongoDB database for further analysis. Her tools work best if 2500 IOPS are available for the database.
Possible solution: In this case, no current on-prem options offer acceptable performance for a high-transaction-rate, high-IOPS database, so an off-prem solution is needed.
Example 4: Highly-Parallel Astrophysics Application
Problem: A CU faculty member uses a community-developed application that scales to 64 nodes using MPI (Message Passing Interface) parallelism. He runs several 20-hr jobs every week and his workflow can tolerate a queue wait time of around 24 hrs per job.
Possible solution: In this case, an extremely low-latency network is necessary to enable horizontal scaling of MPI performance across more than one node. Off-premise options include several High-Performance Computing cloud providers, such as Microsoft Azure HPC, which include InfiniBand networking between nodes. The on-premise Summit supercomputer also provides a specialized low-latency inter-node network. Since the researcher is OK with some loss in agility due to queue wait on a shared cluster, the choice may come down to cost. Azure HPC bills at $0.11/core-hr, so a single 1536-core 20-hr job would cost him about $3,400. Summit, in comparison, has no direct cost to CU Boulder researchers. Note, though, that 150 1536-core 20-hr jobs per year adds up to about 4.6 million core-hours per year, which is nearly 10 percent of the CU Boulder share of Summit. The faculty member may be more likely to receive an allocation of that size on a national supercomputer, and may thus consider submitting a proposal to XSEDE to cover part of his HPC needs. (XSEDE resources have no direct cost to the user but do require a substantial written proposal.)
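The core-hour arithmetic above can be verified as follows (rates as quoted; Azure pricing changes over time):

```python
# Core-hour arithmetic for the MPI workload in Example 4.
cores = 1536          # as quoted above (64-node MPI job)
hours_per_job = 20
jobs_per_year = 150
azure_rate = 0.11     # $/core-hr, as quoted above

core_hours_per_job = cores * hours_per_job               # 30,720
cost_per_job = core_hours_per_job * azure_rate           # ~$3,400
annual_core_hours = core_hours_per_job * jobs_per_year   # ~4.6 million
annual_cost = cost_per_job * jobs_per_year               # ~$507,000 on Azure

print(f"{core_hours_per_job:,} core-hr/job, ~${cost_per_job:,.0f}/job, "
      f"{annual_core_hours/1e6:.1f}M core-hr/yr, ~${annual_cost:,.0f}/yr")
```

At roughly half a million dollars per year on Azure, a shared on-premise cluster or a national allocation is clearly worth the loss of some agility here.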
Example 5: Time-sensitive Drone Data Analysis
Problem: A research group collects LIDAR data from drone flights during a two-week field campaign. Each night they need to analyze 5 GB of data to determine the needed characteristics for the next day's flights. The different components of the analysis are computationally intensive and run concurrently on six 4-core nodes or VMs; each run requires 6 node-hours.
Possible solution: This workload is not very data-intensive but scales horizontally and requires bursty (elastic and short time to live) access to computational resources. It must start and finish within a specified period. Thus a dedicated on-prem resource would frequently sit idle, and a shared resource such as Summit is not guaranteed to provide the required turnaround time. In this case, spinning up VMs on demand in an off-prem elastic cloud is most appropriate.
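A rough estimate of turnaround and cost for this bursty pattern. The VM rate below is an illustrative assumption, not a quoted price:

```python
# Nightly burst workload from Example 5: six 4-core nodes, 6 node-hours
# per run, so each run finishes in about an hour of wall time.
# ASSUMPTION: $0.20/hr for a 4-core VM -- illustrative, not a quote.

nodes = 6
node_hours_per_run = 6
vm_rate = 0.20  # assumed $/node-hr

wall_hours_per_run = node_hours_per_run / nodes   # 1.0 hr of wall time
cost_per_run = node_hours_per_run * vm_rate       # cost of one nightly run
campaign_cost = cost_per_run * 14                 # two-week campaign

print(f"{wall_hours_per_run:.0f} hr wall time per run, "
      f"~${campaign_cost:.2f} for the whole campaign")
```

Whatever the exact rate, a campaign measured in tens of dollars on demand beats a dedicated resource that sits idle the other fifty weeks of the year.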
Example 6: Database cluster (Oracle)
Problem: A packaged application is supported only on an Oracle database. The service owner would like the service to be highly available and therefore would like the database to be clustered such that a single node failure can be sustained with no loss of application availability.
Possible solution: Using Oracle’s standard cluster technology (RAC) requires a low latency heartbeat connection between nodes running the Oracle database. This configuration is latency sensitive and therefore would be better located in an on-premise private cloud where this latency can be tightly controlled.
Alternate solution: Amazon does provide for database clustering in the managed database RDS service for Oracle. If other factors drive the application to public cloud, this could be another option to consider.
Example 7: Kafka message broker
Problem: An analytics application chooses to use the Kafka message broker for message streaming and stream processing. The load on this broker will be highly variable depending on data volumes coming from public websites and administrative applications which vary in usage throughout the academic calendar.
Possible solution: This workload scales horizontally very well, since Kafka is designed to run across many small machines. The workload is also bursty (elastic, with a short time to live), with a need to scale up on demand to handle large spikes in input. A dedicated on-premise solution would likely sit idle most of the time waiting for the few workload spikes that occur through the academic calendar, so this is a better fit for off-prem public cloud, where it can take advantage of elastic resources.
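A simple threshold-based scaling policy illustrates how elastic resources can track this kind of bursty load. The broker counts, per-broker capacity, and load figures below are all hypothetical:

```python
# Hypothetical threshold-based autoscaling policy for a bursty broker
# workload. All capacities and load numbers are illustrative.

MIN_BROKERS, MAX_BROKERS = 2, 10
MSGS_PER_BROKER = 10_000  # assumed per-broker throughput capacity (msg/s)

def brokers_needed(msgs_per_sec):
    """Scale broker count to the incoming message rate, within limits."""
    needed = -(-msgs_per_sec // MSGS_PER_BROKER)  # ceiling division
    return max(MIN_BROKERS, min(MAX_BROKERS, needed))

# Load sweeping through a hypothetical academic-calendar spike:
for load in [3_000, 8_000, 45_000, 120_000, 5_000]:
    print(f"{load:>7} msg/s -> {brokers_needed(load)} brokers")
```

An elastic cloud lets the cluster shrink back to the minimum after each spike, which is exactly the behavior a fixed on-premise footprint cannot provide.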
Example 8: Using Identifiable Human Subject Data in Research
Problem: A researcher would like to deploy a backend for a phone app that collects information classified as HIPAA-sensitive or similarly regulated data.
Possible solution: Running a backend continuously for a phone app with variable usage requires agility and elasticity, and the app also requires a protected environment for the HIPAA-sensitive data. An off-prem cloud solution may be the best fit, given the need for dedicated, continuous compute resources to run the backend and because on-prem storage through CU Research Computing cannot accommodate sensitive data at this time.
Appendix B - Possible stages of moving on-prem IT workloads to the cloud
- (Optimize) Optimize existing university resources by identifying workloads, retiring what is no longer needed, choosing appropriate SaaS replacements, and migrating those workloads.
- FTE time, and money
- Shift from CAPEX to OPEX. FTE change in responsibilities.
- (Lift-N-Shift) How quickly can you lift-and-shift to the cloud a workload that cannot be replaced by SaaS?
- can be used to report on utilities: space, power, HVAC
- (Reduce) How much can you reduce a cloud workload’s virtual footprint (CPU, memory, network, storage) before its performance is affected?
- can be used to compare on-prem CAPEX with cloud OPEX of hardware usage, lifecycling
- scale down whenever possible to avoid unnecessary spend. Shut off dev systems when you leave for the day. Periodically review that all running systems are still required
- (Re-architect) Peel the onion, i.e., begin to maximize cost savings by replacing components of a workload with cloud-native solutions.
- you’re now comparing cloud spend for lift-n-shift vs optimized workload
- (Automate) Improve integration and orchestration of workloads. Infra-as-Code
- FTE time, and money
- IaC, continuous integration, auto-scaling, automation lead to fewer mistakes. Easy to adapt code for new environments.
Stages 1 (Optimize), 4 (Re-architect), and 5 (Automate) lend themselves to quicker Time-to-Science (succeed or fail, both contribute to moving science forward), better security, improved auditing abilities, etc. Of course, measuring intangible values is nearly impossible with an IT-only view; hopefully one day the business can help turn that value into a quantitative number.
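As a concrete instance of the "scale down" advice in the Reduce stage, the savings from stopping dev systems outside working hours are easy to estimate. The hourly rate here is an illustrative assumption:

```python
# Savings from stopping a dev VM outside working hours.
# ASSUMPTION: $0.10/hr VM rate -- illustrative only.

rate = 0.10
hours_per_week_always_on = 24 * 7   # 168 hrs: never shut off
hours_per_week_workday   = 10 * 5   # on 10 hr/day, weekdays only

always_on = rate * hours_per_week_always_on * 52
workday   = rate * hours_per_week_workday * 52

print(f"always-on ~${always_on:,.0f}/yr vs workday-only ~${workday:,.0f}/yr "
      f"per dev VM ({100 * (1 - workday / always_on):.0f}% saved)")
```

Roughly 70% of the spend on each dev VM disappears just by matching its uptime to the workday, before any re-architecting at all.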