Any good DevOps practitioner will tell you that orchestration, integration, and measurement are some of the key ingredients to a successful DevOps pipeline. Orchestration is critical to accelerating deployments and ensuring they are executed consistently. The bottleneck to a CI/CD pipeline will always be the task that is not automated and doesn’t work with an orchestration engine. In other words, it is manual and not integrated.
Integration is essential to working with different tools, including orchestrators and automation toolsets. Open APIs and well-defined standards are requirements if DevOps admins are to be comfortable incorporating a solution into their toolset. The tool should also be relatively simple. It's more important to have a tool that does one thing very well than to have one that does many things poorly, sometimes called the Swiss Army Knife effect.
Tools or resources also need to provide feedback to the larger ecosystem. A good orchestration engine needs to provide appropriate levels of feedback to the pipeline so process improvements can be made. It is also assumed that the deployed solution will have the correct levels of instrumentation, providing feedback and performance information to the orchestration tool so that it can make intelligent decisions.
There is a certain level of confusion between orchestration and automation. The terms tend to be used in close proximity, and as a result they are often conflated. But make no mistake, automation and orchestration are two distinct entities. Automation is taking a manual process and using some technology to make the process automatic. A classic example would be scripting user provisioning in Active Directory. The process for creating a new user involves multiple steps that can be executed manually in the Active Directory Users and Computers management console, but that process does not scale and is prone to mistakes. Most administrators will at some point script out the user provisioning process so that they can ensure it is performed consistently and can be used on a larger scale. Scripting out the user provisioning process is automation. A manual process has been converted to an automated one.
The user provisioning process in Active Directory is only one component of the onboarding process for a new user. They are also going to need a mailbox, a home drive, access to applications, and a workstation. Each of those tasks can likewise be automated using scripting or software. An orchestration application would take each of these tasks and control the workflow and firing of each one based on a logical pipeline. Microsoft even has a product for exactly this, known at one point in its lifecycle as System Center Orchestrator.
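The distinction is easy to see in code. A minimal sketch of an orchestration engine for the onboarding workflow might look like the following, where each step is an already-automated task (the task bodies here are hypothetical stand-ins) and the orchestrator's only job is to sequence them and react to failures:

```python
# Each automated task is a function; the orchestrator binds them into a pipeline.
def create_ad_account(user):
    # Hypothetical naming convention: first initial + last name.
    user["account"] = f"{user['first'][0]}{user['last']}".lower()
    return user

def create_mailbox(user):
    user["mailbox"] = f"{user['account']}@example.com"
    return user

def provision_home_drive(user):
    user["home_drive"] = f"\\\\fileserver\\home\\{user['account']}"
    return user

def onboard(user, steps):
    """Run each automated step in order, halting the pipeline on failure."""
    for step in steps:
        try:
            user = step(user)
        except Exception as err:
            print(f"Step {step.__name__} failed: {err}")
            break
    return user

new_hire = onboard({"first": "Jane", "last": "Doe"},
                   [create_ad_account, create_mailbox, provision_home_drive])
print(new_hire["mailbox"])  # jdoe@example.com
```

Each function on its own is automation; the `onboard` loop, deciding what runs, in what order, and what happens when a step fails, is the orchestration.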
Orchestration takes multiple automated processes and binds them together into a cohesive workflow. As important as orchestration might be to provision new users in an organization, orchestration is even more important when provisioning and managing cloud-native applications. Some of the key factors in a cloud-native application are automated provisioning of infrastructure, continuous delivery of software through a DevOps pipeline, and loosely coupled, modular services. The DevOps pipeline requires orchestration to ingest code, create a build, test it, and deploy it automatically. Infrastructure deployment and management needs to be orchestrated so that it can react to the changing needs of the application it supports. And the actual application itself needs an orchestrator to deploy the application, monitor its state, and manage its scale.
As things move towards a multi-cloud world, orchestration can be a great benefit, ensuring that an application is deployed consistently and rapidly to any cloud environment that is advantageous to the business. While that sounds ideal, data and data gravity throw a wrench in the gears of the orchestration machine. Most components of an application are easily re-deployed in a new environment, since they do not store state. The compute and networking can be defined in Infrastructure as Code, and the stateless portions of the application can be packaged up in containers. The hardest thing to move is the data, and so all other components of an application tend to gravitate towards the data store. A concrete example would be an ERP system with a 10TB database backend housed in AWS. The web tier and application tier of the application may be stateless, or only have a small amount of state data. They could easily be moved to Microsoft Azure, but the performance of the application would be terrible if the database doesn't move as well. The 10TB database creates a data gravity well that keeps all other application components orbiting in close proximity.
Managing the movement of data in a multi-cloud world is a considerable challenge. There are several possible approaches to solve the issue. Each data source could replicate synchronously to a target in another cloud. Aside from being tremendously expensive, this solution would require an acknowledgement from the target side prior to committing any write operation. That may work in a setup where the data source and target are in close proximity, but synchronous data replication across public clouds is going to introduce a performance penalty that outweighs the benefit.
Another alternative is for each data source to perform asynchronous replication to a target in another cloud. This solution would still be very expensive. Cloud vendors charge based on egress traffic, and an asynchronous replication of all data would create a lot of egress traffic. The storage cost for the replication would also be double, since the data is being stored in at least two places. Taking the example of a 10TB database in AWS, the storage cost of housing a single instance of the database would be roughly $0.10 a GB, or $1,000 for the full 10TB on EBS storage. The cost of housing a corresponding database in Microsoft Azure would be about $1,200 for 10TB of storage on Premium SSD Managed Disks. Assuming that the change rate is roughly 10% a month, then 1TB of changes would egress the AWS network every month, costing roughly $100. The more writes an application has, the higher the cost of egress data.
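The arithmetic behind those figures is simple enough to sanity-check. The rates below are the approximate list prices cited above, not quotes:

```python
# Back-of-the-envelope check of the storage and egress costs for the
# 10TB asynchronous replication scenario described above.
GB_PER_TB = 1000
db_tb = 10

aws_ebs_per_gb = 0.10      # $/GB-month, EBS (approximate)
azure_ssd_per_gb = 0.12    # $/GB-month, Premium SSD Managed Disks (approximate)
egress_per_gb = 0.10       # $/GB, rough AWS egress rate
change_rate = 0.10         # 10% of the data changes per month

aws_storage = db_tb * GB_PER_TB * aws_ebs_per_gb                  # ~$1,000/month
azure_storage = db_tb * GB_PER_TB * azure_ssd_per_gb              # ~$1,200/month
monthly_egress = db_tb * change_rate * GB_PER_TB * egress_per_gb  # ~$100/month

print(round(aws_storage), round(azure_storage), round(monthly_egress))
# 1000 1200 100
```

Note that the egress line scales linearly with the change rate, which is why write-heavy applications fare so poorly under this scheme.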
Rather than replicating all of the data, it might be useful to replicate only the metadata. In an active-active multi-cloud deployment, knowing where the data is stored, and some information about it, might be more important than actually housing the data locally to every instance of the application. An intelligent load balancing mechanism could redirect the session to the public cloud instance that houses the data being requested.
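A sketch of that metadata-only approach: every instance of the application keeps a small catalog describing which cloud holds each dataset, and the load balancing layer routes the session to the instance co-located with the data. All names and endpoints here are illustrative:

```python
# Metadata catalog replicated to every cloud; the bulk data stays put.
DATA_CATALOG = {
    "orders":    {"cloud": "aws",   "size_gb": 10_000},
    "invoices":  {"cloud": "azure", "size_gb": 2_500},
    "telemetry": {"cloud": "gcp",   "size_gb": 800},
}

# One application instance per cloud (hypothetical endpoints).
APP_ENDPOINTS = {
    "aws":   "https://app.aws.example.com",
    "azure": "https://app.azure.example.com",
    "gcp":   "https://app.gcp.example.com",
}

def route_session(dataset):
    """Send the session to the app instance co-located with the data."""
    location = DATA_CATALOG[dataset]["cloud"]
    return APP_ENDPOINTS[location]

print(route_session("invoices"))  # https://app.azure.example.com
```

Replicating a few kilobytes of catalog instead of terabytes of data sidesteps both the egress bill and the data gravity problem, at the cost of cross-cloud latency for any session that lands far from its data.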
In an ideal world, data replication would be orchestrated as part of the application deployment. Decisions about the level of replication – synchronous, asynchronous, metadata only – could be baked into the configuration of the application and executed properly by an orchestration layer. Of course, this assumes some type of common data storage layer across multiple clouds. It would also be ideal if the solution could perform some type of data deduplication and compression prior to sending replicated data across the wire. That would save dramatically on storage and network egress costs, as well as potentially speeding up the replication process.
At Cloud Field Day 3, Eiki Hrafnsson from NetApp demonstrated their Cloud Volumes and Cloud Volumes ONTAP. Those two services could help form the storage layer across clouds to support simplified replication of data. Since both of the services are accessible via an API, it would also be possible to automate the provisioning and configuration process of the storage layer, integrating it into an orchestration suite. He also demonstrated the NetApp Cloud Orchestrator, which is capable of orchestrating storage provisioning and application deployment via Kubernetes.
At first it seemed strange that NetApp had developed their own multi-cloud application orchestrator, but in fact they are using Qstack, part of what NetApp gained from the Greenqloud acquisition. Instead of leaving Qstack to die on the vine as an acquisition casualty, NetApp has wisely chosen to continue developing the product. Their decision reflects their new corporate messaging as a Data Management company. Even if the Qstack product never takes off, it shows other orchestration platforms how they could be using NetApp's products to automate the management of storage across multiple clouds.

To further assist the multi-cloud approach, NetApp also demonstrated Cloud Sync. This product enables the replication, migration, and synchronization of data between disparate sources. The process flows through a NetApp Data Broker, which ideally would be applying some level of compression and deduplication to the flows. The documentation on NetApp's website did not make it clear whether those options are available. The API for Cloud Sync is available for consumption and is called datafabric.io. Data Fabric is one of the central concepts NetApp advances as part of the Data Management message.
As alluded to in the previous section, orchestration is the assembly of multiple processes into a cohesive workflow. Proper orchestration presupposes that the tasks being orchestrated can be integrated into an orchestration engine. In the past, most operations were automated with scripts using an imperative approach, e.g. go create a user account. More recently, services and applications are opening their APIs up to outside sources, and automation can now reference an API endpoint with a declarative set of instructions, e.g. here is a new user to create. The rise of the API has made it simpler to automate processes, and it has made orchestration of those processes more straightforward by providing integration points to the orchestration engine. In the previous example, provisioning a new user in Active Directory would likely be accomplished using the Active Directory PowerShell module. Unfortunately, Active Directory doesn’t have a simple API endpoint for easy integration. As a counterpoint, Azure Active Directory does have an API endpoint through the Microsoft Graph, and while users can still be provisioned using PowerShell, it is just as easy to write a custom integration in any programming language. Doing so in a declarative way abstracts the how of a task and allows admins to focus on the what, i.e. “I need a user provisioned”, rather than, “Here is how to provision a user.”
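To make the imperative/declarative contrast concrete, here is roughly what declarative provisioning against Azure AD through the Microsoft Graph looks like. The payload fields follow the Graph `/users` schema; token acquisition is omitted, and the POST itself is shown but not executed, so treat this as a sketch rather than a working integration:

```python
import json

def new_user_request(display_name, upn, initial_password):
    """Build the declarative 'what'; the service owns the 'how'."""
    return {
        "accountEnabled": True,
        "displayName": display_name,
        "mailNickname": upn.split("@")[0],
        "userPrincipalName": upn,
        "passwordProfile": {
            "forceChangePasswordNextSignIn": True,
            "password": initial_password,
        },
    }

payload = new_user_request("Jane Doe", "jdoe@example.com", "P@ssw0rd!")
print(json.dumps(payload, indent=2))

# With a bearer token in hand, the provisioning itself is a single call:
# requests.post("https://graph.microsoft.com/v1.0/users",
#               json=payload,
#               headers={"Authorization": f"Bearer {token}"})
```

Notice that nothing in the payload says how to create the account; it only describes the user that should exist, which is exactly the abstraction that makes API-driven tasks easy for an orchestration engine to consume.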
Orchestration engines do not live in a vacuum, and it’s quite possible that different orchestration engines will be used for different workflows and applications. Each public cloud uses an orchestration engine of its own to provision and manage resources. When a new Azure VM is created in the Azure portal, a workflow is kicked off to provision the virtual machine, the storage it will use, a virtual network interface, and in some cases a whole virtual network. The Azure Resource Manager farms these tasks out to Resource Providers and orchestrates the provisioning of resources to produce a functioning virtual machine as an end result. Looking at another layer of the stack, Kubernetes orchestrates the lifecycle of application workloads on nodes in a cluster. Kubernetes is responsible for deploying containers, configuring the network, carving out storage, and monitoring the health of the nodes in its cluster. Moving up the stack again, Jenkins manages the workflow of a CI/CD pipeline. It is orchestrating the integration and delivery of applications through multiple stages of a pipeline, each of which could be calling some other orchestration tool to accomplish the tasks in that stage. From this perspective, it’s turtles all the way down.
While this may start to seem like unnecessary complexity, the reality is that each orchestration engine and automation tool should follow the Unix ethos of doing one thing and doing it very well. When an orchestration engine tries to do too many things, it ends up a bloated mess that doesn’t do anything particularly well. It is far better to take a building block approach, where different automation and orchestration tools snap together to suit the unique needs of each application and organization. The best way to accomplish that is for vendors to provide simple integration points for other tools to integrate with their offering.
During the NetApp briefing at Cloud Field Day 3, Eiki mentioned several times that NetApp is not a storage company, they are a Data Management company providing a Data Fabric. At first blush, that sounds like effusive, meaningless jargon. What do they mean by Data Management and Data Fabric? After further consideration, it does appear that they are onto something. Data Fabric sounds like something you might pick up at Joann Fabrics, along with a bolt of velour for a brand-new tracksuit. In reality, the Data Fabric could better be described as the layer that manages and provisions storage across a multi-cloud environment. That is certainly something that NetApp is trying to do. An example of how they are integrating into orchestration tools is the Trident dynamic storage provisioning tool that works with Kubernetes. Trident allows Kubernetes to dynamically provision new storage, such as iSCSI LUNs or NFS volumes by making a call to Trident through the StorageClass API object. While Trident was developed for use with Kubernetes, it could be adapted to integrate with other orchestration tools, such as Apache Mesos or HashiCorp’s Terraform. The creation of integration tools places NetApp in a position to be a data fabric provider for an organization.
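On the Kubernetes side, the Trident integration surfaces as an ordinary StorageClass. The fragment below is illustrative, with the `backendType` value and names chosen as examples; consult the Trident documentation for the backends in your environment:

```yaml
# A StorageClass that delegates dynamic provisioning to Trident.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nfs
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas"
---
# Any PersistentVolumeClaim naming this class triggers Trident to carve
# out a new volume on the backing storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  storageClassName: ontap-nfs
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
```

From the application's perspective the storage just appears; the details of LUNs and NFS exports stay behind the StorageClass abstraction.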
One of the often-overlooked components of DevOps is the monitoring and improvement portion of the cycle. It is great to automate, orchestrate, and release things at a frighteningly fast cadence. While it might be easy to tell if things are going well or poorly, without proper monitoring it is incredibly difficult to know why. Since DevOps should be a constant cycle of improvement, knowing the why is critical to making informed changes that improve the overall stability and performance of an application.
Each component that is being orchestrated and automated needs to be able to report back its state and key metrics. Compute resources need to report back their current CPU load, memory consumption, and thread count. Network resources need to report their current bandwidth usage, packet flows, and port status. Likewise, storage needs to be able to report on key metrics that will help the orchestration tools make automated decisions or surface up that data to an operator who can decide on changes that need to be made.
As machine learning evolves and gains maturity, there is an expectation that orchestration platforms will start assisting with the decision-making process, and in some cases make decisions on their own. There are a number of areas where advanced orchestration could assist:
- Self-healing (running out of space, service is down)
- Cost saving (move a workload, reduce network egress)
- Performance boosting (change app tier, storage tier, load into cache)
By constantly monitoring Key Performance Indicators (KPIs), an advanced orchestration tool can make intelligent decisions regarding the configuration of the application. For example, if a particular dataset resides in lower tier storage, like S3 Infrequent Access, and it is accessed a certain number of times, then an orchestration tool might automatically move the dataset to standard S3 storage. A more complicated process might monitor the cost of spot instances in AWS, and only provision batch workloads when the price drops below a specific number.
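The S3 tiering example reduces to a simple rule an orchestrator could evaluate against its KPIs. The threshold and tier names below are illustrative, not AWS pricing break-even points:

```python
# KPI-driven tiering rule: promote a dataset out of an infrequent-access
# tier once its monthly access count crosses a threshold.
ACCESS_THRESHOLD = 100  # hypothetical break-even accesses per month

def recommend_tier(current_tier, monthly_accesses):
    """Return the tier the orchestrator should move the dataset to."""
    if current_tier == "s3-infrequent-access" and monthly_accesses >= ACCESS_THRESHOLD:
        return "s3-standard"
    return current_tier

print(recommend_tier("s3-infrequent-access", 250))  # s3-standard
print(recommend_tier("s3-infrequent-access", 12))   # s3-infrequent-access
```

The spot-instance case is the same pattern with a price feed in place of the access counter; what elevates either from scripting to orchestration is wiring the rule into a loop that monitors the KPI and acts on the recommendation.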
It should come as no surprise that storage vendors have always provided some level of KPIs for their products. Unfortunately, many of the storage vendors required specialized tools to extract and analyze more complicated information. Viewing the historical I/O stats and hardware performance for some arrays required a vendor partner or SA to take log captures from the client and run them through proprietary software the client didn't have access to. That type of situation doesn't really work with the DevOps approach to monitoring. A more germane example of monitoring comes from the public cloud providers, like Azure with Azure Monitor. That service is capable of ingesting large amounts of performance data from services both in Azure and outside, and then performing some level of analysis and analytics based on the ingested information. The service can then take some level of action or be queried by some other orchestration tool to make decisions.
Although it had not been announced during the Cloud Field Day 3 presentation, Eiki did hint that NetApp was going to enter the monitoring space. Since then, NetApp has announced a preview of their Cloud Insights service, which is intended to monitor cloud-based infrastructure for performance, cost, and optimization. This is a red-hot space, based on the acquisitions of CloudHealth by VMware, Cloud Cruiser by HPE, and Cloudyn by Microsoft. Being able to monitor cloud resources and optimize their usage is critical not only to the DevOps sect, but also to the Finance department of organizations.
Closing the Loop
Measuring and optimizing application performance closes the DevOps feedback loop and helps drive real business value to the organization. Whether that value is reducing cost, improving efficiencies, or increasing adoption, being able to properly measure key metrics and make informed decisions is a foundational need of any organization that is trying to adopt a DevOps culture. Products and services that provide the necessary level of instrumentation for proper measurement, simple to use integration points, and orchestration capabilities will find they have an edge over the competition. NetApp seems to understand this, and it would be worthwhile for other vendors to take notice and respond in kind.