SRE Principles for Edge Management and Improving Resiliency Using the Best of Kubernetes

February 3, 2023

This post was co-written by Kirti Apte and Gabry (Maria Gabriella) Brodi.

Over the last couple of years, customers have been adopting Kubernetes and microservice-based application deployment models for various technology and business reasons. Many are now looking to the next set of use cases, which include applications that span multiple clouds as well as edge clouds.

While edge computing is not a new need, Kubernetes adoption at the edge is a compelling scenario for companies that have edge as an integral part of their application deployment, in sectors such as retail, healthcare, manufacturing, telecommunications, and logistics. The ability to consume cloud native applications at the edge is of the utmost importance in order to compete in a market where milliseconds of latency can have a tremendous impact on business results.

Deploying and operationalizing applications at the edge differs significantly from a centralized management model within a data center because of its inherent complexities (e.g., quality of inbound/outbound connectivity, limited resources, lack of expertise on site, and the scale of endpoints).

In this blog, we will focus on site reliability engineering (SRE) at the edge. We will talk about effective ways to deploy, operationalize, and manage edge deployments using Kubernetes and SRE. While the core SRE principles are largely applicable to edge scenarios, these are not entirely sufficient. It is necessary to consider additional principles around recoverability, upgradability, scalability, security, and compliance. And so we will discuss these additional practices for adopting edge topologies and the SRE-based operational model.

Consider the use case of a nationwide retailer that needs to implement a point of sale system and self-checkout in its stores around the nation, give users (e.g., store managers and business teams) local access to the applications, and let the core technical team operate and manage applications remotely. In this scenario, edge locations can be expected to have frequent network disruptions or power outages, as well as no operator on site. Therefore, each edge location (a.k.a. mini store) needs the ability to run in isolation and to recover from power or network loss. One more aspect to factor in is the need to keep footprints as minimal as possible.

A hub and spoke edge architecture helps address the scenario described above. The edge topology provided below is based on the VMware Tanzu Edge Solution Reference Architecture.

In this topology, the Kubernetes management plane resides entirely in the main data center. Only the VMware Tanzu Kubernetes Grid workload clusters are created at the edge. VMware Tanzu Mission Control is used for fleet management, providing an easy way to manage the lifecycle of Tanzu Kubernetes Grid guest clusters at the edge. Additional components for monitoring, backup, and log management can also be present in the main data center. The key points to highlight in this topology are:

  • Ease of Kubernetes lifecycle management
  • Minimal footprint at the edge
  • Ability to recover when a site is disconnected

Let’s consider the previously mentioned use case and take a deep dive into SRE—and some additional—principles.

Embracing risk

Embracing risk means weighing cost against the impact on customer satisfaction. Enterprises choose edge architectures to keep costs down (based on latency impacts), expand easily, and, at the same time, improve customer satisfaction through better performance, security, reliability, and scalability. SRE at the edge needs to consider the tradeoff between cost and meeting customer-driven key performance indicators (KPIs). Some of the ways to maintain a low cost at the edge are listed below:

  • Allocate only the resources that are needed for processing and analysis.
  • Ensure the edge architecture is less complex and therefore easier to monitor and upgrade.
  • Consider faster recoverability of the system over high availability. Occasional service interruption can be tolerated for a “short” amount of time (recovery time objective and recovery point objective are defined case by case).
  • Establish a flexible architecture so that only part of the application can be switched off, instead of the entire system.
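
To make the recoverability tradeoff concrete, the arithmetic behind an availability target can be sketched as follows. This is a minimal illustration: the targets, the 30-day window, and the 20-minute recovery time are assumed values, not figures from this use case, and the function names are hypothetical.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of downtime that still meet the target over the window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    # Round to hide floating-point noise in (1 - target).
    return round(total_minutes * (1.0 - availability_target), 6)


def outages_within_budget(availability_target: float, recovery_minutes: float,
                          window_days: int = 30) -> int:
    """Return how many outages of the given recovery time fit in the downtime budget."""
    if recovery_minutes <= 0:
        raise ValueError("recovery_minutes must be positive")
    budget = allowed_downtime_minutes(availability_target, window_days)
    return int(budget // recovery_minutes)


# Hypothetical targets for one edge site, evaluated over a 30-day window.
for target in (0.99, 0.999):
    budget = allowed_downtime_minutes(target)
    count = outages_within_budget(target, recovery_minutes=20)
    print(f"target={target}: {budget} min of budget, {count} outages of 20 min each")
```

A site that recovers quickly can absorb several short outages inside a modest availability target, which is why fast recovery can be a cheaper goal than full high availability.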

Additionally, to keep costs down, engineers should recommend investing in proper observability tooling for root cause analysis and postmortems.

Service level objectives

Service level indicators (SLIs) and service level objectives (SLOs) are two of the most important concepts in SRE, and are key to establishing an observability culture. Edge distributed systems add an extra layer of complexity for setting up SLIs and SLOs. We can define setup practices for edge deployments as follows:

  • Identify system boundaries
  • Define each system’s capabilities
  • Define SLIs and SLOs needed for each system based on the capabilities
  • Measure, adjust, and retune SLIs, and refine SLOs accordingly
  • Communicate SLIs and SLOs to the organization
  • Generate consolidated dashboards for all edge sites to observe and monitor any breached SLIs and SLOs, identify root cause, and take action

Consider the example of our use case scenario and let’s apply the above paradigms to define the SLIs and SLOs.

Let’s break down the hub and spoke topology with regard to our example. The management cluster in the central data center provisions the workload clusters at each edge location and manages their lifecycle. Application workloads are deployed on the workload clusters. As the image shows, there are five tiers of applications: the system boundaries can be the UI, API, database, storage, or network tier, and capabilities are identified for each tier. For example, for the UI tier, the capability can be to deliver a responsive, fast, and reliable UI. Service level indicators for the UI tier can be response time, error rate, and duration. These SLIs can be tracked using VMware Tanzu Observability by Wavefront as shown below.

Consolidated custom dashboards need to be created to track these SLIs for all edge locations. These indicators can be monitored and tuned, and can be used later to define alerts and scaling policies. However, when dealing with edge topology, you also need to consider offline scenarios, which can be examined through root cause analysis and postmortems.
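
As a rough illustration of what a consolidated view has to compute, the sketch below checks per-site SLI values against SLO thresholds and treats a missing value as a finding, since an offline site reports nothing. The SLI names, thresholds, and site values are hypothetical, and this is not how Tanzu Observability is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Slo:
    sli_name: str
    threshold: float
    higher_is_better: bool = False  # False: the SLI must stay at or below the threshold


# (site, SLI name, observed value or None when no data arrived, threshold)
Breach = Tuple[str, str, Optional[float], float]


def find_breaches(slos: List[Slo], site_slis: Dict[str, Dict[str, float]]) -> List[Breach]:
    """Return every SLO that a site breaches or has no data for, ordered by site name."""
    breaches: List[Breach] = []
    for site in sorted(site_slis):
        observed = site_slis[site]
        for slo in slos:
            value = observed.get(slo.sli_name)
            if value is None:
                # No data: the site may be disconnected, so surface it explicitly.
                breaches.append((site, slo.sli_name, None, slo.threshold))
            elif slo.higher_is_better and value < slo.threshold:
                breaches.append((site, slo.sli_name, value, slo.threshold))
            elif not slo.higher_is_better and value > slo.threshold:
                breaches.append((site, slo.sli_name, value, slo.threshold))
    return breaches


# Hypothetical UI-tier SLOs and observations for three edge sites.
ui_slos = [Slo("ui_response_time_ms", 500.0), Slo("ui_error_rate", 0.01)]
observations = {
    "edge33": {"ui_response_time_ms": 320.0, "ui_error_rate": 0.002},
    "edge34": {"ui_response_time_ms": 810.0, "ui_error_rate": 0.002},
    "edge35": {},  # nothing received, for example because the site is offline
}
for breach in find_breaches(ui_slos, observations):
    print(breach)
```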

Eliminate toil

For edge deployments, there are a few ways to eliminate toil:

  • Identify and automate repetitive tasks. (Automation is explained in a later section of this blog.)
  • Create standards and consistency for tools and processes across all edge sites.
  • Document standard practices for recoverability of edge sites in case they are disconnected from the main site.

For our example use case scenario, we can define the following processes to manage installation, patches, and upgrades for the edge sites.


Monitoring

General considerations for monitoring edge deployments are as follows:

  • The monitoring solutions need to be able to store and forward from edge sites and must be configurable for frequency, filtering, local storage for buffering, prioritization, and more.
  • Collector agents and proxies need to consume the least amount of system resources (e.g., CPU and memory), allowing business applications the most space to operate.
  • Local access to logs/metrics also needs to be available, in case outbound network connectivity is lost.
  • Event logs and monitoring information should be temporarily stored or queued locally until the next synchronization window.
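
The store-and-forward behavior described in the list above can be sketched as a bounded local buffer that keeps the most important records when space runs out and forwards them in priority order once connectivity returns. The class name, the priority scheme, and the `send` callback are assumptions for illustration, not part of any specific agent.

```python
import itertools
from typing import Callable, List, Tuple


class StoreAndForwardBuffer:
    """Bounded local buffer; a lower priority number means a more important record."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items: List[Tuple[int, int, str]] = []  # (priority, sequence, record)
        self._sequence = itertools.count()

    def __len__(self) -> int:
        return len(self._items)

    def add(self, record: str, priority: int = 1) -> None:
        self._items.append((priority, next(self._sequence), record))
        if len(self._items) > self._capacity:
            # Evict the least important record; among equals, evict the oldest.
            victim = max(self._items, key=lambda item: (item[0], -item[1]))
            self._items.remove(victim)

    def flush(self, send: Callable[[str], bool]) -> int:
        """Forward records in priority order; stop at the first failed send."""
        sent = 0
        for item in sorted(self._items):
            if not send(item[2]):
                break  # still offline: keep this record and everything after it
            self._items.remove(item)
            sent += 1
        return sent


# Hypothetical usage: buffer while offline, then forward at the next sync window.
edge_buffer = StoreAndForwardBuffer(capacity=3)
edge_buffer.add("cpu=0.4", priority=2)
edge_buffer.add("disk full", priority=0)
edge_buffer.add("mem=0.7", priority=2)
edge_buffer.add("pos error", priority=1)  # buffer is full, so one record is evicted
print(edge_buffer.flush(lambda record: False), "records sent while offline")
print(edge_buffer.flush(lambda record: True), "records sent after reconnecting")
```

Keeping eviction and forwarding order explicit makes the prioritization configurable, which matters when local storage on the edge host is small.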

Monitoring, logging, and tracing agents are deployed on each cluster and can be pushed to edge clusters from centralized locations. These agents can collect the desired metrics from each edge location and send them to the observability tool, which aggregates data from different sources and analyzes and processes the data to create analytical dashboards. For edge deployments, the most common metrics you can focus on are latency, traffic (the amount of load the service is experiencing), error rate (how often service requests fail), and saturation (how close the service is to its resource capacity). Most often, these are the metrics you'll measure as components of your SLOs.
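
As a minimal sketch, the four metrics named above can be derived from raw request samples collected at one site. The record shape is hypothetical, the 95th percentile uses the nearest-rank method, and saturation is approximated here as CPU used over CPU capacity, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestSample:
    latency_ms: float
    ok: bool


def site_signals(samples: Sequence[RequestSample], window_seconds: float,
                 cpu_used_cores: float, cpu_capacity_cores: float) -> Dict[str, float]:
    """Summarize latency, traffic, error rate, and saturation for one time window."""
    if not samples:
        raise ValueError("at least one request sample is required")
    if window_seconds <= 0 or cpu_capacity_cores <= 0:
        raise ValueError("window_seconds and cpu_capacity_cores must be positive")
    latencies = sorted(sample.latency_ms for sample in samples)
    # Nearest-rank 95th percentile, computed with integer arithmetic (ceil(0.95 * n)).
    rank = max(1, (95 * len(latencies) + 99) // 100)
    failures = sum(1 for sample in samples if not sample.ok)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": failures / len(samples),
        "saturation": cpu_used_cores / cpu_capacity_cores,
    }


# Hypothetical window: 20 requests in 10 seconds, two of which failed.
window = [RequestSample(latency_ms=10.0 * i, ok=i not in (7, 13)) for i in range(1, 21)]
print(site_signals(window, window_seconds=10.0, cpu_used_cores=1.5, cpu_capacity_cores=2.0))
```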

For our use case, we are using Tanzu Observability to implement monitoring, logging, and tracing for edge sites. This SaaS platform collects and displays metrics and trace data from the full stack platform, as well as from applications. Because it is a SaaS offering, outbound internet connectivity is required at edge sites. In return, it provides the ability to create alerts tuned by advanced analytics, assists in troubleshooting systems, and helps you understand the impact of running production code.

The metrics that are collected can originate from either your infrastructure or application. Additionally, Kubernetes clusters at edge locations can install the Tanzu Observability collector as an extension in Tanzu Kubernetes Grid to provide visibility into cluster operations and workload characteristics. This helps operators ensure a large fleet of clusters remains healthy, and enables rapid response when health characteristics change.

Edge Kubernetes cluster fleet observability

Edge Kubernetes cluster health

Grafana and Prometheus as an alternative for on-premises solutions

Automation

Identifying and automating repetitive tasks is a key factor in improving development velocity and achieving a faster time to market.

For edge deployments, the following areas can be considered for automation:

  • Deployment – Automating the provisioning of the main site and edge site infrastructures and application resources can be done in several ways.

Any automation tool (e.g., Ansible, Terraform, or configuration as code with GitOps) can be used to provision infrastructure and deploy applications.

For our use case, the following configuration as code is used to provision the management cluster at the main data center and workload clusters at the edge sites.

#! ---------------------------------------------------------------------
#! Basic cluster creation configuration
#! ---------------------------------------------------------------------

CLUSTER_NAME: tkg-edge33-wld01
# CNI: antrea

#! ---------------------------------------------------------------------
#! Node configuration
#! ---------------------------------------------------------------------


#! ---------------------------------------------------------------------
#! vSphere configuration
#! ---------------------------------------------------------------------
VSPHERE_SERVER: vcenter-edge.tanzu.lab
VSPHERE_DATASTORE: /edge33-dc/datastore/edge33-vsan
VSPHERE_FOLDER: /edge33-dc/vm
VSPHERE_NETWORK: /edge33-dc/network/edge33-vds01-tkg
VSPHERE_RESOURCE_POOL: /edge33-dc/host/edge33-Cloud/Resources
VSPHERE_SSH_AUTHORIZED_KEY: "ssh-rsa AAAA[...]= console"
VSPHERE_USERNAME: administrator@vsphere.local

#! ---------------------------------------------------------------------
#! Machine Health Check configuration
#! ---------------------------------------------------------------------


#! ---------------------------------------------------------------------
#! Common configuration
#! ---------------------------------------------------------------------





OS_NAME: photon
OS_ARCH: amd64

#! ---------------------------------------------------------------------
#! Autoscaler configuration
#! ---------------------------------------------------------------------

  • Testing – Tools can simulate use of your services to find bugs and test how your system handles load.
  • Scanning – Container image and application source code scans can be automated.
  • Incident response – Automated runbooks can be maintained to quickly resolve incidents at any edge site.
  • Communication – Tools can send messages to collaboration channels and log key events.
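
The cluster configuration shown earlier in this section is specific to one site (edge33). One way to automate deployment across many sites is to render that file from a template. The sketch below assumes that every site follows the same naming pattern as edge33, which the source does not state, and it only emits keys that appear in the original configuration. The function name is hypothetical.

```python
import re


def render_cluster_config(site: str, vcenter: str = "vcenter-edge.tanzu.lab") -> str:
    """Render the site-specific part of a workload cluster configuration file."""
    # Assumption: site identifiers are short lowercase names such as "edge33".
    if not re.fullmatch(r"[a-z0-9]+", site):
        raise ValueError(f"unexpected site identifier: {site!r}")
    values = {
        "CLUSTER_NAME": f"tkg-{site}-wld01",
        "VSPHERE_SERVER": vcenter,
        "VSPHERE_DATASTORE": f"/{site}-dc/datastore/{site}-vsan",
        "VSPHERE_FOLDER": f"/{site}-dc/vm",
        "VSPHERE_NETWORK": f"/{site}-dc/network/{site}-vds01-tkg",
        "VSPHERE_RESOURCE_POOL": f"/{site}-dc/host/{site}-Cloud/Resources",
        "OS_NAME": "photon",
        "OS_ARCH": "amd64",
    }
    return "".join(f"{key}: {value}\n" for key, value in values.items())


# Hypothetical fleet: render one configuration per edge site.
for site_name in ("edge33", "edge34"):
    print(render_cluster_config(site_name))
```

Rendered files like these can be committed to a Git repository so that each site's configuration is reviewed and versioned like any other change.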

Release engineering

Release engineering includes building and deploying software in a consistent, stable, and repeatable way. Here is a summary of the tools and practices that we used for the continuous delivery described in the next section:

  • To achieve consistency, stability, and repeatability in our use case, we follow a GitOps flow, which allows us to use a declarative approach.
  • The registry of choice is Harbor. Of its many features, the two that we leverage in particular are replication and the trigger engine.
  • For continuous delivery, Argo CD works in conjunction with Flagger for progressive delivery. The combination of these two tools provides an easy but powerful approach to delivery through GitOps.
  • Contour is an ingress controller that allows dynamic configurations.
  • Prometheus collects metrics.

Storage consumption and the bandwidth needed for data transfer can be reduced by building container images so that layers are reused across containers, as the Docker Image Manifest V2, Schema 2 specification allows.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 7023,
    "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 32654,
      "digest": "sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 16724,
      "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 73109,
      "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736"
    }
  ]
}

By using Cloud Native Buildpacks, the organization can standardize each layer for maximum reusability across teams and applications. Each layer can exist as a single copy that is referenced by the image manifest using its digest. Because container images are downloaded to the edge layer by layer, this reduces the amount of data to be stored and transferred from the data center to the edge. Harbor fully supports the Docker V2 image specification, so this advantage carries over when storing container images. This flow is summarized below.
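
The saving from layer reuse can be estimated with simple arithmetic: only layers whose digest is not already present at the edge need to be transferred. The sketch below uses the layer sizes and digests from the manifest shown earlier. Which layers are already cached is an assumption made for the example, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Union

Layer = Dict[str, Union[str, int]]

# Layer entries taken from the manifest in this section (sizes in bytes).
LAYERS: List[Layer] = [
    {"size": 32654,
     "digest": "sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f"},
    {"size": 16724,
     "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b"},
    {"size": 73109,
     "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736"},
]


def bytes_to_transfer(layers: Iterable[Layer], cached_digests: Iterable[str]) -> int:
    """Sum the sizes of the layers that are not already stored at the edge site."""
    present = set(cached_digests)
    total = 0
    for layer in layers:
        digest = str(layer["digest"])
        if digest not in present:
            total += int(layer["size"])
            present.add(digest)  # a digest repeated in the list is fetched only once
    return total


# Assumption: the first two layers were pulled earlier as part of another image.
already_cached = [str(layer["digest"]) for layer in LAYERS[:2]]
print("cold pull:", bytes_to_transfer(LAYERS, []), "bytes")
print("warm pull:", bytes_to_transfer(LAYERS, already_cached), "bytes")
```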

Once the new image is pushed to a centralized Harbor repository, the image is replicated to canary edge locations as shown (Step 2). The canary operation repository is updated once the image is pushed at the edge location (Step 3). Argo CD detects the new configuration change and triggers deployment (Step 4). We use Flagger to implement canary deployment, distributing traffic between the old and new versions of the application. Prometheus collects the metrics of both versions (Step 5). Flagger checks the metrics and health conditions of the canary deployment to promote or stop the rollout (Step 6).
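
The promote-or-stop decision in the last step can be illustrated with a simplified model of a progressive rollout: traffic shifts to the new version in steps while the observed success rate stays healthy, and the rollout stops after repeated failed checks. This is only a sketch of the general idea, not Flagger's implementation, and the parameter names and default values are hypothetical.

```python
from typing import Iterable, Tuple


def run_canary(success_rates: Iterable[float], step_weight: int = 10, max_weight: int = 50,
               min_success_rate: float = 0.99, failure_limit: int = 2) -> Tuple[str, int]:
    """Walk through one success-rate observation per analysis interval.

    Returns the outcome and the share of traffic (in percent) on the canary.
    """
    weight = 0
    failures = 0
    for rate in success_rates:
        if rate >= min_success_rate:
            # Healthy interval: shift more traffic to the new version.
            weight += step_weight
            if weight >= max_weight:
                return ("promoted", weight)
        else:
            # Unhealthy interval: hold the weight and count the failed check.
            failures += 1
            if failures >= failure_limit:
                return ("rolled_back", 0)
    return ("in_progress", weight)


# Hypothetical observations, for example derived from Prometheus request metrics.
print(run_canary([1.0, 0.995, 1.0, 0.999, 1.0]))
print(run_canary([1.0, 0.90, 0.95]))
```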

Once the edge canary testing is successful (Step 1), the main data center Harbor is updated with the new image SHA (Step 2). The new configurations are merged to the master GitOps repository. Argo CD at each location detects the new configuration change and updates the deployment. It also checks metrics and health conditions and reports the success or failure of each site (Step 3).


Simplicity

For edge deployments, simplicity is critical. The edge topology should be the least complex system that still performs as intended. The goals of simplicity and reliability go hand in hand. In edge deployments, a simple system is easier to monitor, repair, and improve.

Additional principles for edge

Based on our experiences, these additional principles on top of SRE principles ensure that you can upgrade, recover, scale, and secure enterprise-grade edge deployments. Let’s visit these additional principles here.


Recoverability

While an edge location could be considered like any other data center, this is not the case when it comes to availability and recoverability. First of all, an edge location is usually tied to business hours (e.g., retail stores and offices). Additionally, the cost of the edge location is expected to be low. As a result, the recovery point objective can still be considered critical (sales transactions), but the recovery time objective is not as critical as it is for a normal data center, especially when it can be run on the edge.

Consider the following outage scenarios:

Edge site disconnected from main site

It is expected that the application running at the edge location should continue to run even if the connectivity with the main data center is lost; however, management of the site might not be possible.

In our example with hub and spoke topology, there won’t be any impact on the application if the edge site is disconnected from the main site. However, deploying a new application or performing edge cluster lifecycle management operations (such as upgrades) at scale is not possible while the site is disconnected.

Local hardware failure

It is expected that the application running at the edge location should continue to run even if there are hardware failures, such as host or disk failure. In our example, we can achieve this by running on a two-host architecture with shared storage (e.g., vSAN). This allows for the loss of one of the nodes.

Loss of site

If the site is irremediably lost (from an IT standpoint), once the hardware and the base infrastructure have been rebuilt or recovered, it is expected that the cloud native stack can be rebuilt in a matter of hours.


Upgradability

Refresh cycles of software and hardware

The Tanzu Kubernetes Grid management cluster has to be upgraded to the new version, and Tanzu Kubernetes Grid OVA templates must be imported at the edge location. This can be performed at any time since it doesn't impact running Tanzu Kubernetes Grid workload clusters. When the time is right, each Tanzu Kubernetes Grid workload cluster at each edge location can be independently upgraded.

tanzu cluster upgrade tkg-edge33-wld01 --tkr v1.20.4---vmware.1-tkg.1 --yes --namespace edge33

Rolling upgrades are available for the platform at the expense of one extra virtual machine (VM) node, which is deleted once the upgrade is completed.
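
Because each workload cluster can be upgraded independently, the per-site command lends itself to scripting. The sketch below only builds and prints the commands and does not run them. It assumes that every site follows the edge33 naming pattern for cluster name and namespace, which the source does not state, and it writes the confirmation flag with two hyphens (`--yes`). The function names are hypothetical.

```python
import shlex
from typing import List, Sequence


def upgrade_command(site: str, tkr: str) -> List[str]:
    """Build the argument list that upgrades one edge workload cluster."""
    # Assumption: cluster name and namespace are derived from the site identifier.
    return ["tanzu", "cluster", "upgrade", f"tkg-{site}-wld01",
            "--tkr", tkr, "--yes", "--namespace", site]


def plan_upgrades(sites: Sequence[str], tkr: str) -> List[List[str]]:
    """Return one upgrade command per site, in the order the sites are given."""
    return [upgrade_command(site, tkr) for site in sites]


# Dry run for a hypothetical pair of sites: print the commands instead of executing them.
for argv in plan_upgrades(["edge33", "edge34"], "v1.20.4---vmware.1-tkg.1"):
    print(shlex.join(argv))
```

Listing canary sites first in `sites` keeps the upgrade order consistent with the canary-first flow used for application releases.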


Scalability

In the centralized deployment model, oversized data centers are built for an expected IT load, typically with the rare, worst-case load in mind. Mini edge sites can easily be “stepped and repeated” to accommodate growth as the need for more compute arises. Once a site is fully utilized, another site is deployed in the same facility or even in a different facility, depending on the available electrical, space, and bandwidth capacity. Their standardized, prefabricated nature, along with smaller resource increments, is fundamentally what makes them a highly scalable solution compared to traditional “stick-built,” purpose-built data centers.

For example, in the above use case, as workloads are added to a Tanzu Kubernetes Grid workload cluster, the cluster might need to be scaled horizontally by adding more nodes, which can be done using the VMware Tanzu CLI.

tanzu cluster scale tkg-edge33-wld01 --worker-machine-count 4 --namespace edge33

It is also possible to change the size of the existing nodes by performing vertical scaling, but it requires updating the infrastructure machine template.
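
Choosing the worker count for horizontal scaling is a small capacity calculation. The sketch below derives it from the resources that workloads request and the size of one worker node, keeping some headroom free. All numbers, the headroom fraction, and the function name are illustrative assumptions.

```python
import math


def workers_needed(requested_cpu_cores: float, requested_memory_gib: float,
                   node_cpu_cores: float, node_memory_gib: float,
                   headroom: float = 0.25) -> int:
    """Return the worker count that fits the requests while keeping headroom free."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    if node_cpu_cores <= 0 or node_memory_gib <= 0:
        raise ValueError("node capacity must be positive")
    usable_cpu = node_cpu_cores * (1.0 - headroom)
    usable_memory = node_memory_gib * (1.0 - headroom)
    by_cpu = math.ceil(requested_cpu_cores / usable_cpu)
    by_memory = math.ceil(requested_memory_gib / usable_memory)
    # The scarcer resource decides; always keep at least one worker.
    return max(1, by_cpu, by_memory)


# Hypothetical site: workloads request 10 cores and 40 GiB; workers have 4 cores and 16 GiB.
count = workers_needed(10, 40, node_cpu_cores=4, node_memory_gib=16)
print(f"worker machine count: {count}")
```

The result is the value that would be passed to `--worker-machine-count`. When a single workload does not fit on one node at all, vertical scaling is the remaining option.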

The SRE principles as understood for data center deployments can be adapted and adopted in edge scenarios as well. Tanzu Kubernetes Grid and related solution offerings provide multiple capabilities that can play a critical role in carrying out SRE functions in edge scenarios.
