(This blog is the third installment of a four-part series)
Kubernetes can automatically provision “remote persistent” volumes with random names
Several types of storage volumes have built-in Kubernetes storage classes that enable dynamic provisioning, creating remote persistent volumes as needed when a container is spun up for the first time. This kind of provisioning is useful when the cluster lifetime is bounded, such as in a development cluster. In such an environment, containers can come and go, and a re-created container will remount any persistent disk previously created for the same container, as long as the cluster lives. The dynamically generated volumes have names chosen by the underlying storage class, typically a random string.
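For example, a claim like the following is enough to trigger dynamic provisioning (the claim name and the storage class name "standard" are illustrative; the default class varies by platform). Kubernetes responds by creating a PersistentVolume whose name is typically "pvc-" followed by a random identifier:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dynamic-claim # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard # an existing dynamic storage class; "standard" is illustrative
  resources:
    requests:
      storage: 10Gi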
Automatically provisioned storage is, in general, also deleted by Kubernetes when the corresponding PersistentVolumeClaim is deleted, such as when the cluster itself is deleted. In other words, the default reclaim policy for dynamically provisioned volumes is Delete, instructing Kubernetes to remove the volume once it is no longer claimed. This can be changed in a storage specification.
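As a sketch of changing that policy in a storage specification, a StorageClass can set reclaimPolicy to Retain so that volumes it provisions survive the deletion of their claims (the class name is illustrative, and the provisioner shown assumes GCE persistent disks):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-ssd # illustrative name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
reclaimPolicy: Retain # keep the underlying disk even after its claim is deleted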
Even when retained, these volumes would be hard to track and remount in a new cluster because their names are typically generated as a random string.
StatefulSet offers a predictable, automated naming pattern with a default “retain” policy
StatefulSets offer a volumeClaimTemplates attribute that controls the name of any generated volume. Because the PersistentVolumeClaims created from these templates are retained by default, even when the pods that used them are deleted or rescheduled, StatefulSets can easily generate volumes with well-defined names that persist past the lifetime of the initial set.
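A minimal sketch of such a template follows, reusing the my-app naming from the example later in this post; the image, mount path, and sizes are assumptions, and a headless Service named my-app is presumed to exist. Each replica gets a claim with a predictable name of the form <template-name>-<statefulset-name>-<ordinal>, for example pgdata-my-app-0:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app # headless Service assumed to exist
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: gpdb
        image: gcr.io/my-project/my-image
        volumeMounts:
        - name: pgdata
          mountPath: /data # mount path is illustrative
  volumeClaimTemplates:
  - metadata:
      name: pgdata # yields claims named pgdata-my-app-0, pgdata-my-app-1, ...
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi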
Reaching outside of Kubernetes to create volumes
In production, a typical deployment strategy requires that storage for long-lived data continues to persist no matter what happens with any cluster using that data (assuming no process explicitly deletes the data). It is best to provision with names that provide easy tracking and content identification. This can be done within a given IaaS platform and provided to the Kubernetes cluster, to be mounted by name.
For example, in Google Cloud Platform, a command like
gcloud compute disks create --size=20GB my-sample-volume-for-content-xyz
will create a volume with a given name. This volume will continue to exist for as long as the GCE account specifies. One way to access this volume within Kubernetes is to refer to the volume by name, such as with a pod yaml like:
apiVersion: v1
kind: Pod
metadata:
  name: my-host
  labels:
    app: my-app
spec:
  hostname: my-host
  containers:
  - name: gpdb
    image: gcr.io/my-project/my-image
    volumeMounts:
    - name: pgdata
      mountPath: /data # mount path is illustrative
  volumes:
  - name: pgdata
    persistentVolumeClaim:
      claimName: my-claim-gce
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim-gce
  labels:
    app: my-app
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "" # the storageClassName has to be specified but can be empty
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      app: my-app
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-host-pv
  labels:
    app: my-app
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-sample-volume-for-content-xyz # this is the linkage with a pre-created volume
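Applying these manifests with kubectl binds the claim to the pre-created disk by label and starts the pod (the filename is illustrative):
kubectl apply -f my-app-with-precreated-disk.yaml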
Local Persistent Volumes may offer performance gains, at the cost of complexity
Kubernetes 1.10 added, as a beta feature, access to Local Persistent Volumes.
Particularly in “raw block” mode, local persistent volumes can offer a significant performance gain, but at the cost of deployment challenges. If a stateful app’s performance depends on storage throughput, this trade-off may be worth investigating. For example, Salesforce has described their preference for local persistent volumes.
Local persistent volumes are, by definition, local to the nodes on which they are physically attached and mounted. This contrasts with remote persistent volumes, where Kubernetes presents a mounted volume to the container but the network layer mediates the communication between container and storage. In other words, a remote persistent volume can easily be remounted on another node, while a local persistent volume cannot. Therefore, when stateful data already resides on a local persistent volume, a stateful app must help Kubernetes schedule the appropriate container onto the node that holds the appropriate local data. Managing this topology is much more complex and much less flexible than using remote persistent volumes, where any node can generally mount any remote volume.
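As a rough sketch, a local persistent volume is declared against a specific device or path on a specific node; the node name, device path, and storage class below are assumptions (the "local-storage" class would typically use volumeBindingMode: WaitForFirstConsumer). Note the required nodeAffinity, which encodes exactly the topology constraint described above:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1 # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage # assumed class for local volumes
  volumeMode: Block # "raw block" mode; use Filesystem for a formatted, mounted volume
  local:
    path: /dev/sdb # illustrative raw device attached to the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1 # illustrative node name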
Rescheduling containers onto the nodes where their data already resides
Kubernetes has some built-in affinity when replacing a container within an existing deployment. Remote persistent volumes that were mounted when a container was initially launched will generally be matched and remounted when the container is recreated, on any node, as long as the original deployment is still in effect.
However, when a wholesale change happens, such as when a Kubernetes cluster is wiped and a new one is recreated, how can an app find any existing data, particularly in light of local volumes that cannot be moved?
One strategy is to use DaemonSets to investigate all nodes and attach labels that will help Kubernetes assign containers to an appropriate location.
In other words, the steps include:
- A short-lived daemon runs on each node, perhaps as a privileged container, investigating any storage found (particularly local), mounting, initializing and validating as necessary, and finally labeling the node appropriately
- The stateful app’s orchestration (e.g., an operator) adds selectors to container specifications, as sketched below, to ensure each stateful container will be scheduled on a node that matches its storage expectation
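As a sketch of that second step, suppose the daemon has labeled a node with a hypothetical label such as disk-content=content-xyz (for example, via kubectl label node <node-name> disk-content=content-xyz). The orchestration can then pin the matching container to that node with a nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: my-host
  labels:
    app: my-app
spec:
  nodeSelector:
    disk-content: content-xyz # hypothetical label applied by the inspection daemon
  containers:
  - name: gpdb
    image: gcr.io/my-project/my-image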
This kind of deployment might fail if there is a gap in the storage, such as a local volume gone missing. At such times, manual intervention may be necessary.