This article will guide you through the best practices to deploy and distribute workloads in a multi-cloud Kubernetes environment on Scaleway's Kosmos.
It follows the first part of the Hands-On prepared for Devoxx Poland 2021, "Best practices to configure a multi-cloud Kubernetes cluster", which we wanted to make available to everyone.
⚠️ Warning reminder
This article will balance between concept explanations and operations or commands that need to be performed by the reader.
If this icon (🔥) is present before an image, a command, or a file, you are required to perform an action.
So remember, when 🔥 is on, so are you!
Redundancy
🔥 Labels
First, we are going to list our nodes and, more specifically, their associated labels. The kubectl get nodes --show-labels command will perform this action for us.
🔥 kubectl get nodes --show-labels --no-headers | awk '{print "NODE NAME: "$1","$6"\n"}' | tr "," "\n"
Output
NODE NAME: scw-kosmos-kosmos-scw-09371579edf54552b0187a95
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=DEV1-M
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=nl-ams
failure-domain.beta.kubernetes.io/zone=nl-ams-1
k8s.scaleway.com/kapsule=b58ad1f6-2a4d-4c0b-8573-459fad62682f
k8s.scaleway.com/managed=true
k8s.scaleway.com/node=09371579-edf5-4552-b018-7a95e779b70e
k8s.scaleway.com/pool-name=kosmos-scw
k8s.scaleway.com/pool=313ccb19-0233-4dc9-b582-b1e687903b7a
k8s.scaleway.com/runtime=containerd
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-kosmos-scw-09371579edf54552b0187a95
kubernetes.io/os=linux
node.kubernetes.io/instance-type=DEV1-M
topology.csi.scaleway.com/zone=nl-ams-1
topology.kubernetes.io/region=nl-ams
topology.kubernetes.io/zone=nl-ams-1

NODE NAME: scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=151.115.36.196
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
kubernetes.io/os=linux
topology.kubernetes.io/region=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6

NODE NAME: scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=65.21.146.191
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
kubernetes.io/os=linux
topology.kubernetes.io/region=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
For each of our three nodes, we see many labels. The first node on the list has considerably more labels as it is managed by a Kubernetes Kosmos engine. In this case, more information about features and node management is added.
🔥 Adding labels to distinguish Cloud providers
As it might not be easy to remember which node comes from which provider, and as it can help us distribute our workload across providers, we are going to label our nodes with a label called provider, with values such as scaleway or hetzner.
kubectl label nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 provider=scaleway
kubectl label nodes scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6 provider=scaleway
kubectl label nodes scw-kosmos-worldwide-b2db708b0c474decb7447e0d6 provider=hetzner
In addition, we are also going to add a label to our unmanaged Scaleway node to specify that it is, in fact, not managed by the engine. For that we use the same label used on the managed Scaleway node, but set to false: k8s.scaleway.com/managed=false.
kubectl label nodes scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6 k8s.scaleway.com/managed=false
🔥 Listing our labels
Let's list our labels to ensure that the provider label is well set on our three nodes.
🔥 kubectl get nodes --show-labels --no-headers | awk '{print "NODE NAME: "$1","$6"\n"}' | tr "," "\n"
Output
NODE NAME: scw-kosmos-kosmos-scw-09371579edf54552b0187a95
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=DEV1-M
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=nl-ams
failure-domain.beta.kubernetes.io/zone=nl-ams-1
k8s.scaleway.com/kapsule=b58ad1f6-2a4d-4c0b-8573-459fad62682f
k8s.scaleway.com/managed=true
k8s.scaleway.com/node=09371579-edf5-4552-b018-7a95e779b70e
k8s.scaleway.com/pool-name=kosmos-scw
k8s.scaleway.com/pool=313ccb19-0233-4dc9-b582-b1e687903b7a
k8s.scaleway.com/runtime=containerd
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-kosmos-scw-09371579edf54552b0187a95
kubernetes.io/os=linux
node.kubernetes.io/instance-type=DEV1-M
provider=scaleway
topology.csi.scaleway.com/zone=nl-ams-1
topology.kubernetes.io/region=nl-ams
topology.kubernetes.io/zone=nl-ams-1

NODE NAME: scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scaleway.com/managed=false
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=151.115.36.196
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
kubernetes.io/os=linux
provider=scaleway
topology.kubernetes.io/region=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6

NODE NAME: scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=65.21.146.191
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
kubernetes.io/os=linux
provider=hetzner
topology.kubernetes.io/region=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
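As a side note, kubectl can also display a specific label as its own column. Assuming the provider label has been applied as above, the following optional command gives a more compact view than the full label dump:
kubectl get nodes -L provider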
Deployment and observation: What happens in a Multi-Cloud cluster?
🔥 A first very simple deployment
To better understand the behavior of a Multi-Cloud Kubernetes cluster, we are going to create a very simple deployment using kubectl
. This deployment will run three replicas of the busybox
image, each of which will print the date every ten seconds.
kubectl create deploy first-deployment --replicas=3 --image=busybox -- /bin/sh -c "while true; do date; sleep 10; done"
Once the deployment has been created, we can observe what is actually happening on our cluster.
kubectl get all
Output
NAME                                    READY   STATUS    RESTARTS   AGE
pod/first-deployment-695f579bd4-cfg6l   1/1     Running   0          8s
pod/first-deployment-695f579bd4-jzft8   1/1     Running   0          8s
pod/first-deployment-695f579bd4-rt5jt   1/1     Running   0          8s

NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.32.0.1    <none>        443/TCP   53m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/first-deployment   3/3     3            3           8s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/first-deployment-695f579bd4   3         3         3       8s
Our first observation is that our deployment object is here, along with the three pods (replicas) we asked for. We can also observe that another "unexpected" object was created: a replicaset. The replicaset is an intermediary object, created by the deployment, in charge of maintaining and monitoring the replicas.
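If you want to inspect this intermediary object yourself, you can list the replicasets directly; this optional command is not part of the original workshop, but the replicaset name should match the prefix of the pod names above:
kubectl get replicasets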
Now, let's have a quick look inside one of our pods to see if it performs normally.
🔥 kubectl logs pod/first-deployment-695f579bd4-cfg6l
Output
Mon Sep 6 08:41:01 UTC 2021
Mon Sep 6 08:41:11 UTC 2021
Mon Sep 6 08:41:21 UTC 2021
Mon Sep 6 08:41:31 UTC 2021
We can see that our pod is writing the date every ten seconds, which is exactly what we asked it to do.
Now, the real question is: where are these pods running? We can use kubectl get pods with a custom column to display the name of the node each pod actually runs on.
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
Output
NAME                                NODE
first-deployment-695f579bd4-cfg6l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-jzft8   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rt5jt   scw-kosmos-kosmos-scw-0937
When listing our three pods and their location, it seems that they all run on the same managed Scaleway node (the one located in Amsterdam). That's unfortunate... Let's see if we can act on this behavior.
🔥 Scaling up
The first thing we can try is to scale up our deployment and see where all our new replicas will be scheduled.
🔥 kubectl scale deployment first-deployment --replicas=15
The scaling has been applied; we can now list our pods again.
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
Output
NAME                                NODE
first-deployment-695f579bd4-5jq9q   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5t6tw   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5twcj   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5xljr   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-8phq5   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-cfg6l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-jzft8   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-nf9fg   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-nsxb6   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-ptlkp   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rgdqj   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rt5jt   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-vrl95   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-vwv7l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-w9qqq   scw-kosmos-kosmos-scw-0937
And they are still all running on the same node.
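Rather than reading the full list, a quick way to count how many pods run on each node is a small pipeline in the same spirit as the one used later in this Hands-On (optional; the seventh column of kubectl get pods -o wide is the node name):
kubectl get pods -o wide --no-headers | awk '{print $7}' | sort | uniq -c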
To go further, we are going to play with more complex configurations. In order to do so without getting mixed up with our other configurations, deployments and pods, it is best to clean our environment and delete our deployment.
🔥 kubectl delete deployment first-deployment
Output
deployment.apps "first-deployment" deleted
Yaml files
Kubectl commands are nice, but when it comes to managing multiple Kubernetes objects, configuration files are a better and more reliable fit. In Kubernetes, configurations are written in yaml format, always following a pattern similar to the one below:
#example.yaml
---
apiVersion: v1 # version of the k8s api
kind: Pod # type of the Kubernetes object we aim to describe
metadata: # additional options such as the object name, labels, annotations
  …
spec: # parameters and options of the k8s object to create
  …
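Before applying such a file to the cluster, it can be useful to validate it first. A client-side dry run (optional, assuming the file is named example.yaml) catches most syntax mistakes without creating anything:
kubectl apply -f example.yaml --dry-run=client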
Selecting where to run our pods
In Kubernetes, there are different options available to distribute our workload across nodes, namespaces, or, depending on affinity, between pods. Working in a Multi-Cloud Kubernetes environment makes their usage mandatory, and knowing them and their behavior can rapidly become crucial.
🔥 NodeSelector
A node selector is applied on a pod and matches labels that exist on the cluster nodes. The command below gives us all the information about a given node, including labels, annotations, running pods, etc.
🔥 kubectl describe node scw-kosmos-kosmos-scw-09371579edf54552b0187a95
Output
Name:               scw-kosmos-kosmos-scw-09371579edf54552b0187a95
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=DEV1-M
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=nl-ams
                    failure-domain.beta.kubernetes.io/zone=nl-ams-1
                    k8s.scaleway.com/kapsule=b58a[...]
                    k8s.scaleway.com/managed=true
                    k8s.scaleway.com/node=0937[...]
                    k8s.scaleway.com/pool=313c[...]
                    k8s.scaleway.com/pool-name=kosmos-scw
                    k8s.scaleway.com/runtime=containerd
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=scw-kosmos-kosmos-scw-0937[...]
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=DEV1-M
                    provider=scaleway
                    topology.csi.scaleway.com/zone=nl-ams-1
                    topology.kubernetes.io/region=nl-ams
                    topology.kubernetes.io/zone=nl-ams-1
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.scaleway.com":"[...]"}
                    kilo.squat.ai/discovered-endpoints: {}
                    kilo.squat.ai/endpoint: 51.15.123.156:51820
                    kilo.squat.ai/force-endpoint: 51.15.123.156:51820
                    kilo.squat.ai/granularity: location
                    kilo.squat.ai/internal-ip: 10.67.36.37/31
                    kilo.squat.ai/key: cSP2[...]
                    kilo.squat.ai/last-seen: 1630917821
                    kilo.squat.ai/wireguard-ip: 10.4.0.1/16
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 06 Sep 2021 09:50:02 +0200
Taints:             <none>
[...]
Parts in brackets [...] are truncations of the output for better readability.
A node selector can be applied on a pod using an existing label, or a new label created by the Kubernetes user, such as the labels we previously added on our nodes.
This is a sample yaml file using a node selector in a pod:
#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    provider: scaleway
The node selector ensures that the defined pod will only be scheduled on a node matching the condition. In this example, the nginx pod will only be scheduled on nodes with the label provider=scaleway.
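As an optional sanity check, we can list the nodes matching the same selector to see in advance where such a pod could land:
kubectl get nodes -l provider=scaleway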
NodeAffinity
Node affinity also matches labels existing on nodes, but provides more flexibility and options in the rules that can be applied.
First of all, node affinity accepts two different policies:
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
As their names are self-explanatory, we can easily understand that with the "preferred" policy, Kubernetes might still schedule pods on nodes that do not match the conditions. It allows the definition of preferences for pod scheduling instead of mandatory criteria.
The file below is an example of a requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution configuration.
#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: provider
            operator: In
            values:
            - scaleway
            - hetzner
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - nl-ams
  containers:
  - name: nginx
    image: nginx
In this example, the pod is required to run on a node with a provider=scaleway or provider=hetzner label, and should preferably be scheduled on a node with the label topology.kubernetes.io/region=nl-ams.
PodAffinity
The pod affinity constraint is applied on pods based on other pods' labels. It benefits from the same two policies as node affinity:
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
The difference with node affinity is that instead of defining rules for the cohabitation of pods on nodes, pod affinity defines rules between pods, such as "pod 1 should run on the same node as pod 2".
In the following sample file, we specify that an nginx pod must be scheduled within the same provider (the topologyKey) as a pod carrying the label app=one-per-provider.
#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - one-per-provider
        topologyKey: provider
  containers:
  - name: nginx
    image: nginx
PodAntiAffinity
In the same way Kubernetes allows us to define pod affinities, we can also define pod anti-affinities, thus defining preferences for pods not to cohabitate under certain conditions.
The same two policies are available:
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
In the following sample file, we define that nginx pods should preferably not be scheduled in a topology.kubernetes.io/zone that already hosts a pod carrying the security=S1 label.
#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S1
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: nginx
    image: nginx
🔥 Deploy & see: Spread our deployment across different providers
We are going to try out deploying an application across our two providers: Scaleway and Hetzner.
🔥 Let's create this antiaffinity.yaml configuration file to create a deployment where each pod will be deployed on nodes with a different provider label, and will not cohabitate with pods that have the app=one-per-provider label.
🔥
#antiaffinity.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: one-per-provider-deploy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: one-per-provider
  template:
    metadata:
      labels:
        app: one-per-provider
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - one-per-provider
            topologyKey: provider
      containers:
      - name: busytime
        image: busybox
        command: ["/bin/sh","-c","while true; do date; sleep 10; done"]
🔥 Apply the deployment configuration on our cluster using the following command.
🔥 kubectl apply -f antiaffinity.yaml
Output
deployment.apps/one-per-provider-deploy created
And observe the pods that were generated and the nodes they run on.
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
Output
NAME                                       NODE
one-per-provider-deploy-75945bb589-6jb87   scw-kosmos-worldwide-b2db
one-per-provider-deploy-75945bb589-db25x   scw-kosmos-kosmos-scw-0937
By looking at the names of the different nodes, we can see the pool name in the middle, informing us that our two pods are deployed on different pools, and thus on different nodes.
Since one of the nodes in our "worldwide" pool is a Scaleway node, being on different pools does not guarantee being on different providers. Let's make sure that both our pods don't actually run on Scaleway, by fetching the provider label we set on each node at the beginning of the workshop.
🔥 kubectl get nodes scw-kosmos-worldwide-b2db708b0c474decb7447e0d6 -o custom-columns=NAME:.metadata.name,PROVIDER:.metadata.labels.provider
Output
NAME                                             PROVIDER
scw-kosmos-worldwide-b2db708b0c474decb7447e0d6   hetzner
🔥 kubectl get nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 -o custom-columns=NAME:.metadata.name,PROVIDER:.metadata.labels.provider
Output
NAME                                             PROVIDER
scw-kosmos-kosmos-scw-09371579edf54552b0187a95   scaleway
When we ask for the provider label of our two nodes, we can confirm that our two pods from the deployment (that have the same app=one-per-provider label) were scheduled on different providers.
🔥 Scaling up
Our previous deployment defined two replicas, where each generated pod is labeled app=one-per-provider. These pods should be scheduled on nodes with different values for their provider label (the topologyKey field).
As our cluster has only two different providers, and all our pods were created within the one-per-provider-deploy deployment, scaling up the deployment should not result in the scheduling of a third pod.
So let's try it by adding only one more replica.
🔥 kubectl scale deployment one-per-provider-deploy --replicas=3
Output
deployment.apps/one-per-provider-deploy scaled
Once the deployment has scaled up, we can list our pods and see what happened.
🔥 kubectl get pods
Output
NAME                                 READY   STATUS    RESTARTS   AGE
one-per-provider-deploy-7594-29wr7   0/1     Pending   0          7s
one-per-provider-deploy-7594-6jb87   1/1     Running   0          12m
one-per-provider-deploy-7594-db25x   1/1     Running   0          12m
The third pod is present in our cluster, as it was required by the deployment's replicaset, but we can see that it is stuck in a Pending state, meaning the pod could not find a node to be scheduled on.
The reason behind this behavior is that the pod is waiting for a node matching all of its pod anti-affinity constraints, and there are currently no nodes from a third Cloud provider in our cluster. Until such a node is added to the cluster, our third pod will remain unavailable and in a Pending state.
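If you want the scheduler's own explanation, describing the pending pod shows a FailedScheduling event in its Events section. One optional way to target it without copying the pod name (assuming it is the only Pending pod in the namespace) is:
kubectl describe pod $(kubectl get pods --field-selector=status.phase=Pending -o name)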
Taints
A taint is a Kubernetes concept used to block pods from running on certain nodes.
The principle is to define key/value pairs completed by an effect (i.e. a taint policy). There are three possibilities:
- NoSchedule: Forbids pod scheduling on the node but allows execution.
- PreferNoSchedule: Allows execution, and avoids scheduling pods on the node, except if no other node can match the pod's requirements.
- NoExecute: Forbids pod execution on the node, resulting in the eviction of unauthorized running pods.
Example
user@local:~$ kubectl taint nodes tainted-node key1=value1:NoSchedule
In this example, a taint is applied to the node named tainted-node, and set with the effect (i.e. constraint or policy) NoSchedule.
This means that no pod has permission to be scheduled on tainted-node, except for pods with a specific authorization to do so. These authorizations are called tolerations and are covered in the next section of this article.
If a pod without the corresponding toleration is already running on tainted-node at the time the taint is added (i.e. when the kubectl command above is executed), the pod will not be evicted and will keep running on this node.
However, if the constraint was set to NoExecute, any pod without the corresponding toleration would not be allowed to run on tainted-node, resulting in its eviction.
Tolerations
As stated before, tolerations are applied on pods to allow exceptions on tainted nodes. Two operators are available:
- Equal: matches a node taint with the same key and value.
- Exists: matches the existence of a node taint with the same key, regardless of its value.
In this example, the pod named busybox is granted permission to be scheduled on a tainted node with two taints:
- key1=value1:NoSchedule
- key2 with a NoSchedule effect, regardless of the value attributed to the taint.
# example-equal.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["/bin/sh","-c","sleep 3600"]
  tolerations:
  - key: key1
    operator: Equal
    value: "value1"
    effect: NoSchedule
  - key: key2
    operator: Exists
    effect: NoSchedule
Taints and tolerations can converge to define very specific behaviors for the scheduling and execution of pods.
Forbidding execution
To experiment with taints, we are going to taint our managed Scaleway node with autoscale=true:NoSchedule.
As this node is part of a managed pool of our cluster, it benefits from the auto-scaling feature. We want to grant permission only to pods that are configured to run on an auto-scalable pool.
We also want to exclude (i.e. evict) from this node all running pods which do not have the toleration.
Let's have a look at our cluster status by listing our pods and the nodes they run on.
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
Output
NAME                                 NODE
one-per-provider-deploy-7594-29wr7   <none>
one-per-provider-deploy-7594-6jb87   scw-kosmos-worldwide-b2db
one-per-provider-deploy-7594-db25x   scw-kosmos-kosmos-scw-0937
We still have the same three pods, two of which are running, and one pending (as no node is attributed to it).
Our managed Scaleway pool has auto-scaling activated, using the preset label autoscale=true. We are, therefore, going to use the same label to forbid scheduling on this specific node.
🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoSchedule
Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 tainted
Let's see below what happened on our cluster after applying the taint:
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Output
NAME                               NODE                         STATUS
one-per-provider-deploy-75-29wr7   <none>                       Pending
one-per-provider-deploy-75-6jb87   scw-kosmos-worldwide-b2db    Running
one-per-provider-deploy-75-db25x   scw-kosmos-kosmos-scw-0937   Running
The state of our cluster has not changed. The reason is that the taint we added concerns scheduling, and our pods were already scheduled on our nodes at the time the taint was added.
Also, our taint forbids scheduling, but it does not forbid the execution of pods already running.
Now, let's add a new taint to the same node. This time, however, we will set its effect to NoExecute.
🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoExecute
Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 tainted
Once this new taint is applied, we want to observe the behavior of our pods.
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Output
NAME                               NODE                        STATUS
one-per-provider-deploy-75-29wr7   <none>                      Pending
one-per-provider-deploy-75-6jb87   scw-kosmos-worldwide-b2db   Running
one-per-provider-deploy-75-vjxkq   scw-kosmos-worldwide-5ecd   Running
If we look closely at the node column of this output, we can see that node scw-kosmos-kosmos-scw-0937 no longer has a pod running on it. This pod was evicted when the taint with the NoExecute effect was applied.
To maintain the stability of our cluster, the replicaset of our one-per-provider-deploy deployment rescheduled a new pod on a node without the incompatible taints.
A new pod was created in a Pending state while the evicted pod was in a Terminating status. It was then scheduled on a node matching all the constraints: a node without the autoscale taints, and, because of the pod anti-affinity on the provider label, a node from a different provider than the one hosting the other running pod.
Moving on to tolerations, we will create a pod with a toleration that allows it to be scheduled on our managed Scaleway node, selected via its location label topology.kubernetes.io/region=nl-ams (this label was set directly by the Scaleway Kubernetes engine during the managed pool creation).
🔥
#toleration.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: busytolerant
spec:
  containers:
  - name: busytolerant
    image: busybox
    command: ["/bin/sh","-c","sleep 3600"]
  tolerations:
  - key: autoscale
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: autoscale
    operator: Equal
    value: "true"
    effect: NoExecute
  nodeSelector:
    topology.kubernetes.io/region: "nl-ams"
This yaml file defines a pod able to run on a node with the following conditions:
- the node has the taint autoscale=true:NoSchedule
- the node has the taint autoscale=true:NoExecute
- the node has the label topology.kubernetes.io/region=nl-ams
Let's apply this configuration to our cluster and observe what happens.
🔥 kubectl apply -f toleration.yaml
Output
pod/busytolerant created
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Output
NAME                               NODE                         STATUS
busytolerant                       scw-kosmos-kosmos-scw-0937   Running
one-per-provider-deploy-75-29wr7   <none>                       Pending
one-per-provider-deploy-75-6jb87   scw-kosmos-worldwide-b2db    Running
one-per-provider-deploy-75-vjxkq   scw-kosmos-worldwide-5ecd    Running
We can see that with the right tolerations, the pod named busytolerant was perfectly able to be scheduled and executed on the node we tainted previously.
The addition of the constraint on the region label is just a way to show how all the workload distribution features Kubernetes offers are cumulative.
🔥 Removing the taints before moving forward
To avoid scheduling issues while moving forward in this Hands-On, it is best to remove the taints applied on our node. The command to do so is the same as the one used to add the taint, with just the addition of a - (dash) character at the end of the taint declaration.
🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoSchedule-
Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 untainted
🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoExecute-
Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 untainted
We can observe that removing the taints did not have an effect on the pods running in our cluster. This happens because tolerations are authorization rules and not forbidding instructions (as opposed to taints).
🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Output
NAME                               NODE                         STATUS
busytolerant                       scw-kosmos-kosmos-scw-0937   Running
one-per-provider-deploy-75-29wr7   <none>                       Pending
one-per-provider-deploy-75-6jb87   scw-kosmos-worldwide-b2db    Running
one-per-provider-deploy-75-vjxkq   scw-kosmos-worldwide-5ecd    Running
Let's keep cleaning our environment and remove our busytolerant pod and our one-per-provider-deploy deployment one by one.
🔥 kubectl delete pods busytolerant
Output
pod "busytolerant" deleted
🔥 kubectl delete deployment one-per-provider-deploy
Output
deployment.apps "one-per-provider-deploy" deleted
🔥 kubectl get all
Output
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.32.0.1    <none>        443/TCP   107m
PodTopologySpread constraint
The pod topology spread constraint aims to evenly distribute pods across nodes based on specific rules and constraints.
It allows us to set a maximum difference in the number of similar pods between nodes or groups of nodes (the maxSkew parameter) and to determine the action that should be performed if the constraint cannot be met:
- DoNotSchedule: hard constraint, the pod cannot be scheduled.
- ScheduleAnyway: soft constraint, the pod can be scheduled even if the conditions are not matched.
The sample file below shows the type of configuration used to apply a topologySpreadConstraint on pods created from a deployment.
# example.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busy-topologyspread
spec:
  replicas: 10
  selector:
    matchLabels:
      app: busybox-acrossproviders
  template:
    metadata:
      labels:
        app: busybox-acrossproviders
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: provider
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: busybox-acrossproviders
      containers:
      - name: busybox-everywhere
        image: busybox
        command: ["/bin/sh","-c","sleep 3600"]
🔥 Distributing our workload
The topology spread constraint is specifically useful to spread the workload of one or multiple applications evenly throughout a Kubernetes cluster.
🔥 We are going to define a spread.yaml file to set up a deployment with ten replicas, which should be scheduled evenly between the nodes labeled provider=scaleway and those labeled provider=hetzner.
We authorize a difference of only one pod between the two providers, using the maxSkew parameter:
🔥
#spread.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busyspread
spec:
  replicas: 10
  selector:
    matchLabels:
      app: busyspread-providers
  template:
    metadata:
      labels:
        app: busyspread-providers
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: provider
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: busyspread-providers
      containers:
      - name: busyspread
        image: busybox
        command: ["/bin/sh","-c","sleep 3600"]
🔥 Let's apply this deployment.
🔥 kubectl apply -f spread.yaml
Output
deployment.apps/busyspread created
To see the distribution of our pods across the nodes of our cluster, we are going to list our busyspread pods and the nodes they run on, and count the number of occurrences.
🔥 kubectl get pods -o wide --no-headers | grep busyspread | awk '{print $7}' | sort | uniq -c
Output
2 scw-kosmos-kosmos-scw-09371579edf54552b0187a95
3 scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
5 scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
Knowing that the first two nodes in this list have the label provider=scaleway and the third one has the label provider=hetzner, we indeed have an even distribution of our workload across our providers, with five pods for each of them.
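If you prefer to count pods directly per provider rather than per node, one possible (optional) pipeline resolves each node name to its provider label before counting:
kubectl get pods -o wide --no-headers | grep busyspread | awk '{print $7}' | xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.provider}{"\n"}' | sort | uniq -c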
The next step of this Hands-On will be to set up Load Balancing and Storage management within a Multi-Cloud Kubernetes cluster.
🔥 In order to avoid getting mixed up in all our pods and deployments, we are going to clean our environment by deleting our busyspread deployment.
🔥 kubectl delete deployment busyspread
Output
deployment.apps "busyspread" deleted