This article will guide you through the best practices to deploy and distribute the workload on a multi-cloud Kubernetes environment on Scaleway's Kosmos.

It follows the first part of the Hands-On prepared for Devoxx Poland 2021: Best practices to configure a multi-cloud Kubernetes cluster that we wanted to make available for everyone.

⚠️ Warning reminder

This article will balance between concept explanations and operations or commands that need to be performed by the reader.

If this icon (🔥) is present before an image, a command, or a file, you are required to perform an action.

So remember, when 🔥 is on, so are you!


Redundancy

🔥 Labels

First, we are going to start by listing your nodes, and more specifically their associated labels. The kubectl get nodes --show-labels command will perform this action for us.

🔥 kubectl get nodes --show-labels --no-headers | awk '{print "NODE NAME: "$1","$6"\n"}' | tr "," "\n"

Output

NODE NAME: scw-kosmos-kosmos-scw-09371579edf54552b0187a95
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=DEV1-M
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=nl-ams
failure-domain.beta.kubernetes.io/zone=nl-ams-1
k8s.scaleway.com/kapsule=b58ad1f6-2a4d-4c0b-8573-459fad62682f
k8s.scaleway.com/managed=true
k8s.scaleway.com/node=09371579-edf5-4552-b018-7a95e779b70e
k8s.scaleway.com/pool-name=kosmos-scw
k8s.scaleway.com/pool=313ccb19-0233-4dc9-b582-b1e687903b7a
k8s.scaleway.com/runtime=containerd
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-kosmos-scw-09371579edf54552b0187a95
kubernetes.io/os=linux
node.kubernetes.io/instance-type=DEV1-M
topology.csi.scaleway.com/zone=nl-ams-1
topology.kubernetes.io/region=nl-ams
topology.kubernetes.io/zone=nl-ams-1

NODE NAME: scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=151.115.36.196
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
kubernetes.io/os=linux
topology.kubernetes.io/region=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6

NODE NAME: scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=65.21.146.191
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
kubernetes.io/os=linux
topology.kubernetes.io/region=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6

For each of our three nodes, we see many labels. The first node on the list has considerably more labels as it is managed by a Kubernetes Kosmos engine. In this case, more information about features and node management is added.

🔥 Adding labels to distinguish Cloud providers

As it might not be easy to remember which node comes from which provider, and as it can help us distribute our workload across providers, we are going to label our nodes with a label called provider with values such as scaleway or hetzner.

kubectl label nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 provider=scaleway

kubectl label nodes scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6 provider=scaleway

kubectl label nodes scw-kosmos-worldwide-b2db708b0c474decb7447e0d6 provider=hetzner

In addition, we are going to add a label to our unmanaged Scaleway node to specify that it is, in fact, not managed by the engine. For that, we use the same label as on the managed Scaleway node, but set to false: k8s.scaleway.com/managed=false.

kubectl label nodes scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6 k8s.scaleway.com/managed=false

🔥 Listing our labels

Let's list our labels to ensure that the provider label is correctly set on our three nodes.

🔥 kubectl get nodes --show-labels --no-headers | awk '{print "NODE NAME: "$1","$6"\n"}' | tr "," "\n"

Output

NODE NAME: scw-kosmos-kosmos-scw-09371579edf54552b0187a95
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=DEV1-M
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=nl-ams
failure-domain.beta.kubernetes.io/zone=nl-ams-1
k8s.scaleway.com/kapsule=b58ad1f6-2a4d-4c0b-8573-459fad62682f
k8s.scaleway.com/managed=true
k8s.scaleway.com/node=09371579-edf5-4552-b018-7a95e779b70e
k8s.scaleway.com/pool-name=kosmos-scw
k8s.scaleway.com/pool=313ccb19-0233-4dc9-b582-b1e687903b7a
k8s.scaleway.com/runtime=containerd
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-kosmos-scw-09371579edf54552b0187a95
kubernetes.io/os=linux
node.kubernetes.io/instance-type=DEV1-M
provider=scaleway
topology.csi.scaleway.com/zone=nl-ams-1
topology.kubernetes.io/region=nl-ams
topology.kubernetes.io/zone=nl-ams-1

NODE NAME: scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scaleway.com/managed=false
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=151.115.36.196
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
kubernetes.io/os=linux
provider=scaleway
topology.kubernetes.io/region=scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6

NODE NAME: scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
k8s.scw.cloud/disable-lifecycle=true
k8s.scw.cloud/node-public-ip=65.21.146.191
kubernetes.io/arch=amd64
kubernetes.io/hostname=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
kubernetes.io/os=linux
provider=hetzner
topology.kubernetes.io/region=scw-kosmos-worldwide-b2db708b0c474decb7447e0d6
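
When a cluster has more than a handful of nodes, checking such a listing by eye becomes tedious. As a minimal sketch, the check can be automated from the JSON form of the same listing (kubectl get nodes -o json), where each entry of items carries its labels under metadata.labels. The function name and the shortened sample data are assumptions made for illustration, not part of the workshop.

```python
import json


def nodes_missing_label(nodes_json, key):
    """Return the names of the nodes that do not carry the given label key."""
    items = json.loads(nodes_json)["items"]
    return [
        item["metadata"]["name"]
        for item in items
        if key not in item["metadata"].get("labels", {})
    ]


# Hypothetical, heavily shortened sample of `kubectl get nodes -o json` output,
# in which the third node has not been labeled yet.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "scw-kosmos-kosmos-scw-0937", "labels": {"provider": "scaleway"}}},
        {"metadata": {"name": "scw-kosmos-worldwide-5ecd", "labels": {"provider": "scaleway"}}},
        {"metadata": {"name": "scw-kosmos-worldwide-b2db", "labels": {}}},
    ]
})

print(nodes_missing_label(sample, "provider"))  # ['scw-kosmos-worldwide-b2db']
```

An empty list means that every node carries the label.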

Deployment and observation: What happens in a Multi-Cloud cluster?

🔥 A first very simple deployment

To better understand the behavior of a Multi-Cloud Kubernetes cluster, we are going to create a very simple deployment using kubectl. This deployment will run three replicas of the busybox image, each of which will print the date every ten seconds.

kubectl create deploy first-deployment --replicas=3 --image=busybox -- /bin/sh -c "while true; do date; sleep 10; done"

Once the deployment has been created, we can observe what is actually happening on our cluster.

kubectl get all

Output

NAME                                  READY  STATUS   RESTARTS  AGE
pod/first-deployment-695f579bd4-cfg6l  1/1    Running  0         8s
pod/first-deployment-695f579bd4-jzft8  1/1    Running  0         8s
pod/first-deployment-695f579bd4-rt5jt  1/1    Running  0         8s

NAME                TYPE       CLUSTER-IP  EXTERNAL-IP  PORT(S)  AGE
service/kubernetes  ClusterIP  10.32.0.1   <none>       443/TCP  53m

NAME                             READY  UP-TO-DATE  AVAILABLE  AGE
deployment.apps/first-deployment  3/3    3           3          8s

NAME                                       DESIRED CURRENT READY AGE
replicaset.apps/first-deployment-695f579bd4 3       3       3     8s

Our first observation is that our deployment object is here, along with the three pods (replicas) we asked for. We can also observe that another, "unexpected" object was created: a replicaset. The replicaset is an intermediary object created by the deployment, in charge of maintaining and monitoring the replicas.

Now, let's have a quick look inside one of our pods to see if it performs normally.

🔥 kubectl logs pod/first-deployment-695f579bd4-cfg6l

Output

Mon Sep  6 08:41:01 UTC 2021
Mon Sep  6 08:41:11 UTC 2021
Mon Sep  6 08:41:21 UTC 2021
Mon Sep  6 08:41:31 UTC 2021

We can see that our pod is writing the date every ten seconds, which is exactly what we asked it to do.

Now, the real question is, where are these pods running? We can use the kubectl get pods command to get the name of the node each pod actually runs on.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

Output

NAME                                NODE
first-deployment-695f579bd4-cfg6l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-jzft8   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rt5jt   scw-kosmos-kosmos-scw-0937

When listing our three pods and their location, it seems that they all run on the same managed Scaleway node (the one located in Amsterdam). That's unfortunate... Let's see if we can act on this behavior.

🔥 Scaling up

The first thing we can try is to scale up our deployment and see where all our new replicas will be scheduled.

🔥 kubectl scale deployment first-deployment --replicas=15

The scaling has been applied, so we can list our pods again.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

Output

NAME                                NODE
first-deployment-695f579bd4-5jq9q   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5t6tw   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5twcj   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-5xljr   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-8phq5   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-cfg6l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-jzft8   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-nf9fg   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-nsxb6   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-ptlkp   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rgdqj   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-rt5jt   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-vrl95   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-vwv7l   scw-kosmos-kosmos-scw-0937
first-deployment-695f579bd4-w9qqq   scw-kosmos-kosmos-scw-0937

And they are still all running on the same node.

To go further, we are going to play with more complex configurations. In order to do so without getting mixed up with our other configurations, deployments and pods, it is best to clean our environment and delete our deployment.

🔥 kubectl delete deployment first-deployment

Output
deployment.apps "first-deployment" deleted

Yaml files

Kubectl commands are nice, but when it comes to managing multiple Kubernetes objects, configuration files are a better and more reliable fit. In Kubernetes, configurations are made in yaml format, always following a pattern similar to the one below:

#example.yaml
---
apiVersion: v1      # version of the k8s api (v1 for a Pod, apps/v1 for a Deployment)
kind: Pod           # type of the Kubernetes object we aim to describe
metadata:           # additional options such as the object name, labels, annotations
  …
spec:               # parameters and options of the k8s object to create
  …

Selecting where to run our pods

In Kubernetes, several mechanisms are available to control how our workload is distributed: across nodes, across namespaces, or relative to other pods through affinity rules. Working in a Multi-Cloud Kubernetes environment makes their usage mandatory, and knowing how they behave can rapidly become crucial.

🔥 NodeSelector

A node selector is applied on a pod and will match labels that exist on the cluster nodes. The command below gives us all information about a given node, including labels, annotations, running pods, etc...

🔥 kubectl describe node scw-kosmos-kosmos-scw-09371579edf54552b0187a95

Output

Name:               scw-kosmos-kosmos-scw-09371579edf54552b0187a95
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=DEV1-M
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=nl-ams
                    failure-domain.beta.kubernetes.io/zone=nl-ams-1
                    k8s.scaleway.com/kapsule=b58a[...]
                    k8s.scaleway.com/managed=true
                    k8s.scaleway.com/node=0937[...]
                    k8s.scaleway.com/pool=313c[...]
                    k8s.scaleway.com/pool-name=kosmos-scw
                    k8s.scaleway.com/runtime=containerd
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=scw-kosmos-kosmos-scw-0937[...]
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=DEV1-M
                    provider=scaleway
                    topology.csi.scaleway.com/zone=nl-ams-1
                    topology.kubernetes.io/region=nl-ams
                    topology.kubernetes.io/zone=nl-ams-1
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.scaleway.com":"[...]"}
                    kilo.squat.ai/discovered-endpoints: {}
                    kilo.squat.ai/endpoint: 51.15.123.156:51820
                    kilo.squat.ai/force-endpoint: 51.15.123.156:51820
                    kilo.squat.ai/granularity: location
                    kilo.squat.ai/internal-ip: 10.67.36.37/31
                    kilo.squat.ai/key: cSP2[...]
                    kilo.squat.ai/last-seen: 1630917821
                    kilo.squat.ai/wireguard-ip: 10.4.0.1/16
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 06 Sep 2021 09:50:02 +0200
Taints:             <none>
[...]

Parts in brackets [...] are truncated from the output for readability.

A node selector can be applied on a pod using an existing label, or a new label created by the Kubernetes user, such as the labels we previously added on our nodes.

This is a sample yaml file using a node selector in a pod:

#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    provider: scaleway

The node selector ensures that the defined pod will only be scheduled on a node matching the condition. In this example, the nginx pod will only be scheduled on nodes with the label provider=scaleway.
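
To make the matching rule concrete, here is a minimal Python sketch of the idea: a node is eligible when it carries every key/value pair listed in the node selector. This is a simplified model written for this article, not the scheduler's actual code. The function name is an assumption, node names are shortened, and only the labels we added ourselves are reproduced.

```python
def matches_node_selector(node_labels, node_selector):
    """A node matches when it carries every key/value pair of the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Labels added earlier in this article (all other labels are omitted).
nodes = {
    "scw-kosmos-kosmos-scw-0937": {"provider": "scaleway", "k8s.scaleway.com/managed": "true"},
    "scw-kosmos-worldwide-5ecd": {"provider": "scaleway", "k8s.scaleway.com/managed": "false"},
    "scw-kosmos-worldwide-b2db": {"provider": "hetzner"},
}

selector = {"provider": "scaleway"}
eligible = [name for name, labels in nodes.items() if matches_node_selector(labels, selector)]
print(eligible)  # ['scw-kosmos-kosmos-scw-0937', 'scw-kosmos-worldwide-5ecd']
```

Adding a second entry to the selector, for instance k8s.scaleway.com/managed set to "false", narrows the result to a single node, since all entries must match at once.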

NodeAffinity

Node affinity also matches labels existing on nodes, but provides more flexibility and options in terms of the rules that are applied.

First of all, the node affinity accepts two different policies:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

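As their names suggest, the first policy is a hard requirement: a pod is only scheduled on a node that matches the rule. The second one is a preference: the scheduler favors matching nodes, but Kubernetes is still able to schedule the pod on a node that does not match the conditions. It allows the definition of preferences for pod scheduling instead of mandatory criteria. In both cases, IgnoredDuringExecution means that a pod which is already running is not evicted if the labels of its node change afterwards.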

The file here is an example of a configuration combining requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.

#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: provider
            operator: In
            values:
            - scaleway
            - hetzner
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - nl-ams
  containers:
  - name: nginx
    image: nginx

In this example, the pod is required to run on a node with provider=scaleway or provider=hetzner label, and should preferably be scheduled on a node with a label topology.kubernetes.io/region=nl-ams.
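
The interplay between the two policies can be sketched in a few lines of Python: the required rule filters nodes out, and the preferred rules only order the remaining candidates by the sum of their weights. This is a simplified model under stated assumptions: only the In operator is handled, the function names are ours, node names and region values are shortened, and the real scheduler combines this score with other criteria, so the preferred node is not guaranteed to win.

```python
def matches_in(labels, key, values):
    """Evaluate a single matchExpression using the In operator."""
    return labels.get(key) in values


def rank_nodes(nodes, required, preferred):
    """Filter nodes on the required rule, then order them by preference weight.

    required:  (key, values) that every candidate node must satisfy
    preferred: list of (weight, key, values); weights of satisfied rules add up
    """
    candidates = {name: labels for name, labels in nodes.items() if matches_in(labels, *required)}

    def score(name):
        return sum(w for w, key, values in preferred if matches_in(candidates[name], key, values))

    # sorted() is stable, so nodes with the same score keep their original order.
    return sorted(candidates, key=score, reverse=True)


REGION = "topology.kubernetes.io/region"
nodes = {
    "scw-kosmos-worldwide-5ecd": {"provider": "scaleway", REGION: "scw-kosmos-worldwide-5ecd"},
    "scw-kosmos-worldwide-b2db": {"provider": "hetzner", REGION: "scw-kosmos-worldwide-b2db"},
    "scw-kosmos-kosmos-scw-0937": {"provider": "scaleway", REGION: "nl-ams"},
    "unlabeled-node": {},  # hypothetical node without a provider label
}

ranked = rank_nodes(
    nodes,
    required=("provider", ["scaleway", "hetzner"]),
    preferred=[(1, REGION, ["nl-ams"])],
)
print(ranked)
```

The node without a provider label is excluded, the Amsterdam node comes first, and the two other nodes remain valid fallbacks.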

PodAffinity

The pod affinity constraint is applied on pods based on other pods' labels. It benefits from the same two policies as the node affinity:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

The difference with node affinity is that instead of defining rules for the cohabitation of pods on nodes, the pod affinity defines rules between pods, such as "pod 1 should run on the same node as pod 2".

In the following sample file, we specify that an nginx pod must be scheduled on a node with the same value for the provider label (topologyKey field) as a node that already runs a pod with the label app=one-per-provider.

#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - one-per-provider
        topologyKey: provider
  containers:
  - name: nginx
    image: nginx

PodAntiAffinity

The same way Kubernetes allows us to define pod affinities, we can also define pod anti affinities, thus defining preferences for pods to not cohabitate together under some conditions.

The same two policies are available:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

In the following sample file, we define that the nginx pod should ideally not be scheduled in the same zone (i.e. on a node with the same value for the topology.kubernetes.io/zone label) as pods with the security=S1 label.

#example.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S1
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: nginx
    image: nginx

🔥 Deploy & see: Spread our deployment across different providers

We are going to try out deploying an application across our two providers: Scaleway and Hetzner.

🔥 Let's create this antiaffinity.yaml configuration file to create a deployment where each pod will be deployed on nodes with a different provider label, and will not cohabitate with pods that have the app=one-per-provider label.

🔥

#antiaffinity.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: one-per-provider-deploy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: one-per-provider
  template:
    metadata:
      labels:
        app: one-per-provider
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - one-per-provider
              topologyKey: provider
      containers:
      - name: busytime
        image: busybox
        command: ["/bin/sh","-c","while true; do date; sleep 10; done"]

🔥 Apply the deployment configuration on our cluster using the following command.

🔥 kubectl apply -f antiaffinity.yaml

Output
deployment.apps/one-per-provider-deploy created

And observe pods that were generated and the nodes they run on.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

Output

NAME                                       NODE
one-per-provider-deploy-75945bb589-6jb87   scw-kosmos-worldwide-b2db
one-per-provider-deploy-75945bb589-db25x   scw-kosmos-kosmos-scw-0937

By looking at the names of the nodes, we can see the pool name in the middle, which tells us that our two pods are deployed on different pools, and thus on different nodes.

Since one of the nodes in our "worldwide" pool is a Scaleway node, let's make sure that our two pods actually run on different providers, and not both on Scaleway. This information can be found on the nodes themselves, by fetching the provider label we set on them at the beginning of the workshop.

🔥 kubectl get nodes scw-kosmos-worldwide-b2db708b0c474decb7447e0d6 -o custom-columns=NAME:.metadata.name,PROVIDER:.metadata.labels.provider

Output

NAME                                             PROVIDER
scw-kosmos-worldwide-b2db708b0c474decb7447e0d6   hetzner

🔥 kubectl get nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 -o custom-columns=NAME:.metadata.name,PROVIDER:.metadata.labels.provider

Output

NAME                                             PROVIDER
scw-kosmos-kosmos-scw-09371579edf54552b0187a95   scaleway

When we ask for the provider label of our two nodes, we can confirm that our two pods from the deployment (that have the same app=one-per-provider label) were scheduled on different providers.

🔥 Scaling up

Our previous deployment defined two replicas where each generated pod is labeled app=one-per-provider. These pods must be scheduled on nodes with different values for their provider label (topologyKey field).

As our cluster has only two different providers, and all our pods were created within the one-per-provider-deploy deployment, scaling up the deployment should not result in the scheduling of a third pod.

So let's try it by adding only one more replica.

🔥 kubectl scale deployment one-per-provider-deploy --replicas=3

Output
deployment.apps/one-per-provider-deploy scaled

Once the deployment has scaled up, we can list our pods and see what happened.

🔥 kubectl get pods

Output

NAME                                      READY  STATUS   RESTARTS AGE
one-per-provider-deploy-7594-29wr7  0/1    Pending  0        7s
one-per-provider-deploy-7594-6jb87  1/1    Running  0        12m
one-per-provider-deploy-7594-db25x  1/1    Running  0        12m

The third pod is present in our cluster as it was required by the deployment replicaset, but we can see that it is stuck in a Pending state, meaning that no node could be found to schedule it on.

The reason behind this behavior is that the pod is waiting for a node that satisfies all of its pod anti-affinity constraints, and there is currently no node from a third Cloud provider in our cluster. Until such a node is added to the cluster, our third pod will remain unavailable and in a Pending state.
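
The reasoning can be replayed with a small Python model of a required anti-affinity rule: a node is only feasible if no matching pod already runs in its topology domain, the domain being the value of the provider label. This is a sketch under stated assumptions, not the scheduler itself: the function name is ours, node names are shortened, and each replica simply takes the first feasible node.

```python
def feasible_nodes(node_domain, placed_pods):
    """Nodes whose topology domain does not already host a matching pod.

    node_domain: node name -> value of the topologyKey label (here: provider)
    placed_pods: names of the nodes already running a matching pod
    """
    used_domains = {node_domain[node] for node in placed_pods}
    return [node for node, domain in node_domain.items() if domain not in used_domains]


node_provider = {
    "scw-kosmos-kosmos-scw-0937": "scaleway",
    "scw-kosmos-worldwide-5ecd": "scaleway",
    "scw-kosmos-worldwide-b2db": "hetzner",
}

placed = []
for replica in range(1, 4):
    options = feasible_nodes(node_provider, placed)
    if not options:
        print(f"replica {replica}: Pending (no provider left)")
        continue
    # Simplification: take the first feasible node.
    placed.append(options[0])
    print(f"replica {replica}: {options[0]}")
```

With two providers, the first two replicas each claim one provider and the third one finds no feasible node, which matches the Pending pod we observed.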

Taints

A taint is a Kubernetes concept used to block pods from running on certain nodes.

The principle is to define key/value pairs completed by an effect (i.e. a taint policy). There are three possibilities:

  • NoSchedule: Forbids the scheduling of new pods on the node, but lets pods that are already running there keep running.
  • PreferNoSchedule: The scheduler tries to avoid placing pods on the node, but may still do so if no other node is suitable. Running pods are not affected.
  • NoExecute: Forbids pod execution on the node, resulting in the eviction of unauthorized running pods.

Example
user@local:~$ kubectl taint nodes tainted-node key1=value1:NoSchedule

In this example, a taint is applied to the node named tainted-node, and set with the effect (i.e. constraint or policy) NoSchedule.

This means that no pod has permission to be scheduled on tainted-node, except for pods with a specific authorization to do so. These authorizations are called tolerations and are covered in the next section of this article.

If a pod without the corresponding toleration is already running on tainted-node at the time the taint is added (i.e. when the kubectl command above is executed), the pod will not be evicted and will keep running on this node.

However, if the effect was set to NoExecute, any pod without the corresponding toleration would not be allowed to run on tainted-node, resulting in its eviction.
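
The difference between the three effects can be summed up in a short Python sketch. It is a simplified model written for this article: the function and pod names are assumptions, and refinements such as delayed eviction are ignored.

```python
def evicted_pods(effect, running_pods):
    """Pods removed from a node when a taint with this effect is added.

    running_pods maps a pod name to True when the pod tolerates the taint.
    Only NoExecute acts on pods that are already running.
    """
    if effect != "NoExecute":
        return []
    return [name for name, tolerates in running_pods.items() if not tolerates]


def new_pod_allowed(effect, tolerates):
    """Whether a new pod may be scheduled on the tainted node.

    PreferNoSchedule only discourages scheduling, so it never blocks a pod.
    """
    return tolerates or effect == "PreferNoSchedule"


# Hypothetical pods running on tainted-node at the time the taint is added.
running = {"pod-without-toleration": False, "pod-with-toleration": True}

for effect in ("NoSchedule", "PreferNoSchedule", "NoExecute"):
    print(effect, evicted_pods(effect, running), new_pod_allowed(effect, tolerates=False))
```

Only the NoExecute line reports an evicted pod, while both NoSchedule and NoExecute refuse a new pod that has no toleration.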

Tolerations

As stated before, tolerations are applied on pods to allow exceptions on tainted nodes. Two operators are available:

  • Equal: matches a node taint whose key and value are both equal to those of the toleration.
  • Exists: matches the existence of a node taint with the given key, regardless of its value.

In this example, the pod named busybox is granted permission to be scheduled on a tainted node with two taints:

  • key1=value1:NoSchedule
  • key2 with NoSchedule effect regardless of the value attributed to the taint.

# example-equal.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["/bin/sh","-c","sleep 3600"]
  tolerations:
  - key: key1
    operator: Equal
    value: "value1"
    effect: NoSchedule
  - key: key2
    operator: Exists
    effect: NoSchedule

Taints and tolerations can be combined to define very specific behaviors for the scheduling and execution of pods.
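
To see how the Equal and Exists operators decide whether a pod may land on a tainted node, here is a minimal Python sketch of the matching rule as we understand it from the Kubernetes documentation. Taints and tolerations are represented as plain dictionaries and the function names are assumptions made for illustration.

```python
def tolerates(toleration, taint):
    """Check one toleration against one taint (both plain dicts)."""
    # A toleration without effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # A toleration without key matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        return True
    return operator == "Equal" and toleration.get("value") == taint.get("value")


def pod_tolerates_all(tolerations, taints):
    """Every taint on the node needs at least one matching toleration."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Tolerations of the busybox pod from the example file.
tolerations = [
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"},
    {"key": "key2", "operator": "Exists", "effect": "NoSchedule"},
]

node_taints = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "key2", "value": "anything", "effect": "NoSchedule"},
]
print(pod_tolerates_all(tolerations, node_taints))  # True

# A different value for key1 is not covered by the Equal toleration.
other_taints = [{"key": "key1", "value": "value2", "effect": "NoSchedule"}]
print(pod_tolerates_all(tolerations, other_taints))  # False
```

A node without any taint accepts every pod in this model, since there is nothing to tolerate.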

Forbidding execution

To experiment with taints, we are going to taint our managed Scaleway node with autoscale=true:NoSchedule.
As this node is part of a managed pool of our cluster, it benefits from the auto-scaling feature. We want to grant permission only to pods that are configured to run on an auto-scalable pool.
We also want to exclude (i.e. evict) all running pods which do not have the toleration from this node.

Let's have a look at our cluster status by listing our pods and the nodes they run on.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

Output

NAME                                       NODE
one-per-provider-deploy-7594-29wr7   <none>
one-per-provider-deploy-7594-6jb87   scw-kosmos-worldwide-b2db
one-per-provider-deploy-7594-db25x   scw-kosmos-kosmos-scw-0937

We still have the same three pods, two of which are running, and one pending (as no node is attributed to it).


Our managed Scaleway pool has auto-scaling activated, using the preset label autoscale=true. We are, therefore, going to use the same key and value in a taint to forbid scheduling on this specific node.

🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoSchedule

Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 tainted

See what happened on our cluster after applying the taint, below:

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

Output

NAME                             NODE                       STATUS
one-per-provider-deploy-75-29wr7 <none>                     Pending
one-per-provider-deploy-75-6jb87 scw-kosmos-worldwide-b2db  Running
one-per-provider-deploy-75-db25x scw-kosmos-kosmos-scw-0937 Running

The state of our cluster has not changed. The reason for that is that the taint we added concerned scheduling, and our pods were already scheduled on our nodes at the time the taint was added.

Also, our taint forbids scheduling, but it does not forbid the execution of the pod.


Now, let's add a new taint to the same node. This time, however, we will set its effect to NoExecute.

🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoExecute

Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 tainted

Once this new taint is applied, we want to observe the behavior of our pods.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

Output

NAME                              NODE                       STATUS
one-per-provider-deploy-75-29wr7  <none>                     Pending
one-per-provider-deploy-75-6jb87  scw-kosmos-worldwide-b2db  Running
one-per-provider-deploy-75-vjxkq  scw-kosmos-worldwide-5ecd  Running

If we look closely at the node column of this output, we can see that node scw-kosmos-kosmos-scw-0937 no longer has a pod running on it. This pod was evicted when the taint with NoExecute effect was applied.

To maintain the number of replicas requested by our one-per-provider-deploy deployment, its replicaset created a new pod, which was scheduled on a node without the incompatible taints.

The new pod was created in a Pending state while the evicted pod was in a Terminating status. It was then scheduled on a node that it is allowed to run on (no autoscale taint) and that still satisfies the anti-affinity rule on the provider label: scw-kosmos-worldwide-5ecd is our unmanaged Scaleway node, so the deployment still runs one pod per provider.


Moving on to tolerations, we will create a pod with tolerations that allow it to run on our tainted managed Scaleway node, and with a node selector that targets this node through its location label topology.kubernetes.io/region=nl-ams (this label was set directly by the Scaleway Kubernetes engine during the managed pool creation).

🔥

#toleration.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: busytolerant
spec:
  containers:
  - name: busytolerant
    image: busybox
    command: ["/bin/sh","-c","sleep 3600"]
  tolerations:
  - key: autoscale
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: autoscale
    operator: Equal
    value: "true"
    effect: NoExecute
  nodeSelector:
    topology.kubernetes.io/region: "nl-ams"

This yaml file defines a pod that can only run on a node matching its node selector, and that tolerates the two taints we applied:

  • the pod tolerates the taint autoscale=true:NoSchedule
  • the pod tolerates the taint autoscale=true:NoExecute
  • the node must have the label topology.kubernetes.io/region=nl-ams

Let's apply this configuration to our cluster and observe what happens.

🔥 kubectl apply -f toleration.yaml

Output
pod/busytolerant created

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

Output

NAME                               NODE                        STATUS
busytolerant                       scw-kosmos-kosmos-scw-0937  Running
one-per-provider-deploy-75-29wr7   <none>                      Pending
one-per-provider-deploy-75-6jb87   scw-kosmos-worldwide-b2db   Running
one-per-provider-deploy-75-vjxkq   scw-kosmos-worldwide-5ecd   Running

We can see that with the right tolerations, the pod named busytolerant was perfectly able to be scheduled and executed on the node we tainted previously.

The addition of the constraint on the region label is just a way to show how all the workload distribution features Kubernetes offers are cumulative.
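
To illustrate this cumulative behavior, the sketch below models a node as eligible only when it passes both filters: the node selector and the taints. It is a deliberately simplified model: tolerations are compared with exact equality only, the names can_run and eligible are ours, node names and region values are shortened, and the taints are those present on the cluster at this point of the Hands-On.

```python
def can_run(pod, node):
    """A node is eligible only if it passes every filter at once."""
    selector_ok = all(
        node["labels"].get(key) == value
        for key, value in pod["nodeSelector"].items()
    )
    # Simplification: only exact (Equal) matches on key, value and effect.
    taints_ok = all(taint in pod["tolerations"] for taint in node["taints"])
    return selector_ok and taints_ok


REGION = "topology.kubernetes.io/region"
autoscale_taints = [("autoscale", "true", "NoSchedule"), ("autoscale", "true", "NoExecute")]

nodes = {
    "scw-kosmos-kosmos-scw-0937": {"labels": {REGION: "nl-ams"}, "taints": autoscale_taints},
    "scw-kosmos-worldwide-5ecd": {"labels": {REGION: "scw-kosmos-worldwide-5ecd"}, "taints": []},
    "scw-kosmos-worldwide-b2db": {"labels": {REGION: "scw-kosmos-worldwide-b2db"}, "taints": []},
}

busytolerant = {"nodeSelector": {REGION: "nl-ams"}, "tolerations": autoscale_taints}
# Hypothetical pod with the same node selector but no toleration.
intolerant = {"nodeSelector": {REGION: "nl-ams"}, "tolerations": []}


def eligible(pod):
    return [name for name, node in nodes.items() if can_run(pod, node)]


print(eligible(busytolerant))  # ['scw-kosmos-kosmos-scw-0937']
print(eligible(intolerant))    # []
```

The second pod has nowhere to go: its node selector excludes the two untainted nodes, and the taints exclude the only node its selector accepts.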


🔥 Removing the taints before moving forward

To avoid scheduling issues while moving forward in this Hands-On, it is best to remove the taints applied on our node. The command to do so is the same as the one to add the taint, with just the addition of the - (dash) character at the end of the taint declaration.

🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoSchedule-

Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 untainted

🔥 kubectl taint nodes scw-kosmos-kosmos-scw-09371579edf54552b0187a95 autoscale=true:NoExecute-

Output
node/scw-kosmos-kosmos-scw-09371579edf54552b0187a95 untainted

We can observe that removing the taints did not have an effect on the pods running in our cluster. This is because tolerations are authorizations, not prohibitions (as opposed to taints): a pod with a toleration can run on a node whether or not the matching taint is present.

🔥 kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

Output

NAME                             NODE                        STATUS
busytolerant                     scw-kosmos-kosmos-scw-0937  Running
one-per-provider-deploy-75-29wr7 <none>                      Pending
one-per-provider-deploy-75-6jb87 scw-kosmos-worldwide-b2db   Running
one-per-provider-deploy-75-vjxkq scw-kosmos-worldwide-5ecd   Running

Let's keep cleaning our environment and remove our busytolerant pod and our one-per-provider-deploy deployment one by one.

🔥 kubectl delete pods busytolerant

Output
pod "busytolerant" deleted

🔥 kubectl delete deployment one-per-provider-deploy

Output
deployment.apps "one-per-provider-deploy" deleted

🔥 kubectl get all

Output

NAME                TYPE       CLUSTER-IP  EXTERNAL-IP PORT(S)  AGE
service/kubernetes  ClusterIP  10.32.0.1   <none>      443/TCP  107m

PodTopologySpread constraint

The pod topology spread constraint aims to evenly distribute pods across nodes based on specific rules and constraints.

It lets us set the maximum allowed difference in the number of matching pods between topology domains (maxSkew parameter), where a domain groups the nodes sharing the same value for the topologyKey label, and determine the action to perform if the constraint cannot be met:

  • DoNotSchedule: hard constraint, the pod cannot be scheduled
  • ScheduleAnyway: soft constraint, the pod can still be scheduled even if the condition is not met.
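
As a rough sketch of how maxSkew acts as a hard constraint, the Python model below only accepts a domain if adding one more pod keeps its count within maxSkew of the least loaded domain. The function name and the placement loop are assumptions made for illustration; the real scheduler takes more criteria into account.

```python
def allowed_domains(counts, max_skew):
    """Domains where one more matching pod keeps the skew within max_skew.

    counts: topology domain -> number of matching pods already running there
    """
    lowest = min(counts.values())
    return [domain for domain, count in counts.items() if count + 1 - lowest <= max_skew]


counts = {"scaleway": 0, "hetzner": 0}
for _ in range(10):
    # Simplification: take the first allowed domain.
    choice = allowed_domains(counts, max_skew=1)[0]
    counts[choice] += 1

print(counts)  # {'scaleway': 5, 'hetzner': 5}
```

With a maxSkew of 1, a domain that is already one pod ahead is refused until the other one catches up, so ten replicas end up split five and five between two domains.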

The sample file below shows the type of configuration to apply a topologySpreadConstraint on pods created from a deployment.

# example.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busy-topologyspread
spec:
  replicas: 10
  selector:
    matchLabels:
      app: busybox-acrossproviders
  template:
    metadata:
      labels:
        app: busybox-acrossproviders
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: provider
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: busybox-acrossproviders
      containers:
      - name: busybox-everywhere
        image: busybox
        command: ["/bin/sh","-c","sleep 3600"]

🔥 Distributing our workload

The topology spread constraint is particularly useful to spread the workload of one or multiple applications evenly throughout a Kubernetes cluster.

🔥 We are going to define a spread.yaml file to set up a deployment with ten replicas, which should be scheduled evenly between the nodes labeled provider=scaleway and those labeled provider=hetzner.

We authorize a difference of only one pod between our two providers using the maxSkew parameter:

🔥

#spread.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busyspread
spec:
  replicas: 10
  selector:
    matchLabels:
      app: busyspread-providers
  template:
    metadata:
      labels:
        app: busyspread-providers
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: provider
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: busyspread-providers
      containers:
      - name: busyspread
        image: busybox
        command: ["/bin/sh","-c","sleep 3600"]

🔥 Let's apply this deployment.

🔥 kubectl apply -f spread.yaml

Output
deployment.apps/busyspread created

To see the distribution of our pods across the nodes of our cluster, we are going to list our busyspread pods, the nodes they run on, and count the number of occurrences.

🔥 kubectl get pods -o wide --no-headers | grep busyspread | awk '{print $7}' | sort | uniq -c

Output
2 scw-kosmos-kosmos-scw-09371579edf54552b0187a95
3 scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6
5 scw-kosmos-worldwide-b2db708b0c474decb7447e0d6

Knowing that the first two nodes in this list have the label provider=scaleway and the third one has the label provider=hetzner, we do have an even distribution of our workload across our providers, with five pods for each of them.
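
The same check can be written down explicitly. The short Python sketch below groups the per-node counts from the output above by provider and computes the resulting skew; the function name is ours, and the provider mapping comes from the labels we set at the beginning of this article.

```python
from collections import Counter

# Pod counts per node, as reported by the command output.
pods_per_node = {
    "scw-kosmos-kosmos-scw-09371579edf54552b0187a95": 2,
    "scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6": 3,
    "scw-kosmos-worldwide-b2db708b0c474decb7447e0d6": 5,
}

# Value of the provider label on each node.
node_provider = {
    "scw-kosmos-kosmos-scw-09371579edf54552b0187a95": "scaleway",
    "scw-kosmos-worldwide-5ecdb6d02cf84d63937af45a6": "scaleway",
    "scw-kosmos-worldwide-b2db708b0c474decb7447e0d6": "hetzner",
}


def pods_per_provider(pods_per_node, node_provider):
    """Sum the pod counts of all nodes sharing the same provider label."""
    totals = Counter()
    for node, count in pods_per_node.items():
        totals[node_provider[node]] += count
    return dict(totals)


totals = pods_per_provider(pods_per_node, node_provider)
skew = max(totals.values()) - min(totals.values())
print(totals, skew)  # {'scaleway': 5, 'hetzner': 5} 0
```

The constraint is evaluated per provider and not per node, which is why an uneven split of two and three pods between the two Scaleway nodes is perfectly acceptable.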


The next step of this Hands-On will be to set up Load Balancing and Storage management within a Multi-Cloud Kubernetes cluster.

🔥 In order to avoid getting mixed up in all our pods and deployments, we are going to clean our environment by deleting our busyspread deployment.

🔥 kubectl delete deployment busyspread

Output
deployment.apps "busyspread" deleted

Next step