How to upgrade an EKS cluster using Terraform
To demonstrate an EKS upgrade with Terraform, I have prepared a 3-node cluster and deployed some workloads to it. I will first install EKS version 1.20, then upgrade the cluster to 1.21, and finally upgrade the worker nodes safely. The cluster has two worker groups: worker-group-1 with 2 nodes and worker-group-2 with 1 worker node.
For reference, I have used this repo for the demonstration.
We now have an EKS cluster at version 1.20 up and running. I am going to deploy some workloads to make the cluster upgrade more realistic.
Next, update the cluster version in your eks_cluster module to the next desired version of EKS. Remember clusters managed by EKS can only be upgraded one minor version at a time, so if you are currently at 1.20, you can upgrade to 1.21.
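To make this concrete, here is a minimal sketch of what the version bump might look like, assuming you are using the community terraform-aws-modules/eks/aws module; the module version pin and cluster name are placeholders from my setup:

module "eks_cluster" {
  source          = "terraform-aws-modules/eks/aws" # assumption: community EKS module
  version         = "~> 17.0"                       # hypothetical pin; keep whatever you already use
  cluster_name    = "my-eks-cluster"                # placeholder
  cluster_version = "1.21"                          # bumped from "1.20", one minor version at a time

  # subnets, vpc_id, worker_groups and the rest of your configuration stay unchanged
}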
When you run the terraform plan command, you may see some changes that seem unrelated to upgrading the EKS cluster version. The EKS Terraform module is updated often, which may cause some resources to be renamed; these show up as updates or as destroy-and-recreate operations. I highly recommend storing the kubeconfig and aws_auth_config files generated by this module in version control so you can always access them in case something goes wrong during the cluster update. Once you are satisfied with the changes, run terraform apply and confirm the changes again.
I found the following resource changes during the plan command.
Plan: 2 to add, 3 to change, 2 to destroy.
For me, the apply took almost 33 minutes. If you check the EKS cluster dashboard, you will see that the cluster version has already been updated to 1.21.
However, the worker nodes still show the previous version, 1.20.
Now it is important to also upgrade kube-proxy so that it properly supports 1.21. First, check exactly which Kubernetes version is installed in the cluster. From this link we find that 1.21.2 is the version supported for the 1.21 release. We can confirm this by running the kubectl version command:
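kubectl version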
Server Version: version.Info{Major:”1", Minor:”21+”, GitVersion:”v1.21.2-eks-0389ca3", GitCommit:”8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:”clean”, BuildDate:”2021–07–31T01:34:46Z”, GoVersion:”go1.16.5", Compiler:”gc”, Platform:”linux/amd64"}
To upgrade kube-proxy, we need to update the image of the existing DaemonSet. From this link you will find which image version is supported by the 1.21 EKS cluster.
You will find that the image tag 1.21.2-eksbuild.2 is the supported kube-proxy image for EKS 1.21. Let's update the image in the kube-proxy DaemonSet:
kubectl edit daemonsets.apps kube-proxy -n kube-system
Set the image tag to 1.21.2-eksbuild.2 and save the file.
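If you prefer not to edit the DaemonSet interactively, a kubectl set image one-liner like the sketch below should also work; the registry here is a placeholder and should match whatever the current image already uses, and kube-proxy is assumed to be the container name in the stock EKS DaemonSet:

kubectl -n kube-system set image daemonset/kube-proxy kube-proxy=<current-registry>/eks/kube-proxy:v1.21.2-eksbuild.2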
You will see a new rollout triggered, and the kube-proxy pods will go through a rolling update.
Now, to upgrade our worker nodes to the new version, we will create a new worker group at the new version and then move our pods over to it. The first step is to add a new configuration block to your worker groups configuration in Terraform. Your configuration will differ depending on which features of the eks module you are using, and it is important to copy your existing worker group settings so that the new group matches the old configuration. You can also specify a new AMI, if any, here. You will find your AMI in this link.
Here is my new worker group:
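Roughly, the worker_groups list in my eks module now looks like the sketch below; the new group name, instance type, and AMI ID are placeholders, and the key names assume the worker_groups input of the terraform-aws-modules/eks module:

worker_groups = [
  {
    name                 = "worker-group-1"
    instance_type        = "t3.medium"            # placeholder, same as before
    asg_desired_capacity = 2
  },
  {
    name                 = "worker-group-2"
    instance_type        = "t3.medium"            # placeholder, same as before
    asg_desired_capacity = 1
  },
  {
    name                 = "worker-group-3"       # new group on the new version
    ami_id               = "ami-xxxxxxxxxxxxxxxxx" # placeholder 1.21 EKS-optimized AMI
    instance_type        = "t3.medium"            # copy your old settings here
    asg_desired_capacity = 2
  },
]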
Now run the terraform plan and terraform apply commands. The plan will show the following changes, including volumes and other resources:
Plan: 3 to add, 0 to change, 0 to destroy.
After the apply completes successfully, you will find that two new nodes come up running version 1.21.2.
Now it's time to remove the previous worker-group-2 from the node pool. To do that, I need to cordon its node so that all workloads migrate to the new nodes. But before that, if you have a cluster autoscaler running, you need to scale it down to zero replicas; otherwise it will interfere with your cluster scaling.
kubectl scale deployments/cluster-autoscaler --replicas=0 -n kube-system
Then, from the AWS dashboard, find the list of nodes associated with worker-group-2. In my case, ip-10-0-3-179.ec2.internal is the candidate.
First make the node unschedulable by cordoning it:
kubectl cordon ip-10-0-3-179.ec2.internal
Now check the status of the node with kubectl get nodes; the cordoned node should show SchedulingDisabled.
Now it’s time to drain the old nodes and force the pods to move to new nodes. I recommend doing this one node at a time to ensure that everything goes smoothly, especially in a production cluster:
kubectl drain ip-10-0-3-179.ec2.internal --ignore-daemonsets --delete-local-data
If you list all pods, you will see every workload still running without any interruption, and no pods remaining on the legacy node.
If you have multiple nodes in your node group, repeat this for each node. Remember to create the new node group with at least as many nodes as your legacy node group.
Once you have confirmed that all non-DaemonSet pods are running on the new nodes, we can terminate the old worker group. Since the eks module uses an ordered array of worker group config objects in the worker_groups key, you cannot just delete the old config: Terraform would see the change, assume the order has shifted, and try to recreate the Auto Scaling groups. Instead, we should recognize that this will not be the last time we do an upgrade, and that empty Auto Scaling groups are free. So we will keep the old worker group configuration and simply run it with 0 capacity, like so:
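As a sketch, the retired group keeps its position in the worker_groups list but is scaled down to zero (again assuming the worker_groups keys of the terraform-aws-modules/eks module, with placeholder values):

worker_groups = [
  {
    name = "worker-group-1"
    # ... unchanged ...
  },
  {
    name                 = "worker-group-2" # kept in place so the ASG ordering does not shift
    instance_type        = "t3.medium"      # unchanged placeholder
    asg_min_size         = 0
    asg_max_size         = 0
    asg_desired_capacity = 0
  },
  {
    name = "worker-group-3"                 # the new 1.21 group added earlier
    # ... unchanged ...
  },
]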
Run terraform plan and terraform apply. After a successful rollout you will see the following; note that worker-group-2 no longer has any nodes.
Repeat the same steps for worker-group-1 as well. This lets you upgrade your worker nodes without downtime. The next time we upgrade, we can put the new config in this empty worker group and easily spin up workers without worrying about the Terraform state. Finally, if you scaled down your cluster-autoscaler, revert that change so that autoscaling works properly again.