Onboard onto DevZero Karp for EKS
A step-by-step guide for setting up dzKarp on EKS.

This guide will help you onboard onto DevZero Karp (dzKarp) in an EKS cluster. We make the following assumptions:
- You will use an existing EKS cluster
- The EKS cluster has dakr-operator installed
- You will use existing VPC and subnets
- You will use existing security groups
- Your nodes are part of one or more node groups
- Your workloads have pod disruption budgets that adhere to EKS best practices
- Your cluster has an OIDC provider for service accounts
This guide also assumes you have the AWS CLI installed. You can perform many of these steps in the console, but we will use the command line for simplicity.
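If you want to confirm the OIDC prerequisite before starting, the optional check below lists the IAM OIDC providers in your account and filters for your cluster's issuer ID (replace <your cluster name> as in the rest of this guide); an empty result means no OIDC provider is associated with the cluster.
aws iam list-open-id-connect-providers | grep "$(aws eks describe-cluster --name <your cluster name> --query "cluster.identity.oidc.issuer" --output text | cut -d'/' -f5)"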
Set Environment Variables
First, set a variable for your cluster name:
export CLUSTER_NAME=<your cluster name>
Next, set other variables from your cluster configuration.
export KARPENTER_NAMESPACE=kube-system
export AWS_PARTITION="aws" 
export AWS_REGION="$(aws configure list | grep region | tr -s " " | cut -d" " -f3)"
export OIDC_ENDPOINT="$(aws eks describe-cluster --name "${CLUSTER_NAME}" \
    --query "cluster.identity.oidc.issuer" --output text)"
export OIDC_PROVIDER_ID=$(echo $OIDC_ENDPOINT | cut -d'/' -f5)
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' \
    --output text)
export TEMPOUT="$(mktemp)"
export KARPENTER_VERSION="1.7.1"
export K8S_VERSION=$(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.version" --output text)
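Before continuing, it can help to echo the variables and confirm that none of them came back empty:
echo "${KARPENTER_NAMESPACE}" "${KARPENTER_VERSION}" "${K8S_VERSION}" "${CLUSTER_NAME}" \
    "${AWS_PARTITION}" "${AWS_REGION}" "${OIDC_ENDPOINT}" "${OIDC_PROVIDER_ID}" "${AWS_ACCOUNT_ID}"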
Run CloudFormation
Run the CloudFormation script below to configure the AWS IAM roles and policies needed for node management operations, and to set up a queue for spot interruption events.
curl -fsSL https://raw.githubusercontent.com/devzero-inc/dakr-operator-installers/refs/tags/dzkarp/dzKarp/cloudformation.yaml  > "${TEMPOUT}" \
&& aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file "${TEMPOUT}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    "ClusterName=${CLUSTER_NAME}" \
    "AWSRegion=${AWS_REGION}" \
    "OIDCProviderID=${OIDC_PROVIDER_ID}" \
    "KarpenterNamespace=${KARPENTER_NAMESPACE}"Add Tags to Subnets and Security Groups
Add Tags to Subnets and Security Groups
We need to add tags to our subnets and security groups so that dzKarp knows which ones to use.
VPC_ID=$(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.resourcesVpcConfig.vpcId" --output text)
aws ec2 create-tags \
    --resources $(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" --query "Subnets[].SubnetId" --output text) \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"Add tags to our cluster security group.
aws ec2 create-tags \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
    --resources $(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)
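As an optional check, you can confirm the discovery tag was applied by filtering on it; both commands should return the resources you just tagged.
aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query "Subnets[].SubnetId" --output text
aws ec2 describe-security-groups --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query "SecurityGroups[].GroupId" --output text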
Update aws-auth ConfigMap
We need to allow nodes that use the node IAM role we just created to join the cluster. To do that, modify the aws-auth ConfigMap in the cluster.
kubectl edit configmap aws-auth -n kube-system
You will need to add a section to mapRoles that looks something like this. Replace the ${AWS_PARTITION} variable with the account partition, the ${AWS_ACCOUNT_ID} variable with your account ID, and the ${CLUSTER_NAME} variable with the cluster name, but do not replace the {{EC2PrivateDNSName}}.
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}
  username: system:node:{{EC2PrivateDNSName}}
The full aws-auth ConfigMap should have two mapRoles entries: one for your dzKarp node role and one for your existing node group role.
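After saving the edit, you can dump the ConfigMap to confirm both entries are present:
kubectl get configmap aws-auth -n kube-system -o yaml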
Deploy dzKarp
Install dzKarp via Helm
# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws
helm upgrade --install karpenter oci://public.ecr.aws/devzeroinc/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
Verify dzKarp
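Before tailing the logs, you can confirm the controller pods are running; this uses the same app.kubernetes.io/name=karpenter label as the logs command below.
kubectl get pods -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter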
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller
Check the logs and confirm that no unexpected errors are produced.
Set nodeAffinity for critical workloads (optional)
Autoscaled nodes can be prone to churn, which can disturb the workloads running on them.
You may want to set a nodeAffinity on critical cluster workloads to mitigate this.
Some examples are:
- coredns
- metrics-server
Add the following affinity to your cluster-critical workload Deployments:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
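As a sketch of one way to apply this, the patch below adds the affinity to coredns; the Deployment name and namespace are assumptions and may differ in your cluster.
# Hypothetical example: adjust the Deployment name/namespace (coredns in kube-system) for your cluster
cat <<'EOF' > affinity-patch.yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: DoesNotExist
EOF
kubectl -n kube-system patch deployment coredns --patch-file affinity-patch.yaml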
Create Node Policy
We need to create a Node Policy in DevZero and have it target the cluster on which dzKarp was just installed.
Head over to the optimization dashboard, click on "Create Node Policy" and follow the form to create a policy suitable for your needs.
After the Policy is created, click on it in the menu and point it at the cluster on which you just installed dzKarp via "Create Target".
In about a minute this should create NodePool and EC2NodeClass objects in your Kubernetes cluster.
Check them out:
kubectl describe ec2nodeclass
kubectl describe nodepools
Migrate workloads onto autoscaled nodes
If your workloads do not have pod disruption budgets set, the following commands will cause periods of workload unavailability.
If you have cluster-autoscaler installed, it must be disabled first: scale its deployment down to zero before you proceed.
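For example, if cluster-autoscaler runs as a Deployment named cluster-autoscaler in kube-system (names vary by installation), you could scale it down with:
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0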
To remove the instances that were created by the node group, scale the node group down to a minimum size that still supports dzKarp and other critical services.
If you have a single multi-AZ node group, we suggest keeping 2 instances. Set NODEGROUP to the name of your node group, then scale it down:
export NODEGROUP=<your node group name>
aws eks update-nodegroup-config --cluster-name "${CLUSTER_NAME}" \
    --nodegroup-name "${NODEGROUP}" \
    --scaling-config "minSize=2,maxSize=2,desiredSize=2"Or, if you have multiple single-AZ node groups, we suggest 1 instance each.
for NODEGROUP in $(aws eks list-nodegroups --cluster-name "${CLUSTER_NAME}" \
    --query 'nodegroups' --output text); do aws eks update-nodegroup-config --cluster-name "${CLUSTER_NAME}" \
    --nodegroup-name "${NODEGROUP}" \
    --scaling-config "minSize=1,maxSize=1,desiredSize=1"
done
If you have a lot of nodes or workloads, you may want to scale down your node groups slowly, a few instances at a time. Watch the transition carefully for workloads that may not have enough replicas running or that lack disruption budgets.
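A couple of quick checks that can help while you scale down: list the disruption budgets that exist for your workloads, and look for pods stuck waiting for capacity.
kubectl get pdb --all-namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending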
As the node group nodes are drained, you can verify that dzKarp is creating nodes for your workloads.
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller
You should also see new nodes created in your cluster as the old nodes are removed.
kubectl get nodes
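You can also list the NodeClaims that the controller creates to track each provisioned node, and filter nodes by the karpenter.sh/nodepool label to see which ones dzKarp launched:
kubectl get nodeclaims
kubectl get nodes -l karpenter.sh/nodepool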