
Onboard onto DevZero Karp for GKE

A step-by-step guide for setting up dzKarp on GKE.


This guide will help you onboard onto DevZero Karp in a GKE cluster. We make the following assumptions:

  • You will use an existing GKE cluster
  • The cluster has dakr-operator installed
  • You have gcloud, helm, kubectl installed
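You can quickly sanity-check these prerequisites before continuing. The grep below simply searches all namespaces, since the dakr-operator install namespace may vary:

# Confirm the required CLIs are installed
gcloud version
helm version
kubectl version --client

# Confirm the dakr-operator is running
kubectl get pods -A | grep dakr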

Set Environment Variables

export PROJECT_ID=<your-google-project-id>
export GSA_NAME=karpenter-gsa
export CLUSTER_NAME=<gke-cluster-name>
# region or zone depending on if you are running a zonal or regional cluster
export REGION=<gke-region-or-zone-name>

Create GCP Service Account

# Create Google Service Account
gcloud iam service-accounts create $GSA_NAME --project=$PROJECT_ID

# Add required IAM roles
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/compute.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/container.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"

Ensure Cluster Supports Workload Identity

gcloud container clusters update $CLUSTER_NAME \
    --location=$REGION \
    --workload-pool=$PROJECT_ID.svc.id.goog
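To verify that the workload pool is configured, describe the cluster and check the Workload Identity field:

# Should print <your-google-project-id>.svc.id.goog
gcloud container clusters describe $CLUSTER_NAME \
    --location=$REGION \
    --format="value(workloadIdentityConfig.workloadPool)"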

Create Dedicated Nodepool for dzKarp

gcloud container node-pools create dzkarp \
    --cluster=$CLUSTER_NAME \
    --location=$REGION \
    --workload-metadata=GKE_METADATA \
    --machine-type=e2-medium \
    --num-nodes=2
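Once the node pool is provisioned, its nodes should join the cluster carrying the standard GKE node pool label:

# Confirm the dzkarp node pool is ready
gcloud container node-pools describe dzkarp \
    --cluster=$CLUSTER_NAME \
    --location=$REGION \
    --format="value(status)"

# Confirm its nodes have registered with the cluster
kubectl get nodes -l cloud.google.com/gke-nodepool=dzkarp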

Configure Workload Identity Binding

gcloud iam service-accounts add-iam-policy-binding $GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:$PROJECT_ID.svc.id.goog[karpenter-system/karpenter]"

Deploy dzKarp

Install dzKarp via Helm

# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws

helm upgrade --install karpenter oci://public.ecr.aws/devzeroinc/dzkarp-gcp/karpenter \
  --version 1.0.4 \
  --namespace karpenter-system --create-namespace \
  --set "controller.settings.projectID=${PROJECT_ID}" \
  --set "controller.settings.location=${REGION}" \
  --set "controller.settings.clusterName=${CLUSTER_NAME}" \
  --set "credentials.enabled=false" \
  --set "serviceAccount.annotations.iam\.gke\.io/gcp-service-account=${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --wait

Verify dzKarp

kubectl get pods -n karpenter-system
kubectl logs -n karpenter-system deployment/karpenter

Verify that no unexpected errors are produced in the logs.
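You can also confirm that the chart applied the Workload Identity annotation to the karpenter service account (the service account name matches the binding configured earlier):

# Should print karpenter-gsa@<your-google-project-id>.iam.gserviceaccount.com
kubectl get serviceaccount karpenter -n karpenter-system \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}{"\n"}'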

Set nodeAffinity for critical workloads (optional)

Autoscaled nodes can be prone to churn, which can disrupt workloads.

You may want to set a nodeAffinity on critical cluster workloads to mitigate this.

Some examples are:

  • coredns
  • metrics-server

Add the following to your cluster-critical workload deployments:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.sh/nodepool
          operator: DoesNotExist
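For example, to apply this affinity to an existing deployment without editing its manifest, you could patch it in place (the metrics-server name and kube-system namespace below are illustrative; adjust them to match your cluster):

kubectl -n kube-system patch deployment metrics-server --type strategic -p '
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: DoesNotExist'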

Create Node Policy

We need to create a Node Policy in DevZero, and have it target the cluster on which dzKarp was just installed.

Head over to the optimization dashboard, click on "Create Node Policy" and follow the form to create a policy suitable for your needs.

After the Policy is created, click on it in the menu and point it at the cluster you just created via "Create Target".

In about a minute, this should create nodepool and nodeclass objects in your Kubernetes cluster.

Check them out:

kubectl describe gcenodeclass
kubectl describe nodepools

Migrate workloads onto autoscaled nodes

If your workloads do not have pod disruption budgets set, the following commands will cause periods of workload unavailability.

If you have cluster-autoscaler installed, it must be disabled first: scale its deployment down to zero before you proceed.

The most cost-effective way to run dzKarp is to have as few manually configured nodes as possible. The command below scales every node pool other than dzkarp down to zero nodes and disables autoscaling on it.

for POOL_NAME in $(gcloud container node-pools list --cluster=$CLUSTER_NAME --location=$REGION --format="value(name)" | grep -v '^dzkarp$'); do 
    echo "Scaling down nodepool: $POOL_NAME to 0 nodes (and disabling autoscaling)..."
    gcloud container clusters update $CLUSTER_NAME \
        --no-enable-autoscaling \
        --node-pool=$POOL_NAME \
        --location=$REGION \
        --quiet
    gcloud container clusters resize $CLUSTER_NAME \
        --node-pool=$POOL_NAME \
        --num-nodes=0 \
        --location=$REGION \
        --quiet
done

If you have many nodes or workloads, you may want to scale down your node pools a few instances at a time. It is recommended to watch the transition carefully for workloads that may not have enough replicas running or disruption budgets configured.
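One way to migrate gradually is to cordon and drain existing nodes one at a time, waiting for workloads to reschedule before moving on (node and pool names below are placeholders):

# List the nodes in a legacy node pool
kubectl get nodes -l cloud.google.com/gke-nodepool=<legacy-pool-name>

# Cordon and drain one node, then watch pods reschedule before continuing
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data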

As nodepools are drained, you can verify that dzKarp is creating nodes for your workloads.

kubectl logs -n karpenter-system deployment/karpenter

You should also see new nodes created in your cluster as the old nodes are removed.

kubectl get nodes
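To see only the nodes that dzKarp provisioned, you can filter on the karpenter.sh/nodepool node label (assuming the standard Karpenter label used in the affinity example above):

kubectl get nodes -L karpenter.sh/nodepool -l karpenter.sh/nodepool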