Nomad Autoscaler 404 Error for GCE Managed Instance Group After Upgrade to CircleCI Server 4.9.x

Overview

After upgrading CircleCI Server to 4.9.x, the following errors appear in the nomad-autoscaler pod logs:

[WARN] policy_manager.policy_handler: failed to get target status: policy_id=<policy_id> error="failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound"
[ERROR] policy_manager.policy_handler: failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound: policy_id=<policy_id>

The autoscaler fails to locate the GCE Managed Instance Group (MIG), and Nomad client scaling stops functioning.
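To confirm which MIG name the autoscaler is actually requesting, the name can be pulled out of the 404 messages. The following is a minimal sketch: requested_mig_names is a hypothetical helper, not something shipped with CircleCI Server, and it assumes the log lines follow the format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CircleCI Server): reads autoscaler log
# lines on stdin and prints each distinct MIG name reported as not found.
requested_mig_names() {
  # Keep only the quoted resource path from each 404 message,
  # then strip everything up to the last "/" to leave the MIG name.
  sed -n "s/.*The resource '\([^']*\)' was not found.*/\1/p" |
    sed 's|.*/||' |
    sort -u
}

# Usage (requires cluster access):
#   kubectl logs deployment/nomad-autoscaler -n <circleci_namespace> | requested_mig_names
```

If the name printed is the pre-4.9.0 one (for example prod-nomad), the cause described below is the likely one.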

Root Cause

The issue is caused by a change to the google_compute_instance_group_manager Terraform resource in the server-terraform module, which renamed the MIG. If the autoscaler configuration still references the old name, its lookups return a 404.

Prior to 4.9.0, the MIG name was:

name = "${var.name}-nomad" 
# e.g. "prod-nomad"

From 4.9.0, the MIG name changed to:

name = "${var.name}-nomad-client-group"
# e.g. "prod-nomad-client-group"
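The naming rule above can be sketched as a small shell function. This is illustrative only: expected_mig_name is a hypothetical helper, and it assumes the new suffix applies to every version from 4.9.0 onward, including later major versions, which the module itself should be checked against.

```shell
#!/bin/sh
# Hypothetical helper: prints the MIG name expected for a given base name
# (var.name in the Terraform module) and CircleCI Server version.
expected_mig_name() {
  base="$1"
  version="$2"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # second version component
  # Assumption: 4.9.0 and everything after it uses the new suffix.
  if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }; then
    printf '%s-nomad-client-group\n' "$base"
  else
    printf '%s-nomad\n' "$base"
  fi
}

# Example: expected_mig_name prod 4.9.0
```

The Terraform state remains the source of truth; the function only restates the rule described in this section.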

Solution

  1. Confirm the current MIG name in Terraform state

    terraform state show google_compute_instance_group_manager.nomad
    # or
    terraform show | grep -A5 "nomad_client_group"

  2. Update values.yaml with the correct MIG name

    nomad:
      auto_scaler:
        gcp:
          mig_name: "prod-nomad-client-group"
        
  3. Apply the Helm upgrade

    helm upgrade <release_name> <chart> -n <circleci_namespace> -f values.yaml

You may also need to restart the autoscaler so that the pod picks up the newly mounted policy:

    kubectl rollout restart deployment/nomad-autoscaler -n <circleci_namespace>
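Before applying the Helm upgrade in step 3, it can help to check that values.yaml really contains the name found in step 1. The sketch below uses two hypothetical helpers, configured_mig_name and mig_name_matches. It assumes a single mig_name key written on one line, as in the example above. A YAML-aware tool would be more robust than this text match.

```shell
#!/bin/sh
# Hypothetical helper: prints the first mig_name value found in a values file.
configured_mig_name() {
  # $1: path to values.yaml
  sed -n 's/^[[:space:]]*mig_name:[[:space:]]*//p' "$1" |
    tr -d "\"'" |                 # drop surrounding quotes
    sed 's/[[:space:]]*$//' |     # drop trailing whitespace
    head -n 1
}

# Hypothetical helper: succeeds when the configured name equals the expected one.
mig_name_matches() {
  # $1: path to values.yaml, $2: MIG name taken from the Terraform state
  [ "$(configured_mig_name "$1")" = "$2" ]
}

# Usage:
#   mig_name_matches values.yaml "prod-nomad-client-group" || echo "mig_name mismatch"
```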
