
[COUPANG] Server Diagnostic Guide: Collecting Critical Information

This guide is designed to help you troubleshoot issues in your CircleCI self-hosted server environment. Follow the step-by-step instructions to gather the necessary diagnostic information.

🚨 Important: Timely Log Collection 🚨

Collecting a support bundle is crucial and time-sensitive for all issues. CircleCI logs are retained for a limited time, and log rotation may cause critical information to be lost.

⚠️ Create support bundles within 10 minutes of the issue occurring to prevent loss of relevant logs, which can make diagnosis extremely difficult.

Quick Reference

Issue Type: Essential Logs

    • Docker Executor Issues (Job Delay, Infra Fail): Support bundle + Nomad alloc logs

    • Machine Executor Issues (Job Delay, Infra Fail): Support bundle + Journalctl logs

    • Permission Problems: Support bundle + AWS error messages

    • CircleCI API Connection Issues: Support bundle + API request logs with -vvv

    • Custom Integration Issues: Support bundle + Integration logs -vvv

The commands for each issue type are listed in the corresponding section below.

Initial Diagnostics

Support Bundle Collection (REQUIRED)

For all issues, start by collecting a support bundle:

kubectl support-bundle https://raw.githubusercontent.com/CircleCI-Public/server-scripts/main/support/support-bundle.yaml -n circleci-server

Important Notes:

  1. If you receive a timeout error with RabbitMQ, the bundle may still contain valuable information.

  2. Run this command as soon as possible after observing an issue. If more than 10 minutes have passed, rerun the job or reproduce the issue so that the relevant logs are captured.

  3. Include the job ID and the timestamp of when the issue occurred when submitting to support.
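The 10-minute window and the job ID/timestamp requirement above can be checked with a small helper before you file a ticket. This is a minimal sketch: the function names within_bundle_window and bundle_record are hypothetical and not part of any CircleCI tooling, and timestamps are assumed to be Unix epoch seconds.

```shell
# Hypothetical helpers (not CircleCI tooling).
# Succeeds when at most 600 seconds separate the issue from "now".
within_bundle_window() {
    local issue_epoch="$1"
    local now_epoch="${2:-$(date +%s)}"   # defaults to the current time
    local window=600                      # 10 minutes, per the guidance above
    [ $(( now_epoch - issue_epoch )) -le "$window" ]
}

# Prints a one-line record (job ID, issue time, window status) for the ticket.
bundle_record() {
    local job_id="$1"
    local issue_epoch="$2"
    local now_epoch="${3:-$(date +%s)}"
    if within_bundle_window "$issue_epoch" "$now_epoch"; then
        echo "job=${job_id} issue_epoch=${issue_epoch} status=within-window"
    else
        echo "job=${job_id} issue_epoch=${issue_epoch} status=late-reproduce-issue"
    fi
}
```

For example, bundle_record 1234 "$(date -d '5 minutes ago' +%s)" (GNU date syntax) reports whether a bundle collected now would still fall inside the window.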

Retrieving Job Details (IMPORTANT)

To get complete job information for troubleshooting, collect the job details from the API:

Replace the server domain, organization, project, and [JOB_NUM] with your own values:

curl -H "Circle-Token:${CIRCLE_TOKEN}" -s "https://[REDACTED-COMPANY].net/api/v1.1/project/github/organization/project/[JOB_NUM]" | tee job-details.json

This will provide essential information like:

    • Step timing details

    • Build parameters

    • Start and completion times

    • Job history
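If you collect job details for several jobs, it may help to build the URL from its parts instead of editing it by hand. The sketch below assumes the same v1.1 URL layout as the command above; the function names job_details_url and fetch_job_details, and the per-job output file name, are hypothetical.

```shell
# Build the v1.1 job-details URL from its parts (same layout as the command above).
job_details_url() {
    local host="$1" vcs="$2" org="$3" project="$4" job_num="$5"
    printf 'https://%s/api/v1.1/project/%s/%s/%s/%s\n' \
        "$host" "$vcs" "$org" "$project" "$job_num"
}

# Fetch the job details and save them to a file named after the job number.
# Requires CIRCLE_TOKEN to be set in the environment.
fetch_job_details() {
    local url
    url=$(job_details_url "$@")
    curl -sS -H "Circle-Token: ${CIRCLE_TOKEN}" "$url" | tee "job-details-${5}.json"
}
```

Usage, with placeholder values: fetch_job_details <your-server-domain> github <organization> <project> <job-number>.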

Docker Executor Issues

If experiencing delays between job steps or infrastructure failures:

  1. Collect a support bundle immediately (see above).

  2. Get the specific job ID from the CircleCI UI or API.

  3. Check the Nomad allocation:

    kubectl exec -it $(kubectl get pods -l app=nomad-server -n circleci-server -o name | head -1) -n circleci-server -- nomad status <job-id>
  4. Critically important: Examine allocation logs for the specific job:

    kubectl exec -it $(kubectl get pods -l app=nomad-server -n circleci-server -o name | head -1) -n circleci-server -- nomad alloc logs -stderr <allocation-id>
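Steps 3 and 4 can be chained so that the allocation IDs do not have to be copied by hand. The following is a sketch under assumptions: it expects the usual nomad status layout, where an "Allocations" heading is followed by a column header row starting with "ID"; the function names alloc_ids and collect_alloc_logs and the output file names are hypothetical.

```shell
# Print allocation IDs from `nomad status <job-id>` output read on stdin.
alloc_ids() {
    awk '
        /^Allocations/               { in_allocs = 1; next }  # section heading
        in_allocs && $1 == "ID"      { next }                 # column header row
        in_allocs && /^No allocations/ { next }               # job has no allocations
        in_allocs && NF == 0         { in_allocs = 0 }        # blank line ends the section
        in_allocs                    { print $1 }
    '
}

# Save stderr logs for every allocation of a job (namespace defaults to circleci-server).
collect_alloc_logs() {
    local job_id="$1" ns="${2:-circleci-server}" pod alloc
    pod=$(kubectl get pods -l app=nomad-server -n "$ns" -o name | head -1)
    kubectl exec "$pod" -n "$ns" -- nomad status "$job_id" | alloc_ids |
    while read -r alloc; do
        kubectl exec "$pod" -n "$ns" -- nomad alloc logs -stderr "$alloc" \
            > "alloc-${alloc}-stderr.txt" 2>&1 < /dev/null
    done
}
```

Run collect_alloc_logs <job-id> and attach the resulting alloc-*-stderr.txt files to your ticket.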

For comprehensive logging of all running jobs and containers, use this script to collect detailed information:

#!/bin/bash
# Continuously snapshot Nomad job status, job stderr, and container logs into ba-logs/.
mkdir -p ba-logs
nomad_server_pod_name=$(kubectl get pods -l app=nomad-server -n circleci-server -o jsonpath='{.items[0].metadata.name}')

while :; do
    # First column of `nomad status` (minus the header row) is the job ID.
    kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad status | tail -n +2 | awk '{ print $1 }' | while read -r job; do
        date=$(date +%s)
        mkdir -p "ba-logs/${date}/${job}"
        # shellcheck disable=SC2024
        kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad status "${job}" &> "ba-logs/${date}/${job}/status.txt"
        # shellcheck disable=SC2024
        kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad logs -stderr -job "${job}" &> "ba-logs/${date}/${job}/stderr.txt"

        # Allocation rows start at line 18 of the per-job status output.
        kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad status "${job}" | tail -n +18 | awk "{ print \$1 }" | while read -r job_alloc; do
            kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad alloc exec "${job_alloc}" docker ps -a &> "ba-logs/${date}/${job}/docker-ps.txt"

            kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad alloc exec "${job_alloc}" docker ps -a | tail -n +2 | awk "{ print \$1 }" | while read -r containerid; do
                kubectl exec "$nomad_server_pod_name" -n circleci-server -- nomad alloc exec "${job_alloc}" docker logs "$containerid" &> "ba-logs/${date}/${job}/${containerid}.txt"
            done
        done
    done

    # Remove files older than one day, then any directories left empty.
    find ba-logs -type f -mtime +1 -exec rm {} \;
    find ba-logs -mindepth 1 -type d -exec bash -c 'rmdir "$1" &> /dev/null || true' shell {} \;
    echo "..."
    sleep 1
done

Note: If your installation uses a namespace other than circleci-server, replace circleci-server in the script with your namespace.

Machine Executor Issues

For issues with machine executors:

  1. Collect a support bundle immediately (within 10 minutes of the issue).

  2. Add the following step to your CircleCI configuration to capture system logs during job execution:

jobs:
  your-job-name:
    machine: true
    steps:
      # Your regular job steps here
      
      # Add this step to capture system logs
      - run:
          name: Retrieve system logs
          command: journalctl --no-pager -f
          background: true
          when: always
    This will ensure system logs are captured regardless of whether the job succeeds or fails.

  3. Check machine provisioner logs:

    kubectl logs -l app=machine-provisioner-provisioner -n circleci-server > machine-provisioner-logs.txt

  4. Look for resource constraints or network connectivity issues in the logs:

    • Disk space errors: No space left on device

    • Network timeouts: Connection timed out

    • AWS permission errors: UnauthorizedOperation

    • Resource allocation issues: Cannot allocate memory

  5. For EC2 instance issues, check AWS permission and authorization errors (see AWS Permission Issues).
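The four error strings listed above can be counted in a saved log file in one pass. This is a minimal sketch; the function name scan_log is hypothetical, and it matches the strings literally, so differently worded errors will not be counted.

```shell
# Count occurrences of the known error strings in a log file.
scan_log() {
    local file="$1" pattern count
    for pattern in "No space left on device" "Connection timed out" \
                   "UnauthorizedOperation" "Cannot allocate memory"; do
        # grep -c prints 0 and exits non-zero when nothing matches; ignore that exit code.
        count=$(grep -c -F -- "$pattern" "$file" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

For example, scan_log machine-provisioner-logs.txt prints one line per pattern with the number of matching log lines.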

AWS Permission Issues

When encountering AWS errors:

  1. Collect a support bundle immediately (within 10 minutes of error occurrence).

  2. Extract the encoded authorization message from logs.

  3. Decode the message:

    aws sts decode-authorization-message --encoded-message "<encoded_message>"
  4. Check AWS CloudTrail logs for denied actions (highly recommended). Reviewing these events in the CloudTrail console is recommended, but the following commands may also provide more information. See https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html

    # Search CloudTrail logs for errors related to the IAM role (fill in the value after AttributeValue=)
    aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue= --max-items 100

    # Filter CloudTrail for specific error events
    aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances --max-items 100

    # Check for specific errors in recent CloudTrail logs (uses GNU date syntax)
    aws cloudtrail lookup-events --start-time $(date -u -d "1 hour ago" +"%Y-%m-%dT%H:%M:%SZ") --query "Events[?contains(CloudTrailEvent, 'errorCode') || contains(CloudTrailEvent, 'errorMessage')]"
  5. Verify the relevant IAM policies for:

    • Resource access (check account IDs in ARNs)

    • Service control policies (look for explicit denies)

    • Cross-account access permissions

    • Correct region specification in resource ARNs
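Steps 2 and 3 above (extracting and decoding the authorization message) can be scripted. The sketch below assumes the log lines contain the phrase "Encoded authorization failure message:" followed by the encoded value, which is how AWS typically words these errors; the function names extract_encoded_message and decode_from_log are hypothetical.

```shell
# Read log lines on stdin and print each encoded authorization message found.
extract_encoded_message() {
    sed -n 's/.*Encoded authorization failure message: \([A-Za-z0-9_=-]*\).*/\1/p'
}

# Decode every distinct encoded message found in a log file.
# The caller needs permission for sts:DecodeAuthorizationMessage.
decode_from_log() {
    local msg
    extract_encoded_message < "$1" | sort -u | while read -r msg; do
        aws sts decode-authorization-message --encoded-message "$msg" \
            --query DecodedMessage --output text < /dev/null
    done
}
```

For example, decode_from_log machine-provisioner-logs.txt prints the decoded policy evaluation details for each distinct message in that file.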

API Connection Issues

If experiencing issues with API connections:

  1. Collect a support bundle immediately (within 10 minutes of the issue).

  2. Capture API response and request details with verbose output:

    # For curl commands, add the -vvv flag to see detailed request/response information
    curl -vvv -X POST "https://[REDACTED-COMPANY].net/api/v2/workflow/approve/[BUILD_NUM]"
  3. Check Nginx logs for API-related errors:

    kubectl logs -l app=nginx -n circleci-server --tail=500 > nginx-logs.txt
  4. Look for specific HTTP response codes and timing:

    • 404 responses might indicate the job is not yet ready.

    • 403 responses might indicate permission issues.

    • Slow responses (>1s) might indicate backend processing delays.

  5. For webhook or approval timing issues, capture timestamps of:

    • Job completion events in logs.

    • API call attempts.

    • Webhook delivery attempts.
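The response codes, the 1-second threshold, and the timestamps described above can be recorded together for each API call. This is a sketch under assumptions: the function names classify_response and probe are hypothetical, and the hints simply restate the guidance in step 4 rather than a definitive diagnosis.

```shell
# Print a hint for an HTTP status code and a response time in seconds.
classify_response() {
    local code="$1" seconds="$2"
    case "$code" in
        404) echo "404: job may not be ready yet" ;;
        403) echo "403: possible permission issue" ;;
        *)   echo "${code}: no specific hint" ;;
    esac
    # awk handles the fractional comparison that shell arithmetic cannot.
    if awk -v t="$seconds" 'BEGIN { exit !(t > 1) }'; then
        echo "slow response (${seconds}s > 1s): possible backend processing delay"
    fi
}

# Call a URL, log a UTC timestamp with the status code and total time, then classify.
# Requires CIRCLE_TOKEN to be set in the environment.
probe() {
    local url="$1" result
    result=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' \
        -H "Circle-Token: ${CIRCLE_TOKEN}" "$url")
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${url} ${result}"
    # shellcheck disable=SC2086  # intentional split into code and seconds
    classify_response $result
}
```

Note that probe sends a GET request; for the approval endpoint shown above, add -X POST to the curl call.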

Custom Integration Issues

For issues with custom integrations (GitHub Enterprise, proxy setups, etc.):

  1. Collect a support bundle immediately (within 10 minutes of the issue).

  2. For GitHub Enterprise integration issues:

    • Capture GitHub webhook delivery logs (from GitHub Enterprise UI)

    • Check TLS certificate configuration

    • Verify network connectivity between CircleCI and GitHub Enterprise

  3. For proxy integrations:

    • Collect complete request/response cycles including headers

    • Log both incoming and outgoing payloads (if possible)

    • Verify that signatures and headers are preserved through the proxy
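To check that headers survive the proxy, you can compare the header names captured on each side. The sketch below assumes you have saved the headers to two files with one "Name: value" pair per line; the function names header_names and missing_after_proxy are hypothetical. It compares names only, so a header whose value was altered will not be flagged.

```shell
# Print header names (lowercased, sorted, unique) from a file of "Name: value" lines.
header_names() {
    grep ':' "$1" | cut -d: -f1 | tr -d '\r' | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

# Print header names present before the proxy but absent after it.
missing_after_proxy() {
    local before after
    before=$(mktemp)
    after=$(mktemp)
    header_names "$1" > "$before"
    header_names "$2" > "$after"
    LC_ALL=C comm -23 "$before" "$after"   # lines unique to the first file
    rm -f "$before" "$after"
}
```

For example, missing_after_proxy headers-before.txt headers-after.txt prints nothing when every header name is preserved, and prints a name such as x-hub-signature-256 when the proxy has dropped it.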

Contacting Support

When submitting a ticket, please include:

  1. The support bundle (collected within 10 minutes of the issue).

  2. Specific error messages.

  3. Exact job ID and URL experiencing the issue.

  4. Timestamps when the issue occurred.

  5. Any relevant AWS error messages or decoded authorization messages.

  6. For job timing issues: Complete Nomad allocation logs.

  7. For machine executor issues: System journalctl logs.

  8. For API issues: Request/response details with timestamps (using the -vvv flag).
