What Are Infrastructure Failures?
Infrastructure failures occur when a job cannot complete due to issues with the underlying execution environment rather than problems in your code or configuration. CircleCI automatically distinguishes these from user errors to provide better diagnostics and enable automatic recovery through retries.
Infrastructure Failure vs. User Error
Infrastructure Failures:
Problems with the execution environment (container, VM, or runner)
Network connectivity issues during job setup
Resource allocation or provisioning failures
Communication loss between CircleCI and the execution environment
User Errors:
Test failures or build errors in your code
Misconfigured commands or invalid syntax
Missing dependencies or incorrect environment variables
Out-of-memory errors from your application
How CircleCI Detects Infrastructure Failures
CircleCI uses four primary detection mechanisms:
Failure Type | Description | What You'll See | Typical Cause |
Startup Timeout | Job doesn't start within 10 minutes | "Job failed to start within the expected timeframe" | Resource constraints, provisioning delays |
Heartbeat Timeout | No communication for 5+ minutes | "Lost connection to job executor" | Network issues, executor crash |
Task Infrastructure Failure | Executor reports infrastructure problem | "Infrastructure error detected during job execution" | Container/VM issues, disk failures |
Explicit Infrastructure Failure | Direct failure signal from executor | Specific error message from the system | Critical system errors requiring immediate failure |
Executor-Specific Behaviors
Docker Executor
Common Infrastructure Failures:
Container fails to start or pull image
Docker daemon becomes unresponsive
Network connectivity lost during execution
Example Scenario:
# Your job configuration
docker:
- image: node:14
If the Docker daemon cannot start this container within 10 minutes, it triggers a startup timeout infrastructure failure.
Machine Executor (Linux/Windows)
Common Infrastructure Failures:
VM fails to provision or boot
Network configuration fails during setup
Machine becomes unresponsive after starting
Example Scenario:
machine:
image: ubuntu-2204:current
A machine executor job that loses network connectivity after provisioning will be marked as an infrastructure failure.
Remote Docker
Common Infrastructure Failures:
Cannot establish connection to remote Docker environment
Docker layer pull timeouts
Remote Docker daemon crashes
Example Scenario: When using setup_remote_docker, if the remote Docker environment fails to become available, the job fails with an infrastructure error.
Self-Hosted Runners (Container & Machine)
Common Infrastructure Failures:
Runner goes offline during job execution
Insufficient resources on self-hosted infrastructure
Network connectivity issues between runner and CircleCI
Important: For self-hosted runners, infrastructure failures often indicate issues with your infrastructure rather than CircleCI's platform. Check your runner logs and resource availability.
Troubleshooting Guide: Self-Hosted Runner Troubleshooting
Automatic Retry Behavior
How Retries Work
Infrastructure failures trigger automatic retries with these characteristics:
Retry Limit: Maximum of 2 automatic retries (3 total attempts, including the original)
Smart Retry Logic: Only retries when safe and likely to succeed
When Retries Occur
Always Retried (when under limit):
Startup timeouts on cloud executors
Pre-user-step failures (before your commands run)
Conditionally Retried:
Heartbeat timeouts (only if user steps haven't started)
Task infrastructure failures (only if user steps haven't started)
Never Retried:
Self-hosted runner failures (requires manual intervention)
Explicit infrastructure failures with critical errors
After 2 retry attempts have been exhausted
Common Real-World Scenarios
Scenario 1: Docker Image Pull Timeout
What happens: Large Docker image takes too long to download Detection: Startup timeout after 10 minutes
Result: Automatic retry with potential success on second attempt
Scenario 2: Spot Instance Termination
NOTE: This only applies to self-hosted runner customers
What happens: AWS spot instance terminated during job Detection: Heartbeat timeout Result: Automatic retry on different instance
Scenario 3: Runner Out of Disk Space
What happens: Self-hosted runner runs out of disk space Detection: Task infrastructure failure
Result: Job marked as failed, manual intervention required
Scenario 4: Network Partition
What happens: Temporary network issue during job execution Detection: Heartbeat timeout if user steps haven't started
Result: Automatic retry once connectivity restored
Best Practices
Minimize Infrastructure Failures
For Cloud Executors:
Use smaller Docker images when possible
Implement proper timeout configurations
Monitor job duration trends
For Self-Hosted Runners:
Ensure adequate resources (CPU, memory, disk)
Implement monitoring and alerting
Keep runners updated to latest versions
Maintain stable network connectivity
Configure Retry Settings
version: 2.1
workflows:
my-workflow:
max_auto_reruns: 3
jobs:
- build
- test
- deploy:
requires:
- build
- test
What You See in the UI
When an infrastructure failure occurs:
Job Status: Shows as INFRASTRUCTURE_FAILURE
Error Message: Clear description of the failure type
Retry Indicator: Shows retry attempts (e.g., "Attempt 2 of 3")
Getting Help
If you experience frequent infrastructure failures:
For Cloud Executors: Contact CircleCI Support with job URLs
For Self-Hosted Runners: Check runner logs and system resources first
Review: Status Page for platform-wide issues