[SERVER] CircleCI Server Incident Response: What to Expect

Overview

At CircleCI, we take Server incidents seriously and have a dedicated process to ensure rapid response and resolution when critical issues impact your operations. This guide explains what you can expect when you experience a P0 (Priority 0) incident.

What Qualifies as a P0 Incident?

A P0 incident is declared when you experience:

  • Fatal system failure (server down or unresponsive)

  • Builds not running

  • Outage impacting critical system operations

  • Security breach

  • Expired license preventing operations
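
For teams that want to triage internally before opening a ticket, the criteria above could be encoded as a simple checklist. The following is a hypothetical sketch, not a CircleCI tool: the `Symptom` labels, the non-P0 `DEGRADED_PERFORMANCE` example, and the `qualifies_as_p0` function are assumptions made for illustration. Support still confirms the severity after reviewing your ticket.

```python
from enum import Enum, auto


class Symptom(Enum):
    """Hypothetical labels; the first five mirror the P0 criteria listed above."""
    FATAL_SYSTEM_FAILURE = auto()   # server down or unresponsive
    BUILDS_NOT_RUNNING = auto()
    CRITICAL_OUTAGE = auto()        # outage impacting critical system operations
    SECURITY_BREACH = auto()
    EXPIRED_LICENSE = auto()        # expired license preventing operations
    DEGRADED_PERFORMANCE = auto()   # assumed example of a symptom NOT in the P0 list


# Symptoms that the article lists as grounds for a P0 incident.
P0_SYMPTOMS = frozenset({
    Symptom.FATAL_SYSTEM_FAILURE,
    Symptom.BUILDS_NOT_RUNNING,
    Symptom.CRITICAL_OUTAGE,
    Symptom.SECURITY_BREACH,
    Symptom.EXPIRED_LICENSE,
})


def qualifies_as_p0(observed):
    """Return True if any observed symptom matches a listed P0 criterion."""
    return any(symptom in P0_SYMPTOMS for symptom in observed)


if __name__ == "__main__":
    print(qualifies_as_p0([Symptom.BUILDS_NOT_RUNNING]))    # True
    print(qualifies_as_p0([Symptom.DEGRADED_PERFORMANCE]))  # False
```

A single matching symptom is enough in this sketch, since the list above reads as "any of the following".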

How to Report an Incident

  1. Submit a Zendesk ticket and mark it as P0/Urgent

  2. Provide a support bundle from your server installation

  3. If possible, run Reality Check before submitting the ticket

Our Support team will review your ticket to confirm the severity and rule out external factors like cloud provider outages.

What Happens Next?

Step 1: Initial Response

  • A Support Engineer will reach out and start a Zoom call with you

  • They'll perform basic checks to understand the scope and impact

  • If engineering escalation is needed, the appropriate engineers will be called in

Step 2: Incident Declaration

  • An Incident Commander will be called in, usually a CircleCI Engineering Manager

  • Members of the appropriate engineering team will be added to the Zoom call as needed

Step 3: Active Response

You'll work directly with:

  • Support Engineer: Your primary point of contact, who keeps you informed with regular updates

  • Incident Commander: Coordinates the technical response and ensures the right resources are engaged

  • Engineering Response Team: One or more engineers from the team that owns the affected service, who join to investigate and resolve the issue as necessary

Step 4: Resolution

The team works continuously until:

  • The issue is resolved, OR

  • We mutually agree to pause and resume at a scheduled time

During the Incident

What you can expect:

  • Updates every 30 minutes on progress and current state

  • Direct access to engineering resources via Zoom

  • Clear communication about what's being tried and what we're learning about your situation

  • Coordination with your Field Engineer if necessary

What we need from you:

  • Access to logs, metrics, and system information

  • Details about recent changes to your environment

  • Availability of team members who can provide context or make necessary changes

After Resolution

Within 7 days of resolution, you'll receive:

  • A detailed Root Cause Analysis (RCA) document explaining what happened

  • Specific corrective actions we're taking to prevent similar incidents

  • Recommendations for your environment if applicable
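
If you track follow-ups on your side, the RCA window can be computed from the resolution time. This is a minimal sketch under stated assumptions: it treats "7 days" as calendar days counted from the resolution timestamp, which the text above does not specify, and `rca_due_by` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 7-day window is measured in calendar days.
RCA_WINDOW = timedelta(days=7)


def rca_due_by(resolved_at: datetime) -> datetime:
    """Return the latest time by which the RCA document is expected."""
    return resolved_at + RCA_WINDOW


if __name__ == "__main__":
    resolved = datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)
    print(rca_due_by(resolved).isoformat())  # 2024-03-08T14:30:00+00:00
```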

We also conduct internal Post Incident Reviews to continuously improve our response process and product reliability.

Important Notes

  • Support bundles are critical: These contain the diagnostic information we need to troubleshoot quickly

  • Zoom calls: We'll record the incident response Zoom call for our internal documentation

  • No after-hours delays: If your incident occurs outside business hours, our on-call team will respond

Before You Need Us

To prepare for potential incidents:

Questions?

If you have questions about our incident response process or want to discuss your specific environment, please reach out to your Technical Success Manager or Field Engineer.
