SaaS Incident Response Standard Operating Procedure Template
Free incident response SOP template for SaaS teams. P1-P4 severity levels, on-call rotation, status page updates, and post-mortem process.
Purpose
Define how SaaS engineering and operations teams detect, respond to, communicate about, and learn from production incidents. This SOP covers the full incident lifecycle: detection via monitoring alerts, severity classification (P1-P4), war room coordination in Slack, customer communication through the status page, resolution, and blameless post-mortems that produce concrete action items tracked in Linear or Jira.
Scope
Covers all production incidents affecting customer-facing services, internal tooling outages that block customer-impacting work, and security incidents requiring immediate response. Applies to all engineering, SRE, and DevOps team members. Does not cover planned maintenance windows or non-production environment issues.
Prerequisites
- PagerDuty or OpsGenie configured with on-call rotation schedules for all engineering teams
- Monitoring and alerting set up in Datadog, Grafana, or CloudWatch with alert routing to PagerDuty
- Status page hosted on Statuspage.io or Instatus with pre-drafted incident templates
- Slack channels created: #incidents (public), #incident-{id} (per-incident war room template)
- Post-mortem template stored in Notion or Confluence
- Incident severity matrix reviewed and approved by VP of Engineering
- Runbook library for the top 10 known failure modes stored in Notion
Roles & Responsibilities
Incident Commander (IC)
- Declare the incident severity level and create the Slack war room channel
- Coordinate all response efforts — assign tasks, remove blockers, track progress
- Manage all external communication: status page updates, customer emails, and Slack posts
- Decide when to escalate severity, when to roll back, and when to declare resolution
- Schedule the post-mortem within 48 hours of resolution
On-Call Engineer
- Acknowledge the PagerDuty alert within 5 minutes
- Perform initial triage: check dashboards, logs, and recent deployments
- Communicate findings in the war room channel every 15 minutes
- Execute the fix — whether that's a rollback, config change, or hotfix
Communications Lead
- Post the first status page update within 10 minutes of P1/P2 declaration
- Update the status page every 30 minutes during active incidents
- Draft customer-facing emails for P1 incidents affecting enterprise accounts
- Post the final resolution update with a summary of what happened
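The status page cadence above can be partially automated. The sketch below builds the request body for a status page incident update; the nested `incident` object and its field names mirror Statuspage's REST API convention, but treat the exact shape as an assumption and verify against your provider's current documentation before wiring it into tooling.

```python
# Hypothetical builder for a Statuspage-style incident update payload.
# The "incident" wrapper and field names follow Statuspage's convention;
# confirm against your provider's API reference before use.

VALID_STATUSES = {"investigating", "identified", "monitoring", "resolved"}

def build_incident_update(name: str, status: str, body: str) -> dict:
    """Build the JSON body for one status page incident update."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown incident status: {status!r}")
    return {"incident": {"name": name, "status": status, "body": body}}
```

Rejecting unknown statuses up front keeps an automated updater from posting a malformed state during the 30-minute update cycle.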
Engineering Manager
- Join the war room for all P1 incidents and any P2 lasting over 1 hour
- Authorize emergency changes that bypass the normal deployment process
- Own the post-mortem document and ensure action items are completed within 14 days
Procedure
An incident starts when a monitoring alert fires in PagerDuty or a customer report triggers an S1 support escalation. The on-call engineer acknowledges the PagerDuty alert within 5 minutes; if there is no acknowledgment within 5 minutes, PagerDuty auto-escalates to the secondary on-call engineer. Acknowledgment means you're actively investigating, not that you've fixed anything.
1. Receive the PagerDuty/OpsGenie alert via push notification, SMS, or phone call
2. Acknowledge the alert in PagerDuty within 5 minutes
3. Open the monitoring dashboard (Datadog/Grafana) and check the affected service
4. Check #incidents in Slack to see if this is already a known issue
Key Performance Indicators
- Mean time to acknowledge (MTTA): under 5 minutes for P1/P2
- Mean time to resolve (MTTR) for P1: under 1 hour
- Mean time to resolve (MTTR) for P2: under 4 hours
- Post-mortem completion rate: 100% for P1/P2 incidents within 48 hours
- Post-mortem action item closure rate: 90% within 14 days
- Repeat incident rate (same root cause): under 5% within 90 days
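Tracking these KPIs only requires three timestamps per incident. The snippet below computes MTTA and per-severity MTTR from a list of incident records; the field names (`fired_at`, `acked_at`, `resolved_at`, `severity`) are illustrative placeholders, so adapt them to however your incident tracker exports data.

```python
from datetime import datetime
from statistics import mean

def mtta_minutes(incidents: list[dict]) -> float:
    """Mean time to acknowledge across all incidents, in minutes."""
    return mean(
        (i["acked_at"] - i["fired_at"]).total_seconds() / 60 for i in incidents
    )

def mttr_minutes(incidents: list[dict], severity: str) -> float:
    """Mean time to resolve for one severity level, in minutes."""
    matching = [i for i in incidents if i["severity"] == severity]
    return mean(
        (i["resolved_at"] - i["fired_at"]).total_seconds() / 60
        for i in matching
    )
```

Computing MTTR per severity level matters because the P1 and P2 targets differ (1 hour versus 4 hours); a blended average would hide a slow P1 response behind fast P3/P4 fixes.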
Why This Matters for SaaS
Every minute of downtime costs SaaS companies revenue, reputation, and customer trust. For a product doing $10M ARR, a 1-hour P1 outage costs roughly $1,100 in lost revenue alone ($10M spread over 8,760 hours in a year) — and significantly more in customer confidence. But the financial damage from the outage itself is usually smaller than the damage from a botched response. Customers forgive downtime when you communicate quickly and honestly and follow up with a real post-mortem. They don't forgive silence, finger-pointing, or the same outage happening twice because no one tracked the action items from the first post-mortem.
Common Mistakes
- Not updating the status page until 30+ minutes into a P1 — by then, customers have already found out from Twitter or their own monitoring
- Skipping the Incident Commander role and letting everyone investigate in parallel with no coordination, duplicating work
- Writing post-mortems that identify root cause but produce zero action items, guaranteeing the same incident happens again
- Applying hotfixes directly to production servers instead of through the CI/CD pipeline, creating configuration drift
- Classifying everything as P3 to avoid paging the IC, which means real P1s get a slow response
- Holding post-mortems 2 weeks later when everyone has forgotten the details
SaaS-Specific Notes
SaaS incident response has compliance implications. SOC 2 Type II audits review your incident response process and verify that post-mortems are conducted and action items are tracked. If your product processes EU personal data, GDPR Article 33 requires you to notify your supervisory authority within 72 hours of a personal data breach — not all incidents, but your process needs a checkpoint to determine if a breach occurred. Your status page is a contractual obligation for many enterprise customers. If your MSA promises 99.9% uptime, your status page is the official record. Keep it accurate.
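The GDPR Article 33 checkpoint above is just deadline arithmetic: the notification window runs 72 hours from when you become aware of a personal data breach. A minimal helper for surfacing that deadline in a war room (the function name is illustrative, not from any library):

```python
from datetime import datetime, timedelta, timezone

# GDPR Article 33: notify the supervisory authority no later than
# 72 hours after becoming aware of a personal data breach.
GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)

def breach_notification_deadline(confirmed_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority, given when
    the team confirmed the incident involves a personal data breach."""
    return confirmed_at + GDPR_NOTIFICATION_WINDOW
```

Note that the clock starts at awareness of the breach, not at resolution — which is why the incident process needs an explicit breach-determination checkpoint early in the response rather than during the post-mortem.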
Learn More About Incident Response
For a deeper look at building incident response documentation, see our complete guide.