SaaS Incident Response Standard Operating Procedure Template
Free incident response SOP template for SaaS teams. P1-P4 severity levels, on-call rotation, status page updates, and post-mortem process.
Purpose
Define how SaaS engineering and operations teams detect, respond to, communicate about, and learn from production incidents. This SOP covers the full incident lifecycle: detection via monitoring alerts, severity classification (P1-P4), war room coordination in Slack, customer communication through the status page, resolution, and blameless post-mortems that produce concrete action items tracked in Linear or Jira.
Scope
Covers all production incidents affecting customer-facing services, internal tooling outages that block customer-impacting work, and security incidents requiring immediate response. Applies to all engineering, SRE, and DevOps team members. Does not cover planned maintenance windows or non-production environment issues.
Prerequisites
- PagerDuty or OpsGenie configured with on-call rotation schedules for all engineering teams
- Monitoring and alerting set up in Datadog, Grafana, or CloudWatch with alert routing to PagerDuty
- Status page hosted on Statuspage.io or Instatus with pre-drafted incident templates
- Slack channels created: #incidents (public), #incident-{id} (per-incident war room template)
- Post-mortem template stored in Notion or Confluence
- Incident severity matrix reviewed and approved by VP of Engineering
- Runbook library for the top 10 known failure modes stored in Notion
Roles & Responsibilities
Incident Commander (IC)
- Declare the incident severity level and create the Slack war room channel
- Coordinate all response efforts — assign tasks, remove blockers, track progress
- Manage all external communication: status page updates, customer emails, and Slack posts
- Decide when to escalate severity, when to roll back, and when to declare resolution
- Schedule the post-mortem within 48 hours of resolution
On-Call Engineer
- Acknowledge the PagerDuty alert within 5 minutes
- Perform initial triage: check dashboards, logs, and recent deployments
- Communicate findings in the war room channel every 15 minutes
- Execute the fix — whether that's a rollback, config change, or hotfix
Communications Lead
- Post the first status page update within 10 minutes of P1/P2 declaration
- Update the status page every 30 minutes during active incidents
- Draft customer-facing emails for P1 incidents affecting enterprise accounts
- Post the final resolution update with a summary of what happened
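The status page cadence above can be partially automated. The sketch below builds the request body for a status page incident update; the nested `incident` object and its field names mirror Statuspage's REST API convention, but treat the exact shape as an assumption and verify against your provider's current documentation before wiring it into tooling.

```python
# Hypothetical builder for a Statuspage-style incident update payload.
# The "incident" wrapper and field names follow Statuspage's convention;
# confirm against your provider's API reference before use.

VALID_STATUSES = {"investigating", "identified", "monitoring", "resolved"}

def build_incident_update(name: str, status: str, body: str) -> dict:
    """Build the JSON body for one status page incident update."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown incident status: {status!r}")
    return {"incident": {"name": name, "status": status, "body": body}}
```

Rejecting unknown statuses up front keeps an automated updater from posting a malformed state during the 30-minute update cycle.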
Engineering Manager
- Join the war room for all P1 incidents and any P2 lasting over 1 hour
- Authorize emergency changes that bypass the normal deployment process
- Own the post-mortem document and ensure action items are completed within 14 days
Procedure
An incident starts when a monitoring alert fires in PagerDuty or a customer report triggers an S1 support escalation. The on-call engineer acknowledges the PagerDuty alert within 5 minutes; if there is no acknowledgment within 5 minutes, PagerDuty auto-escalates to the secondary on-call engineer. Acknowledgment means you're actively investigating, not that you've fixed anything.
1. Receive the PagerDuty/OpsGenie alert via push notification, SMS, or phone call
2. Acknowledge the alert in PagerDuty within 5 minutes
3. Open the monitoring dashboard (Datadog/Grafana) and check the affected service
4. Check #incidents in Slack to see if this is already a known issue
Key Performance Indicators
- Mean time to acknowledge (MTTA): under 5 minutes for P1/P2
- Mean time to resolve (MTTR) for P1: under 1 hour
- Mean time to resolve (MTTR) for P2: under 4 hours
- Post-mortem completion rate: 100% for P1/P2 incidents within 48 hours
- Post-mortem action item closure rate: 90% within 14 days
- Repeat incident rate (same root cause): under 5% within 90 days
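Tracking these KPIs only requires three timestamps per incident. The snippet below computes MTTA and per-severity MTTR from a list of incident records; the field names (`fired_at`, `acked_at`, `resolved_at`, `severity`) are illustrative placeholders, so adapt them to however your incident tracker exports data.

```python
from datetime import datetime
from statistics import mean

def mtta_minutes(incidents: list[dict]) -> float:
    """Mean time to acknowledge across all incidents, in minutes."""
    return mean(
        (i["acked_at"] - i["fired_at"]).total_seconds() / 60 for i in incidents
    )

def mttr_minutes(incidents: list[dict], severity: str) -> float:
    """Mean time to resolve for one severity level, in minutes."""
    matching = [i for i in incidents if i["severity"] == severity]
    return mean(
        (i["resolved_at"] - i["fired_at"]).total_seconds() / 60
        for i in matching
    )
```

Computing MTTR per severity level matters because the P1 and P2 targets differ (1 hour versus 4 hours); a blended average would hide a slow P1 response behind fast P3/P4 fixes.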
Why This Matters for SaaS
Every minute of downtime costs SaaS companies revenue, reputation, and customer trust. For a product doing $10M ARR, a 1-hour P1 outage costs roughly $1,100 in lost revenue alone ($10M spread over 8,760 hours in a year) — and significantly more in customer confidence. But the financial damage from the outage itself is usually smaller than the damage from a botched response. Customers forgive downtime when you communicate quickly and honestly and follow up with a real post-mortem. They don't forgive silence, finger-pointing, or the same outage happening twice because no one tracked the action items from the first post-mortem.
Common Mistakes
- Not updating the status page until 30+ minutes into a P1 — by then, customers have already found out from Twitter or their own monitoring
- Skipping the Incident Commander role and letting everyone investigate in parallel with no coordination, duplicating work
- Writing post-mortems that identify root cause but produce zero action items, guaranteeing the same incident happens again
- Applying hotfixes directly to production servers instead of through the CI/CD pipeline, creating configuration drift
- Classifying everything as P3 to avoid paging the IC, which means real P1s get a slow response
- Holding post-mortems 2 weeks later when everyone has forgotten the details
SaaS-Specific Notes
SaaS incident response has compliance implications. SOC 2 Type II audits review your incident response process and verify that post-mortems are conducted and action items are tracked. If your product processes EU personal data, GDPR Article 33 requires you to notify your supervisory authority within 72 hours of a personal data breach — not all incidents, but your process needs a checkpoint to determine if a breach occurred. Your status page is a contractual obligation for many enterprise customers. If your MSA promises 99.9% uptime, your status page is the official record. Keep it accurate.
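The GDPR Article 33 checkpoint above is just deadline arithmetic: the notification window runs 72 hours from when you become aware of a personal data breach. A minimal helper for surfacing that deadline in a war room (the function name is illustrative, not from any library):

```python
from datetime import datetime, timedelta, timezone

# GDPR Article 33: notify the supervisory authority no later than
# 72 hours after becoming aware of a personal data breach.
GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)

def breach_notification_deadline(confirmed_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority, given when
    the team confirmed the incident involves a personal data breach."""
    return confirmed_at + GDPR_NOTIFICATION_WINDOW
```

Note that the clock starts at awareness of the breach, not at resolution — which is why the incident process needs an explicit breach-determination checkpoint early in the response rather than during the post-mortem.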
Learn More About Incident Response
For a deeper look at building incident response documentation, see our complete guide.