Fault Tolerance & Resilience Planner

Before going to production with any system that must meet availability commitments.

ChatGPT · Claude · Gemini · Advanced · ~1750 tokens
Curated by the AIPP team
Last updated 14 May 2026 · v3
fault-tolerance-resilience-planner-4.md · 1750 words
You are a senior {{role}} brought in to help a developer or tech professional complete a {{use_case}} task.

# Context
- Pack: Developers & Tech Professionals
- Category: System Design & Architecture
- Use case: Fault Tolerance & Resilience Planner
- Source task:
  - Design a fault tolerance and resilience strategy for {{describe_the_system}}. The most critical failure scenarios to handle: {{list_3_5}}.
  - Step 1: risk matrix: probability and impact of each failure scenario.
  - Step 2: for each scenario, recommend the resilience pattern (circuit breaker, retry with backoff, fallback, bulkhead, timeout).
  - Step 3: define the SLA targets (availability, RTO, RPO).
  - Step 4: write runbooks for the top 2 failure scenarios.
  - Step 5: design chaos engineering experiments to validate the strategy.

# Goal
Risk matrix, resilience patterns per failure mode, SLA targets, 2 runbooks, and a chaos engineering experiment plan.

# Constraints
- Think like an expert advisor before writing the final output.
- Ask clarifying questions only if missing information would materially change the result.
- Avoid generic filler, vague advice, and unsupported claims.
- Make the output specific, practical, and ready to use.

# Output
Risk matrix, resilience patterns per failure mode, SLA targets, 2 runbooks, and a chaos engineering experiment plan.
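
To illustrate what Step 1 of the prompt asks for (this is not part of the prompt text itself), a risk matrix can be sketched as a probability × impact score per failure scenario. The 1 to 5 scales, the `Scenario` class, and the ratings below are assumptions made for the example, not values from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: int  # 1 (rare) to 5 (frequent); the scale is an assumption
    impact: int       # 1 (minor) to 5 (severe); the scale is an assumption

    @property
    def score(self) -> int:
        # Simple probability x impact scoring.
        return self.probability * self.impact


def rank(scenarios: list[Scenario]) -> list[Scenario]:
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Illustrative ratings only; real values should come from incident history.
scenarios = [
    Scenario("database outage", probability=2, impact=5),
    Scenario("third-party API failure", probability=4, impact=3),
    Scenario("region-wide cloud failure", probability=1, impact=5),
]

for s in rank(scenarios):
    print(f"{s.name}: {s.score}")
```

The two highest-scoring scenarios are natural candidates for the runbooks requested in Step 4.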

The variables to fill in

| Placeholder | What to put there | Example |
| --- | --- | --- |
| {{role}} | Role | site reliability engineer |
| {{use_case}} | Your specific value | fault tolerance & resilience planner |
| {{describe_the_system}} | Describe the system | a multi-tenant SaaS analytics dashboard |
| {{list_3_5}} | List 3-5 failure scenarios | database outage, third-party API failure, region-wide cloud failure |

How to customize this prompt

  1. Replace each {{double-curly}} with your real context.
  2. Adjust the constraints section to match your tone: formal, casual, or blunt.
  3. If the engagement is recurring and you add a duration line, express it in milestones rather than days.
  4. Run it in your tool of choice. The output should be ready to paste with at most one small edit.
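
If you fill the placeholders programmatically rather than by hand, step 1 can be sketched as below. The `fill` helper is hypothetical (it is not part of any tool named on this page), and it assumes placeholder names contain only letters, digits, and underscores.

```python
import re

# Matches {{name}} placeholders; the name format is an assumption.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder; raise if any value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"no value supplied for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


template = (
    "Design a fault tolerance and resilience strategy for "
    "{{describe_the_system}}. The most critical failure scenarios "
    "to handle: {{list_3_5}}."
)

prompt = fill(
    template,
    {
        "describe_the_system": "a multi-tenant SaaS analytics dashboard",
        "list_3_5": "database outage, third-party API failure, "
                    "region-wide cloud failure",
    },
)
print(prompt)
```

Failing loudly on a missing value avoids sending a prompt that still contains a raw `{{placeholder}}`.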

When to use

Before going to production with any system that must meet availability commitments.
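
To make an availability commitment concrete, it helps to convert the percentage into a downtime budget. The following is a minimal sketch; the function name and the 30-day period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted in the period for an availability target."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be greater than 0 and at most 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of total downtime, which is the budget every failure scenario has to share.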

PRO TIP

Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) with stakeholders before designing, because they drive every resilience decision.
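
As a rough illustration of how agreed targets drive design, the sketch below compares RTO and RPO against a backup schedule and a measured restore time. `RecoveryTargets`, `find_gaps`, and all numbers are hypothetical; it also assumes worst-case data loss is about one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def find_gaps(
    targets: RecoveryTargets,
    backup_interval: timedelta,
    measured_restore: timedelta,
) -> list[str]:
    """Return a description of each target the current setup cannot meet."""
    problems = []
    # Assumption: worst-case data loss is roughly one full backup interval.
    if backup_interval > targets.rpo:
        problems.append(
            f"RPO at risk: backups every {backup_interval}, RPO is {targets.rpo}"
        )
    if measured_restore > targets.rto:
        problems.append(
            f"RTO at risk: restore took {measured_restore}, RTO is {targets.rto}"
        )
    return problems


# Illustrative numbers only.
targets = RecoveryTargets(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
issues = find_gaps(
    targets,
    backup_interval=timedelta(hours=1),
    measured_restore=timedelta(minutes=40),
)
for issue in issues:
    print(issue)
```

Here the hourly backup cannot satisfy a 15-minute RPO, so the design would need more frequent backups or continuous replication before any other pattern is chosen.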
