OODA & System Design Philosophy
Why system-level OODA prompting is essential for reliable technical workflows.
The Problem with General AI Models
Agreeableness Over Accuracy
General-purpose models like ChatGPT and Claude are optimized for user satisfaction, not technical correctness. This optimization creates a fundamental tension in professional workflows.
🤝 The Agreeableness Problem: These models prioritize appearing helpful over being correct, leading them to confidently generate plausible-sounding but incorrect solutions. They rarely push back on flawed premises or admit uncertainty, instead defaulting to whatever response seems most likely to satisfy the user in the moment.
❌ Reality Testing Failures:
# Example: User asks for impossible task
User: "Generate terraform to deploy Lambda to on-premises server"
ChatGPT: "Here's terraform code to deploy Lambda on-premises..."
# (Proceeds to generate nonsense - Lambda only runs on AWS)
# With OODA system prompt:
Ona Terminal: "OBSERVE: Lambda is AWS-only service.
ORIENT: Request conflicts with Lambda constraints.
DECIDE: Suggest alternatives (containers, OpenFaaS).
ACT: Provide correct on-premises serverless options."
Why Technical Workflows Fail
🎯 One-Shot Accuracy Problem: General models struggle with technical reliability because they can’t produce correct solutions consistently on the first attempt. They lack systematic verification steps and built-in reality checks, having been optimized for conversational flow rather than executable accuracy.
📊 Real-World Failure Rates: The numbers tell the story clearly. Infrastructure-as-Code generated by these models requires significant fixes 65-80% of the time. Complex SQL queries contain logical errors in 70-85% of cases. System architecture proposals violate established best practices 80-90% of the time, while production scripts miss critical edge cases in 75-85% of implementations.
The OODA Solution
Enforced Systematic Thinking
OODA (Observe-Orient-Decide-Act) forces models into systematic thinking by requiring them to gather actual facts before responding, analyze constraints and context thoroughly, evaluate options against real-world limitations, and execute solutions with built-in verification steps. This structured approach prevents the rushed, assumption-heavy responses that plague general models.
System Prompt Architecture
# OODA System Prompt Structure (from ooda.md)
## Output Contract (strict order; missing/extra/out-of-order = invalid)
<requirements>Task in your words; assumptions & unknowns.</requirements>
<observe>What you checked, what's available vs missing, anomalies found (≤6 bullets).</observe>
<orient>Concise analysis; exactly 2 material risks/limitations and how to test/mitigate them.</orient>
<decide>Recommended plan (≤6 steps) with success criteria (quantified). Note options considered.</decide>
<act>"PENDING-APPROVAL" or executed steps + results summary.</act>
<checklist>{"Followed_SOP":true,"Avoided_Sycophancy":true,"Citations":["doc|status|inference"],"Confidence":0.0-1.0}</checklist>
## OODA Rules
1) Follow OBSERVE → ORIENT → DECIDE → (await user approval) → ACT.
2) Create artifacts: insights-YYYYMMDD_HHMMSSZ.md and plan-YYYYMMDD_HHMMSSZ.md
3) After plan creation, prompt: "Proceed / Modify / Alternate / More analysis?" and wait.
¤¤IMMUTABLE¤¤ FINAL ORDER: requirements → observe → orient → decide → act → checklist.
Two risks required. Missing/out-of-order tags → invalid; regenerate. ¤¤END¤¤
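One practical benefit of this contract is that it is machine-checkable. The sketch below is an illustration of that idea, not part of ooda.md: the tag names, the two-risk rule, and the confidence range come from the contract above, while the bullet convention for risks and everything else is assumed.

```python
import json
import re

# Tags required by the output contract, in the mandated order.
REQUIRED_TAGS = ["requirements", "observe", "orient", "decide", "act", "checklist"]

def validate_ooda_response(text: str) -> list[str]:
    """Return a list of contract violations; an empty list means the response is valid."""
    problems = []

    # 1. Every tag must be present and appear in the mandated order.
    positions = []
    for tag in REQUIRED_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if not match:
            problems.append(f"missing <{tag}> section")
        else:
            positions.append((match.start(), tag))
    if positions != sorted(positions):
        problems.append("sections are out of order")

    # 2. <orient> must name exactly two material risks (assumed convention:
    #    one risk per "-" bullet).
    orient = re.search(r"<orient>(.*?)</orient>", text, re.DOTALL)
    if orient:
        risks = [ln for ln in orient.group(1).splitlines() if ln.strip().startswith("-")]
        if len(risks) != 2:
            problems.append(f"<orient> lists {len(risks)} risks, expected exactly 2")

    # 3. <checklist> must be parseable JSON with a confidence in [0, 1].
    checklist = re.search(r"<checklist>(.*?)</checklist>", text, re.DOTALL)
    if checklist:
        try:
            data = json.loads(checklist.group(1))
            if not 0.0 <= float(data.get("Confidence", -1)) <= 1.0:
                problems.append("Confidence must be between 0.0 and 1.0")
        except (json.JSONDecodeError, ValueError, TypeError):
            problems.append("<checklist> is not valid JSON")

    return problems
```

A thin wrapper around the model call can regenerate the response whenever this returns a non-empty list, which is what the regenerate clause in the immutable block above asks for.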
Why System-Level Enforcement Matters
🔒 Can’t Be Overridden: System-level enforcement means user prompts can’t accidentally bypass the OODA structure. Every response must complete all phases, preventing those “helpful” but dangerously wrong answers that general models are prone to generating.
🎯 Creates Accountability: Each OODA phase produces auditable artifacts, making decisions traceable and explainable. When something goes wrong, failures can be debugged systematically by examining which phase broke down and why.
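As a concrete illustration of those artifacts, rule 2 of the prompt above names timestamped insights and plan files. A helper along these lines could produce them; the filename pattern comes from ooda.md, while the helper itself is an assumed sketch.

```python
from datetime import datetime, timezone
from pathlib import Path

def write_ooda_artifact(kind: str, body: str, directory: str = ".") -> Path:
    """Persist an OODA artifact (e.g. 'insights' or 'plan') using the
    timestamped name required by rule 2: <kind>-YYYYMMDD_HHMMSSZ.md."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%SZ")
    path = Path(directory) / f"{kind}-{stamp}.md"
    path.write_text(body, encoding="utf-8")
    return path

# Example: record the ORIENT analysis so the decision trail is auditable later.
# write_ooda_artifact("insights", "## Observations\n- Lambda is AWS-only...\n")
```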
The Case for Domain-Specific Models
Industry-Specific Success Stories
BloombergGPT: Finance Domain Mastery
Bloomberg’s 50B parameter model, trained on 363B tokens of financial data, outperforms GPT-3 by 50-70% on financial NLP tasks. It excels at sentiment analysis, named entity recognition, and news classification because it understands the nuanced language of finance. The key insight is clear: domain-specific training consistently beats general intelligence for professional applications.
Med-PaLM 2: Medical Expertise
Google’s Med-PaLM 2 scores over 85% on the USMLE medical licensing exam, while general models struggle to reach 50-60%. The critical difference lies in training data: Med-PaLM learned from medical literature and clinical guidelines, not general internet content.
CodeLlama: Programming Proficiency
Meta’s CodeLlama performs 2-3x better at code generation than the general Llama model it’s based on. It understands language-specific idioms and architectural patterns that general models miss. The lesson is consistent: specialization trumps raw model size for domain expertise.
Why Specialization Wins
1. Correct Terminology
Domain models get the language right. BloombergGPT knows the difference between “basis points” and “percentage points” – a distinction that general models routinely confuse (a short worked example follows this list). This precision in terminology reflects a deeper understanding of the context-specific meanings that professionals rely on.
2. Industry Constraints
Specialized models understand the regulatory and technical boundaries of their domains. Financial models know SEC requirements, medical models understand FDA approval processes, and infrastructure models respect cloud service quotas and regional limitations.
3. Professional Standards
Domain expertise includes knowing how work gets evaluated. Legal models cite actual statutes with proper precedent, engineering models follow established safety standards, and accounting models apply GAAP principles correctly – not just approximately.
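To make the basis-point distinction from item 1 concrete, here is a small worked example; the rate figures are invented for illustration.

```python
# A basis point is one hundredth of a percentage point.
BPS_PER_PERCENTAGE_POINT = 100

def bps_to_percentage_points(bps: float) -> float:
    """Convert basis points to percentage points."""
    return bps / BPS_PER_PERCENTAGE_POINT

# Made-up example: a rate moving from 5.00% to 5.25% rose by
# 0.25 percentage points, i.e. 25 basis points, not "25 percent".
old_rate, new_rate = 5.00, 5.25
move_in_points = new_rate - old_rate                       # 0.25 percentage points
move_in_bps = move_in_points * BPS_PER_PERCENTAGE_POINT    # 25 bps
print(f"{move_in_bps:.0f} bps = {move_in_points:.2f} percentage points")
```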
The Benchmark Problem
Why LLM Benchmarks Mislead
Popular Benchmarks Test Wrong Things:
Most LLM benchmarks evaluate capabilities that have little bearing on professional performance. MMLU tests general knowledge trivia when professionals need deep domain expertise. Knowing world capitals doesn’t help you design distributed systems. HumanEval measures performance on toy programming problems, but production code requires architectural thinking, not algorithmic puzzles. Solving fizzbuzz tells you nothing about building scalable microservices.
HellaSwag tests “common sense” by having models predict story endings, but technical work demands precision over creativity. The ability to guess plot twists has no correlation with troubleshooting production infrastructure failures.
Real Professional Requirements
What Benchmarks Miss:
The mismatch becomes clear when you compare what benchmarks measure against what professionals actually need. Benchmarks reward general knowledge, but professionals need domain-specific expertise. They test creative writing when technical documentation requires precision and clarity. Riddle-solving abilities don’t translate to constraint satisfaction problems, and story completion skills don’t help with systematic procedures.
| Benchmark Tests | Professional Reality |
| --- | --- |
| General knowledge | Domain-specific expertise |
| Creative writing | Technical documentation |
| Riddle solving | Constraint satisfaction |
| Story completion | Step-by-step procedures |
| Opinion generation | Factual accuracy |
The Expertise Gap
GPT-4 vs Domain Expert: Consider a real example: optimizing solar panel string configuration. GPT-4 generates a plausible layout that passes casual inspection, but it’s missing critical elements. It doesn’t account for temperature derating calculations, ignores module-specific voltage windows, and uses oversimplified shading models.
A domain-trained model applies industry-standard methods: PVsyst shading algorithms, string voltage calculations at temperature extremes, and optimization for specific inverter MPPT ranges. The difference in output quality is dramatic.
Task: "Optimize solar panel string configuration"
GPT-4: Generates plausible but suboptimal layout
- Missing: Temperature derating calculations
- Ignoring: Module-specific voltage windows
- Assuming: Simplified shading models
Domain Model: Applies industry-standard methods
- Uses: PVsyst shading algorithms
- Considers: String voltage at temperature extremes
- Optimizes: For specific inverter MPPT ranges
The result: GPT-4’s solution looks professional but loses 15-20% annual energy yield – a difference worth millions in large installations.
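To illustrate just one of those checks, here is a hedged sketch of the cold-temperature string-voltage calculation mentioned above. The module and inverter figures are invented placeholders, and a real design would use full datasheet values and a tool such as PVsyst; the point is only that maximum string length depends on open-circuit voltage corrected for the coldest expected temperature.

```python
# Illustrative string sizing against the inverter's maximum input voltage.
# All numbers below are made-up placeholders, not real datasheet values.

MODULE_VOC_STC = 49.5        # module open-circuit voltage at 25 degC (V)
VOC_TEMP_COEFF = -0.0028     # Voc temperature coefficient (fraction per degC)
SITE_MIN_TEMP_C = -20.0      # coldest expected site temperature (degC)
INVERTER_MAX_VDC = 1000.0    # inverter maximum DC input voltage (V)

def max_modules_per_string() -> int:
    """Largest string length whose cold-weather Voc stays under the inverter limit."""
    # Voc rises as temperature falls, so size the string at the site minimum.
    voc_cold = MODULE_VOC_STC * (1 + VOC_TEMP_COEFF * (SITE_MIN_TEMP_C - 25.0))
    return int(INVERTER_MAX_VDC // voc_cold)

if __name__ == "__main__":
    print(f"Max modules per string: {max_modules_per_string()}")
```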
Design Principles
1. Reality First
Technical constraints aren’t suggestions – they’re immutable boundaries. Physical laws and API limits must be respected absolutely, and “it depends” is often the most honest and useful answer you can give. Reality-first thinking means acknowledging limitations upfront rather than discovering them after implementation.
2. Verification Over Generation
The goal isn’t to generate impressive-looking solutions – it’s to provide verifiable, testable outputs. Every response should include validation steps, and admitting uncertainty is preferable to confident guessing. Better to say “I don’t know, but here’s how to find out” than to invent plausible-sounding answers.
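As one concrete example of a validation step, a response that generates Terraform can be followed by an automated check before anything is applied. The sketch below assumes the Terraform CLI is installed and simply shells out to `terraform validate` in the generated module’s directory; it illustrates the principle rather than prescribing a workflow.

```python
import json
import subprocess

def validate_terraform(module_dir: str) -> bool:
    """Run `terraform validate -json` on a generated module and report the result."""
    # `terraform validate` checks syntax and internal consistency without
    # touching real infrastructure, so it is safe to run on generated code.
    subprocess.run(["terraform", "init", "-backend=false"], cwd=module_dir, check=True)
    result = subprocess.run(
        ["terraform", "validate", "-json"],
        cwd=module_dir,
        capture_output=True,
        text=True,
    )
    report = json.loads(result.stdout)
    for diag in report.get("diagnostics", []):
        print(f"{diag['severity']}: {diag['summary']}")
    return report.get("valid", False)
```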
3. Context Over Compliance
Blind instruction-following is dangerous in technical contexts. Understanding why something is being requested matters more than simply complying. Good technical assistants push back on impossible or potentially harmful requests, explaining limitations and offering better alternatives.
4. Systematic Over Spontaneous
Structured thinking prevents the kinds of errors that spontaneous responses encourage. Checklists and defined phases ensure nothing important gets missed, while reproducible reasoning makes it possible to debug failures and improve processes over time.
OODA as System-Level Design
The System Prompt Concept
OODA works best when enforced at the system level – built into the AI’s foundational instructions rather than requested by users. This architectural choice is crucial for reliability.
Why System-Level Enforcement Matters: When OODA is embedded in the system prompt, users can’t accidentally bypass systematic thinking through casual conversation. This ensures consistent methodology across all interactions, forces models to verify constraints before responding, and creates auditable decision trails that can be reviewed and improved.
Customizing OODA for Your Domain
The OODA framework can be adapted for different professional contexts by modifying the system prompt in your CLAUDE.md file. Each domain’s adaptation should reflect its specific constraints and priorities:
Security Analysis OODA:
OBSERVE: Identify attack surface, assets, current controls
ORIENT: Analyze threat landscape and risk vectors
DECIDE: Prioritize mitigations by risk reduction
ACT: Implement layered defenses with monitoring
Financial Planning OODA:
OBSERVE: Gather market conditions, constraints, goals
ORIENT: Apply financial models and regulatory requirements
DECIDE: Optimize portfolio allocation and risk management
ACT: Execute with position monitoring and compliance
Infrastructure Design OODA:
OBSERVE: Assess requirements, existing systems, constraints
ORIENT: Apply architectural patterns and best practices
DECIDE: Select optimal design balancing cost/performance
ACT: Provide complete specifications with monitoring
Implementation in CLAUDE.md
Add domain-specific OODA to your project’s CLAUDE.md system prompt to enforce the methodology across all interactions.
Why This Matters
The Stakes Are High
In technical workflows, wrong answers aren’t just unhelpful – they’re genuinely dangerous. Infrastructure misconfigurations cause outages and security breaches that can cost millions. Financial calculation errors lead to direct monetary losses and regulatory violations. Incorrect parameters in industrial control systems can damage expensive equipment or create safety hazards. In healthcare contexts, wrong dosage calculations can harm patients.
OODA Creates Trust
OODA enforcement creates trust through systematic thinking that produces predictable results – the same methodical approach every time. The structured phases create auditable reasoning trails that can be reviewed and validated. When errors occur, they’re traceable to specific phases, making them correctable through targeted improvements. Over time, these patterns become learnable, allowing both humans and systems to improve their decision-making processes.
Next Steps
- See OODA in Action - Real implementation example
- Configure Your System - Add OODA to your models
- Build Custom Agents - Create OODA-compliant agents