Story 4.4: Intelligent Fallback Strategy

Status: done

Story

As a simulation user, I want automatic fallback with smart return conditions, so that convergence is guaranteed without solver oscillation.

Acceptance Criteria

  1. Auto-Switch on Newton Divergence (AC: #1)

    • Given Newton-Raphson diverging
    • When divergence is detected (3 consecutive residual increases)
    • Then auto-switch to Sequential Substitution (Picard)
    • And the switch is logged with tracing::warn!
  2. Return to Newton Only When Stable (AC: #2)

    • Given Picard iteration converging
    • When residual norm falls below return_to_newton_threshold
    • Then attempt to return to Newton-Raphson
    • And if Newton diverges again, stay on Picard permanently
  3. Oscillation Prevention (AC: #3)

    • Given multiple solver switches
    • When switch count exceeds max_fallback_switches (default: 2)
    • Then stay on current solver (Picard) permanently
    • And log the decision with tracing::info!
  4. Configurable Fallback Behavior (AC: #4)

    • Given a FallbackConfig struct
    • When setting fallback_enabled: false
    • Then no fallback occurs (pure Newton or Picard)
    • And return_to_newton_threshold and max_fallback_switches are configurable
  5. Timeout Enforcement Across Switches (AC: #5)

    • Given a solver with timeout configured
    • When fallback occurs
    • Then the timeout applies to the total solving time
    • And each solver inherits the remaining time budget
  6. Pre-Allocated Buffers (AC: #6)

    • Given a finalized System
    • When the fallback solver initializes
    • Then all buffers are pre-allocated once
    • And no heap allocation occurs during solver switches

Tasks / Subtasks

  • Implement FallbackConfig struct in crates/solver/src/solver.rs (AC: #4)

    • Add fallback_enabled: bool (default: true)
    • Add return_to_newton_threshold: f64 (default: 1e-3)
    • Add max_fallback_switches: usize (default: 2)
    • Implement Default trait
  • Implement solve_with_fallback() function (AC: #1, #2, #3, #5, #6)

    • Create FallbackSolver struct wrapping NewtonConfig and PicardConfig
    • Implement main fallback logic with state tracking
    • Track switch_count and current_solver enum
    • Implement Newton → Picard switch on divergence
    • Implement Picard → Newton return when below threshold
    • Implement oscillation prevention (max switches)
    • Handle timeout across solver switches (remaining time)
    • Add tracing::warn! for switches, tracing::info! for decisions
  • Implement Solver trait for FallbackSolver (AC: #1-#6)

    • Delegate to solve_with_fallback() in solve() method
    • Implement with_timeout() builder pattern
  • Integration tests (AC: #1, #2, #3, #4, #5, #6)

    • Test Newton diverges → Picard converges
    • Test Newton diverges → Picard stabilizes → Newton returns
    • Test oscillation prevention (max switches reached)
    • Test fallback disabled (pure Newton behavior)
    • Test timeout applies across switches
    • Test no heap allocation during switches
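
The FallbackConfig task above can be sketched as follows. Field names and default values are taken from the task list; the derived traits and doc comments are assumptions added for illustration, not a record of the actual implementation.

```rust
/// Sketch of the fallback configuration described in the task list.
/// The derives are an assumption; only the fields and defaults come from the story.
#[derive(Debug, Clone, PartialEq)]
pub struct FallbackConfig {
    /// When false, the solver never switches strategy.
    pub fallback_enabled: bool,
    /// Picard residual norm below which a return to Newton is attempted.
    pub return_to_newton_threshold: f64,
    /// Maximum number of solver switches before staying on Picard.
    pub max_fallback_switches: usize,
}

impl Default for FallbackConfig {
    fn default() -> Self {
        Self {
            fallback_enabled: true,
            return_to_newton_threshold: 1e-3,
            max_fallback_switches: 2,
        }
    }
}
```

Struct update syntax keeps call sites short when only one field changes, for example `FallbackConfig { fallback_enabled: false, ..FallbackConfig::default() }`.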

Dev Notes

Epic Context

Epic 4: Intelligent Solver Engine — Solve any system with < 1s guarantee, Newton-Raphson ↔ Sequential Substitution fallback.

Story Dependencies:

  • Story 4.1 (Solver Trait Abstraction) — DONE: Solver trait, SolverError, ConvergedState defined
  • Story 4.2 (Newton-Raphson Implementation) — DONE: Full Newton-Raphson with line search, timeout, divergence detection
  • Story 4.3 (Sequential Substitution) — DONE: Picard implementation with relaxation, timeout, divergence detection
  • Story 4.5 (Time-Budgeted Solving) — NEXT: Extends timeout handling with best-state return
  • Story 4.8 (Jacobian Freezing) — Newton-specific optimization, not applicable to fallback

FRs covered: FR16 (Auto-fallback solver switching), FR17 (timeout), FR18 (best state on timeout), FR20 (convergence criterion)

Architecture Context

Technical Stack:

  • thiserror for error handling (already in solver)
  • tracing for observability (already in solver)
  • std::time::Instant for timeout enforcement across switches

Code Structure:

  • crates/solver/src/solver.rs — FallbackSolver implementation
  • crates/solver/src/system.rs — EXISTING: System with compute_residuals()

Relevant Architecture Decisions:

  • Solver Architecture: Trait-based static polymorphism with enum dispatch [Source: architecture.md]
  • No allocation in hot path: Pre-allocate all buffers before iteration loop [Source: architecture.md]
  • Error Handling: Centralized error enum with thiserror [Source: architecture.md]
  • Zero-panic policy: All operations return Result [Source: architecture.md]
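
As a rough illustration of "trait-based static polymorphism with enum dispatch", the sketch below wraps two concrete solver types in an enum and forwards trait calls through a `match`. Every name here (`IterativeSolver`, `NewtonSketch`, `PicardSketch`, `StrategySketch`) is hypothetical and deliberately simplified; the real `Solver` trait and `SolverStrategy` enum are likely to differ.

```rust
/// Hypothetical, simplified solver interface used only for this sketch.
pub trait IterativeSolver {
    fn name(&self) -> &'static str;
    fn max_iterations(&self) -> usize;
}

pub struct NewtonSketch {
    pub max_iterations: usize,
}

pub struct PicardSketch {
    pub max_iterations: usize,
}

impl IterativeSolver for NewtonSketch {
    fn name(&self) -> &'static str {
        "newton"
    }
    fn max_iterations(&self) -> usize {
        self.max_iterations
    }
}

impl IterativeSolver for PicardSketch {
    fn name(&self) -> &'static str {
        "picard"
    }
    fn max_iterations(&self) -> usize {
        self.max_iterations
    }
}

/// Enum dispatch: a closed set of strategies resolved with `match`,
/// so no `Box<dyn IterativeSolver>` is needed.
pub enum StrategySketch {
    Newton(NewtonSketch),
    Picard(PicardSketch),
}

impl IterativeSolver for StrategySketch {
    fn name(&self) -> &'static str {
        match self {
            StrategySketch::Newton(s) => s.name(),
            StrategySketch::Picard(s) => s.name(),
        }
    }
    fn max_iterations(&self) -> usize {
        match self {
            StrategySketch::Newton(s) => s.max_iterations(),
            StrategySketch::Picard(s) => s.max_iterations(),
        }
    }
}
```

Because the set of variants is closed, adding a new strategy forces every `match` to be updated at compile time, which fits the zero-panic policy.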

Developer Context

Existing Implementation (Story 4.1 + 4.2 + 4.3):

// crates/solver/src/solver.rs
pub struct NewtonConfig {
    pub max_iterations: usize,      // default: 100
    pub tolerance: f64,             // default: 1e-6
    pub line_search: bool,          // default: false
    pub timeout: Option<Duration>,  // default: None
    pub divergence_threshold: f64,  // default: 1e10
    // ... other fields
}

pub struct PicardConfig {
    pub max_iterations: usize,      // default: 100
    pub tolerance: f64,             // default: 1e-6
    pub relaxation_factor: f64,     // default: 0.5
    pub timeout: Option<Duration>,  // default: None
    pub divergence_threshold: f64,  // default: 1e10
    pub divergence_patience: usize, // default: 5
}

pub enum SolverStrategy {
    NewtonRaphson(NewtonConfig),
    SequentialSubstitution(PicardConfig),
}

Divergence Detection Already Implemented:

  • Newton: 3 consecutive residual increases → SolverError::Divergence
  • Picard: 5 consecutive residual increases → SolverError::Divergence
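
The consecutive-increase rule can be sketched as a small tracker with a patience parameter (3 for Newton, 5 for Picard). `DivergenceTracker` and `observe` are hypothetical names; the existing solvers may structure this check differently, and the sketch ignores the separate `divergence_threshold` field.

```rust
/// Hypothetical helper: counts consecutive residual increases and reports
/// divergence once `patience` increases in a row have been seen.
#[derive(Debug, Clone)]
pub struct DivergenceTracker {
    patience: usize,
    previous_norm: Option<f64>,
    consecutive_increases: usize,
}

impl DivergenceTracker {
    pub fn new(patience: usize) -> Self {
        Self {
            patience,
            previous_norm: None,
            consecutive_increases: 0,
        }
    }

    /// Feed the residual norm of the latest iteration.
    /// Returns `true` when the solver should report divergence.
    pub fn observe(&mut self, norm: f64) -> bool {
        match self.previous_norm {
            // The residual grew compared with the previous iteration.
            Some(prev) if norm > prev => self.consecutive_increases += 1,
            // First sample, or the residual did not grow: reset the streak.
            _ => self.consecutive_increases = 0,
        }
        self.previous_norm = Some(norm);
        self.consecutive_increases >= self.patience
    }
}
```

A single non-increasing step resets the streak, so a noisy but overall decreasing residual history does not trigger a switch.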

Technical Requirements

Intelligent Fallback Algorithm:

Input: System, FallbackConfig, timeout
Output: ConvergedState or SolverError

1. Initialize:
   - start_time = Instant::now()
   - switch_count = 0
   - current_solver = NewtonRaphson
   - remaining_time = timeout

2. Main fallback loop:
   a. Run current solver with remaining_time
   b. If converged → return ConvergedState
   c. If timeout → return Timeout error
   
   d. If Divergence and current_solver == NewtonRaphson:
      - If fallback_enabled is false:
        - Return Divergence error
      - If this is a re-divergence after a Newton return,
        or switch_count >= max_fallback_switches:
        - Log "Staying on Picard permanently" (tracing::info!)
        - Disable further Newton returns
      - Switch to Picard
      - switch_count += 1
      - Log "Newton diverged, switching to Picard (switch #{switch_count})"
      - Continue loop
   
   e. If Picard converging and residual < return_to_newton_threshold:
      - If switch_count < max_fallback_switches:
        - Switch to Newton
        - switch_count += 1
        - Log "Picard stabilized, attempting Newton return"
        - Continue loop
      - Else:
        - Stay on Picard until convergence or failure
   
   f. If Divergence and current_solver == Picard:
      - Return Divergence error (no more fallbacks)

3. Return result
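
The switching rules can be isolated from the numerical work as a small decision function, which makes them easy to unit test. The sketch below is one possible reading, not the actual FallbackSolver: all type names (`SolverKind`, `RunOutcome`, `Action`, `FallbackPolicy`) are hypothetical, and for a Newton re-divergence it follows AC #2 and AC #3 (fall back to Picard and stay there).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SolverKind {
    Newton,
    Picard,
}

/// Hypothetical summary of one inner-solver run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunOutcome {
    Converged,
    Diverged,
    /// Picard paused because its residual dropped below the return threshold.
    BelowReturnThreshold,
    TimedOut,
}

/// What the fallback loop should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Finish,
    SwitchTo(SolverKind),
    Continue,
    Fail,
}

pub struct FallbackPolicy {
    pub fallback_enabled: bool,
    pub max_fallback_switches: usize,
    pub switch_count: usize,
    pub current: SolverKind,
    /// Once set, no further return to Newton is attempted.
    pub picard_locked: bool,
}

impl FallbackPolicy {
    pub fn new(max_fallback_switches: usize) -> Self {
        Self {
            fallback_enabled: true,
            max_fallback_switches,
            switch_count: 0,
            current: SolverKind::Newton,
            picard_locked: false,
        }
    }

    pub fn next_action(&mut self, outcome: RunOutcome) -> Action {
        match (self.current, outcome) {
            (_, RunOutcome::Converged) => Action::Finish,
            (_, RunOutcome::TimedOut) => Action::Fail,
            (SolverKind::Picard, RunOutcome::Diverged) => Action::Fail,
            (SolverKind::Newton, RunOutcome::Diverged) => {
                if !self.fallback_enabled {
                    return Action::Fail;
                }
                // Newton is only current at the start or after a return, so a
                // non-zero switch count means this is a re-divergence.
                if self.switch_count > 0 || self.switch_count >= self.max_fallback_switches {
                    self.picard_locked = true;
                }
                self.switch_count += 1;
                self.current = SolverKind::Picard;
                Action::SwitchTo(SolverKind::Picard)
            }
            (SolverKind::Picard, RunOutcome::BelowReturnThreshold) => {
                let may_return = self.fallback_enabled
                    && !self.picard_locked
                    && self.switch_count < self.max_fallback_switches;
                if may_return {
                    self.switch_count += 1;
                    self.current = SolverKind::Newton;
                    Action::SwitchTo(SolverKind::Newton)
                } else {
                    Action::Continue
                }
            }
            // Newton never reports this outcome in the sketch; keep iterating.
            (SolverKind::Newton, RunOutcome::BelowReturnThreshold) => Action::Continue,
        }
    }
}
```

With the default of two switches, the sequence Newton diverges, Picard stabilizes, Newton diverges again ends with the policy locked on Picard, after which only convergence, timeout, or a Picard divergence can end the solve.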

Key Design Decisions:

| Decision | Rationale |
|---|---|
| Start with Newton | Quadratic convergence when it works |
| Max 2 switches | Prevent infinite oscillation |
| Return threshold 1e-3 | Newton works well near solution |
| Track remaining time | Timeout applies to total solve |
| Stay on Picard after max switches | Picard is more robust |

State Tracking:

enum CurrentSolver {
    Newton,
    Picard,
}

struct FallbackState {
    current_solver: CurrentSolver,
    switch_count: usize,
    newton_attempts: usize,
    picard_attempts: usize,
}

Timeout Handling Across Switches:

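use std::time::{Duration, Instant};

fn solve_with_timeout(
    &mut self,
    system: &mut System,
    timeout: Duration,
) -> Result<ConvergedState, SolverError> {
    let start_time = Instant::now();
    let timeout_ms = timeout.as_millis() as u64;

    loop {
        // The budget covers the whole solve, not each individual solver run.
        let remaining = timeout.saturating_sub(start_time.elapsed());
        if remaining.is_zero() {
            return Err(SolverError::Timeout { timeout_ms });
        }

        // Run the current solver with whatever time is left.
        let solver_timeout = self.current_solver_timeout(remaining);
        match self.run_current_solver(system, solver_timeout) {
            Ok(state) => return Ok(state),
            // Assumption: report the total budget rather than the inner solver's share.
            Err(SolverError::Timeout { .. }) => {
                return Err(SolverError::Timeout { timeout_ms });
            }
            Err(err @ SolverError::Divergence { .. }) => {
                // handle_divergence() returns false when no fallback remains,
                // in which case the divergence error is propagated unchanged.
                if !self.handle_divergence() {
                    return Err(err);
                }
            }
            // Any other error is not recoverable by switching solvers.
            Err(other) => return Err(other),
        }
    }
}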
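
The budget arithmetic is simple enough to pull into a pure function for unit testing. `remaining_budget` is a hypothetical helper, not part of the existing solver API. As a worked example, with a 1000 ms budget and a Newton run that used 400 ms before diverging, Picard inherits 600 ms.

```rust
use std::time::Duration;

/// Hypothetical helper: time left for the next solver run, or `None`
/// once the total budget is spent.
pub fn remaining_budget(total: Duration, elapsed: Duration) -> Option<Duration> {
    // `saturating_sub` clamps at zero instead of panicking on underflow.
    let remaining = total.saturating_sub(elapsed);
    if remaining.is_zero() {
        None
    } else {
        Some(remaining)
    }
}
```

Taking `elapsed` as a parameter instead of calling `Instant::now()` inside the helper keeps the function deterministic, so timeout tests do not need to sleep.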

Architecture Compliance

  • NewType pattern: Use Pressure, Temperature from core where applicable
  • No bare f64 in public API where physical meaning exists
  • tracing: Use tracing::warn! for switches, tracing::info! for decisions
  • Result<T, E>: All fallible operations return Result
  • approx: Use assert_relative_eq! in tests for floating-point comparisons
  • Pre-allocation: All buffers allocated once before fallback loop

Library/Framework Requirements

  • thiserror — Error enum derive (already in solver)
  • tracing — Structured logging (already in solver)
  • std::time::Instant — Timeout enforcement across switches

File Structure Requirements

Modified files:

  • crates/solver/src/solver.rs — Add FallbackConfig, FallbackSolver, implement Solver trait

Tests:

  • Unit tests in solver.rs (fallback logic, oscillation prevention, timeout)
  • Integration tests in tests/ directory (full system solving with fallback)

Testing Requirements

Unit Tests:

  • FallbackConfig defaults are sensible
  • Newton diverges → Picard converges
  • Oscillation prevention triggers at max switches
  • Fallback disabled behaves as pure solver
  • Timeout applies across switches

Integration Tests:

  • Stiff system where Newton diverges but Picard converges
  • System where Picard stabilizes and Newton returns
  • System that oscillates and gets stuck on Picard
  • Compare iteration counts: Newton-only vs Fallback

Performance Tests:

  • No heap allocation during solver switches
  • Convergence time < 1s for standard cycle (NFR1)
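
One inexpensive proxy for the no-allocation check is to confirm that shared buffers keep the same address and capacity across a switch. The sketch below assumes a hypothetical `SolverWorkspace` owned by the fallback solver; the actual buffer layout in the underlying solvers may differ, and a counting global allocator would give a stronger guarantee.

```rust
/// Hypothetical workspace shared by both solvers; sized once from the
/// number of unknowns and never resized afterwards.
pub struct SolverWorkspace {
    pub residuals: Vec<f64>,
    pub state: Vec<f64>,
    pub delta: Vec<f64>,
}

impl SolverWorkspace {
    /// The only place where buffers are allocated.
    pub fn new(n_unknowns: usize) -> Self {
        Self {
            residuals: vec![0.0; n_unknowns],
            state: vec![0.0; n_unknowns],
            delta: vec![0.0; n_unknowns],
        }
    }

    /// Reset scratch buffers in place when switching solvers.
    /// `fill` overwrites existing elements and does not reallocate.
    pub fn reset_for_switch(&mut self) {
        self.residuals.fill(0.0);
        self.delta.fill(0.0);
        // `state` is deliberately kept so the next solver starts from
        // the previous solver's last iterate.
    }
}
```

A test can record `as_ptr()` and `capacity()` before the switch and assert that both are unchanged afterwards.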

Previous Story Intelligence (4.3)

Picard Implementation Complete:

  • PicardConfig::solve() fully implemented with all features
  • Pre-allocated buffers pattern established
  • Timeout enforcement via std::time::Instant
  • Divergence detection (5 consecutive increases)
  • Relaxation factor for stability
  • 37 unit tests in solver.rs, 29 integration tests

Key Patterns to Follow:

  • Use residual_norm() helper for L2 norm calculation
  • Use check_divergence() pattern with patience parameter
  • Use tracing::debug! for iteration logging
  • Use tracing::info! for convergence events
  • Return ConvergedState::new() on success

Fallback-Specific Considerations:

  • Track state across solver invocations
  • Preserve system state between switches
  • Log all decisions for debugging
  • Handle partial convergence gracefully

Git Intelligence

Recent commits show:

  • be70a7a — feat(core): implement physical types with NewType pattern
  • Epic 1-3 complete (components, fluids, topology)
  • Story 4.1 complete (Solver trait abstraction)
  • Story 4.2 complete (Newton-Raphson implementation)
  • Story 4.3 complete (Sequential Substitution implementation)
  • Ready for Intelligent Fallback implementation

Project Context Reference

  • FR16: [Source: epics.md — Solver automatically switches to Sequential Substitution if Newton-Raphson diverges]
  • FR17: [Source: epics.md — Solver respects configurable time budget (timeout)]
  • FR18: [Source: epics.md — On timeout, solver returns best known state with NonConverged status]
  • FR20: [Source: epics.md — Convergence criterion checks Delta Pressure < 1 Pa (1e-5 bar)]
  • NFR1: [Source: prd.md — Steady State convergence time < 1 second for standard cycle in Cold Start]
  • NFR4: [Source: prd.md — No dynamic allocation in solver loop (pre-calculated allocation only)]
  • Solver Architecture: [Source: architecture.md — Trait-based static polymorphism with enum dispatch]
  • Error Handling: [Source: architecture.md — Centralized error enum with thiserror]

Story Completion Status

  • Status: done
  • Completion note: Ultimate context engine analysis completed — comprehensive developer guide created

Change Log

  • 2026-02-18: Story 4.4 created from create-story workflow. Ready for dev.
  • 2026-02-18: Story 4.4 implementation complete. All tasks done, tests passing.
  • 2026-02-18: Code review completed. Fixed HIGH issues: AC #2 Newton return logic, AC #3 max switches behavior, Newton re-divergence handling. Fixed MEDIUM issues: Config cloning optimization, improved oscillation prevention tests.

Dev Agent Record

Agent Model Used

Claude 3.5 Sonnet (claude-3-5-sonnet)

Debug Log References

No blocking issues encountered during implementation.

Completion Notes List

  • Implemented FallbackConfig struct with all required fields and Default trait
  • Implemented FallbackSolver struct wrapping NewtonConfig and PicardConfig
  • Implemented intelligent fallback algorithm with state tracking
  • Newton → Picard switch on divergence with tracing::warn! logging
  • Picard → Newton return when residual below threshold with tracing::info! logging
  • Oscillation prevention via max_fallback_switches configuration
  • Timeout enforcement across solver switches (remaining time budget)
  • Pre-allocated buffers in underlying solvers (no heap allocation during switches)
  • Implemented Solver trait for FallbackSolver with solve() and with_timeout()
  • Added 12 unit tests for FallbackConfig and FallbackSolver
  • Added 16 integration tests covering all acceptance criteria
  • All 109 unit tests + 16 integration tests + 13 doc tests pass

File List

Modified:

  • crates/solver/src/solver.rs — Added FallbackConfig, FallbackSolver, CurrentSolver enum, FallbackState struct, and Solver trait implementation

Created:

  • crates/solver/tests/fallback_solver.rs — Integration tests for FallbackSolver

Updated:

  • _bmad-output/implementation-artifacts/sprint-status.yaml — Updated story status to "in-progress" then "review"