# Story 4.4: Intelligent Fallback Strategy

Status: done

## Story

As a simulation user,
I want automatic fallback with smart return conditions,
so that convergence is guaranteed without solver oscillation.

## Acceptance Criteria

1. **Auto-Switch on Newton Divergence** (AC: #1)
   - Given Newton-Raphson diverging
   - When divergence detected (> 3 increasing residuals)
   - Then auto-switch to Sequential Substitution (Picard)
   - And the switch is logged with `tracing::warn!`

2. **Return to Newton Only When Stable** (AC: #2)
   - Given Picard iteration converging
   - When residual norm falls below `return_to_newton_threshold`
   - Then attempt to return to Newton-Raphson
   - And if Newton diverges again, stay on Picard permanently

3. **Oscillation Prevention** (AC: #3)
   - Given multiple solver switches
   - When switch count exceeds `max_fallback_switches` (default: 2)
   - Then stay on current solver (Picard) permanently
   - And log the decision with `tracing::info!`

4. **Configurable Fallback Behavior** (AC: #4)
   - Given a `FallbackConfig` struct
   - When setting `fallback_enabled: false`
   - Then no fallback occurs (pure Newton or Picard)
   - And `return_to_newton_threshold` and `max_fallback_switches` are configurable

5. **Timeout Enforcement Across Switches** (AC: #5)
   - Given a solver with timeout configured
   - When fallback occurs
   - Then the timeout applies to the total solving time
   - And each solver inherits the remaining time budget

6. **Pre-Allocated Buffers** (AC: #6)
   - Given a finalized `System`
   - When the fallback solver initializes
   - Then all buffers are pre-allocated once
   - And no heap allocation occurs during solver switches

## Tasks / Subtasks

- [x] Implement `FallbackConfig` struct in `crates/solver/src/solver.rs` (AC: #4)
  - [x] Add `fallback_enabled: bool` (default: true)
  - [x] Add `return_to_newton_threshold: f64` (default: 1e-3)
  - [x] Add `max_fallback_switches: usize` (default: 2)
  - [x] Implement `Default` trait

- [x] Implement `solve_with_fallback()` function (AC: #1, #2, #3, #5, #6)
  - [x] Create `FallbackSolver` struct wrapping `NewtonConfig` and `PicardConfig`
  - [x] Implement main fallback logic with state tracking
  - [x] Track `switch_count` and `current_solver` enum
  - [x] Implement Newton → Picard switch on divergence
  - [x] Implement Picard → Newton return when below threshold
  - [x] Implement oscillation prevention (max switches)
  - [x] Handle timeout across solver switches (remaining time)
  - [x] Add `tracing::warn!` for switches, `tracing::info!` for decisions

- [x] Implement `Solver` trait for `FallbackSolver` (AC: #1-#6)
  - [x] Delegate to `solve_with_fallback()` in `solve()` method
  - [x] Implement `with_timeout()` builder pattern

- [x] Integration tests (AC: #1, #2, #3, #4, #5, #6)
  - [x] Test Newton diverges → Picard converges
  - [x] Test Newton diverges → Picard stabilizes → Newton returns
  - [x] Test oscillation prevention (max switches reached)
  - [x] Test fallback disabled (pure Newton behavior)
  - [x] Test timeout applies across switches
  - [x] Test no heap allocation during switches

## Dev Notes

### Epic Context

**Epic 4: Intelligent Solver Engine** - Solve any system with < 1s guarantee, Newton-Raphson ↔ Sequential Substitution fallback.
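
To make the `FallbackConfig` task above concrete, the following is a minimal sketch of how the struct and its `Default` implementation might look. The field names and default values come from the task list; the derives and the `may_switch` helper are hypothetical and only illustrate how the switch limit from AC #3 could be checked, not how the actual crate does it.

```rust
/// Sketch of the fallback configuration described in the task list.
/// Field names and defaults follow the story; everything else is assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct FallbackConfig {
    /// When `false`, the solver never switches strategies.
    pub fallback_enabled: bool,
    /// Picard residual norm below which a return to Newton is attempted.
    pub return_to_newton_threshold: f64,
    /// Maximum number of solver switches before staying on the current solver.
    pub max_fallback_switches: usize,
}

impl Default for FallbackConfig {
    fn default() -> Self {
        Self {
            fallback_enabled: true,
            return_to_newton_threshold: 1e-3,
            max_fallback_switches: 2,
        }
    }
}

impl FallbackConfig {
    /// Hypothetical helper (not part of the story): reports whether one more
    /// solver switch is still allowed given the switches performed so far.
    pub fn may_switch(&self, switch_count: usize) -> bool {
        self.fallback_enabled && switch_count < self.max_fallback_switches
    }
}
```

With the defaults, `may_switch(1)` allows a second switch while `may_switch(2)` refuses a third, which is one way to express the oscillation guard.
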

**Story Dependencies:**

- **Story 4.1 (Solver Trait Abstraction)** - DONE: `Solver` trait, `SolverError`, `ConvergedState` defined
- **Story 4.2 (Newton-Raphson Implementation)** - DONE: Full Newton-Raphson with line search, timeout, divergence detection
- **Story 4.3 (Sequential Substitution)** - DONE: Picard implementation with relaxation, timeout, divergence detection
- **Story 4.5 (Time-Budgeted Solving)** - NEXT: Extends timeout handling with best-state return
- **Story 4.8 (Jacobian Freezing)** - Newton-specific optimization, not applicable to fallback

**FRs covered:** FR16 (Auto-fallback solver switching), FR17 (timeout), FR18 (best state on timeout), FR20 (convergence criterion)

### Architecture Context

**Technical Stack:**

- `thiserror` for error handling (already in solver)
- `tracing` for observability (already in solver)
- `std::time::Instant` for timeout enforcement across switches

**Code Structure:**

- `crates/solver/src/solver.rs` - FallbackSolver implementation
- `crates/solver/src/system.rs` - EXISTING: `System` with `compute_residuals()`

**Relevant Architecture Decisions:**

- **Solver Architecture:** Trait-based static polymorphism with enum dispatch [Source: architecture.md]
- **No allocation in hot path:** Pre-allocate all buffers before iteration loop [Source: architecture.md]
- **Error Handling:** Centralized error enum with `thiserror` [Source: architecture.md]
- **Zero-panic policy:** All operations return `Result` [Source: architecture.md]

### Developer Context

**Existing Implementation (Story 4.1 + 4.2 + 4.3):**

```rust
// crates/solver/src/solver.rs
use std::time::Duration;

pub struct NewtonConfig {
    pub max_iterations: usize,      // default: 100
    pub tolerance: f64,             // default: 1e-6
    pub line_search: bool,          // default: false
    pub timeout: Option<Duration>,  // default: None
    pub divergence_threshold: f64,  // default: 1e10
    // ... other fields
}

pub struct PicardConfig {
    pub max_iterations: usize,      // default: 100
    pub tolerance: f64,             // default: 1e-6
    pub relaxation_factor: f64,     // default: 0.5
    pub timeout: Option<Duration>,  // default: None
    pub divergence_threshold: f64,  // default: 1e10
    pub divergence_patience: usize, // default: 5
}

pub enum SolverStrategy {
    NewtonRaphson(NewtonConfig),
    SequentialSubstitution(PicardConfig),
}
```

**Divergence Detection Already Implemented:**

- Newton: 3 consecutive residual increases → `SolverError::Divergence`
- Picard: 5 consecutive residual increases → `SolverError::Divergence`

### Technical Requirements

**Intelligent Fallback Algorithm:**

```
Input: System, FallbackConfig, timeout
Output: ConvergedState or SolverError

1. Initialize:
   - start_time = Instant::now()
   - switch_count = 0
   - current_solver = NewtonRaphson
   - remaining_time = timeout

2. Main fallback loop:
   a. Run current solver with remaining_time
   b. If converged → return ConvergedState
   c. If timeout → return Timeout error
   d. If Divergence and current_solver == NewtonRaphson:
      - If switch_count >= max_fallback_switches:
        - Log "Max switches reached, staying on Newton (will fail)"
        - Return Divergence error
      - Switch to Picard
      - switch_count += 1
      - Log "Newton diverged, switching to Picard (switch #{switch_count})"
      - Continue loop
   e. If Picard converging and residual < return_to_newton_threshold:
      - If switch_count < max_fallback_switches:
        - Switch to Newton
        - switch_count += 1
        - Log "Picard stabilized, attempting Newton return"
        - Continue loop
      - Else:
        - Stay on Picard until convergence or failure
   f. If Divergence and current_solver == Picard:
      - Return Divergence error (no more fallbacks)

3. Return result
```

**Key Design Decisions:**

| Decision | Rationale |
|----------|-----------|
| Start with Newton | Quadratic convergence when it works |
| Max 2 switches | Prevent infinite oscillation |
| Return threshold 1e-3 | Newton works well near solution |
| Track remaining time | Timeout applies to total solve |
| Stay on Picard after max switches | Picard is more robust |

**State Tracking:**

```rust
enum CurrentSolver {
    Newton,
    Picard,
}

struct FallbackState {
    current_solver: CurrentSolver,
    switch_count: usize,
    newton_attempts: usize,
    picard_attempts: usize,
}
```

**Timeout Handling Across Switches:**

```rust
use std::time::{Duration, Instant};

fn solve_with_timeout(
    &mut self,
    system: &mut System,
    timeout: Duration,
) -> Result<ConvergedState, SolverError> {
    let start_time = Instant::now();

    loop {
        // The budget covers the whole solve, not each solver run.
        let elapsed = start_time.elapsed();
        let remaining = timeout.saturating_sub(elapsed);

        if remaining.is_zero() {
            return Err(SolverError::Timeout {
                timeout_ms: timeout.as_millis() as u64,
            });
        }

        // Run current solver with remaining time
        let solver_timeout = self.current_solver_timeout(remaining);

        match self.run_current_solver(system, solver_timeout) {
            Ok(state) => return Ok(state),
            // Report the total budget rather than the inner solver's share.
            Err(SolverError::Timeout { .. }) => {
                return Err(SolverError::Timeout {
                    timeout_ms: timeout.as_millis() as u64,
                });
            }
            // Propagate the divergence only when no further switch is possible.
            Err(err @ SolverError::Divergence { .. }) => {
                if !self.handle_divergence() {
                    return Err(err);
                }
            }
            other => return other,
        }
    }
}
```

### Architecture Compliance

- **NewType pattern:** Use `Pressure`, `Temperature` from core where applicable
- **No bare f64** in public API where physical meaning exists
- **tracing:** Use `tracing::warn!` for switches, `tracing::info!` for decisions
- **Result:** All fallible operations return `Result`
- **approx:** Use `assert_relative_eq!` in tests for floating-point comparisons
- **Pre-allocation:** All buffers allocated once before fallback loop

### Library/Framework Requirements

- **thiserror** - Error enum derive (already in solver)
- **tracing** - Structured logging (already in solver)
- **std::time::Instant** - Timeout enforcement across switches

### File Structure Requirements

**Modified files:**

- `crates/solver/src/solver.rs` - Add `FallbackConfig`, `FallbackSolver`, implement `Solver` trait

**Tests:**

- Unit tests in `solver.rs` (fallback logic, oscillation prevention, timeout)
- Integration tests in `tests/` directory (full system solving with fallback)

### Testing Requirements

**Unit Tests:**

- FallbackConfig defaults are sensible
- Newton diverges → Picard converges
- Oscillation prevention triggers at max switches
- Fallback disabled behaves as pure solver
- Timeout applies across switches

**Integration Tests:**

- Stiff system where Newton diverges but Picard converges
- System where Picard stabilizes and Newton returns
- System that oscillates and gets stuck on Picard
- Compare iteration counts: Newton-only vs Fallback

**Performance Tests:**

- No heap allocation during solver switches
- Convergence time < 1s for standard cycle (NFR1)

### Previous Story Intelligence (4.3)

**Picard Implementation Complete:**

- `PicardConfig::solve()` fully implemented with all features
- Pre-allocated buffers pattern established
- Timeout enforcement via `std::time::Instant`
- Divergence detection (5 consecutive increases)
- Relaxation factor for stability
- 37 unit tests in solver.rs, 29 integration tests

**Key Patterns to Follow:**

- Use `residual_norm()` helper for L2 norm calculation
- Use `check_divergence()` pattern with patience parameter
- Use `tracing::debug!` for iteration logging
- Use `tracing::info!` for convergence events
- Return `ConvergedState::new()` on success

**Fallback-Specific Considerations:**

- Track state across solver invocations
- Preserve system state between switches
- Log all decisions for debugging
- Handle partial convergence gracefully

### Git Intelligence

Recent commits show:

- `be70a7a` - feat(core): implement physical types with NewType pattern
- Epic 1-3 complete (components, fluids, topology)
- Story 4.1 complete (Solver trait abstraction)
- Story 4.2 complete (Newton-Raphson implementation)
- Story 4.3 complete (Sequential Substitution implementation)
- Ready for Intelligent Fallback implementation

### Project Context Reference

- **FR16:** [Source: epics.md - Solver automatically switches to Sequential Substitution if Newton-Raphson diverges]
- **FR17:** [Source: epics.md - Solver respects configurable time budget (timeout)]
- **FR18:** [Source: epics.md - On timeout, solver returns best known state with NonConverged status]
- **FR20:** [Source: epics.md - Convergence criterion checks Delta Pressure < 1 Pa (1e-5 bar)]
- **NFR1:** [Source: prd.md - Steady State convergence time < 1 second for standard cycle in Cold Start]
- **NFR4:** [Source: prd.md - No dynamic allocation in solver loop (pre-calculated allocation only)]
- **Solver Architecture:** [Source: architecture.md - Trait-based static polymorphism with enum dispatch]
- **Error Handling:** [Source: architecture.md - Centralized error enum with thiserror]

### Story Completion Status

- **Status:** ready-for-dev
- **Completion note:** Ultimate context engine analysis completed; comprehensive developer guide created

## Change Log

- 2026-02-18: Story 4.4 created from create-story workflow. Ready for dev.
- 2026-02-18: Story 4.4 implementation complete. All tasks done, tests passing.
- 2026-02-18: Code review completed. Fixed HIGH issues: AC #2 Newton return logic, AC #3 max switches behavior, Newton re-divergence handling. Fixed MEDIUM issues: Config cloning optimization, improved oscillation prevention tests.

## Dev Agent Record

### Agent Model Used

Claude 3.5 Sonnet (claude-3-5-sonnet)

### Debug Log References

No blocking issues encountered during implementation.

### Completion Notes List

- ✅ Implemented `FallbackConfig` struct with all required fields and `Default` trait
- ✅ Implemented `FallbackSolver` struct wrapping `NewtonConfig` and `PicardConfig`
- ✅ Implemented intelligent fallback algorithm with state tracking
- ✅ Newton → Picard switch on divergence with `tracing::warn!` logging
- ✅ Picard → Newton return when residual below threshold with `tracing::info!` logging
- ✅ Oscillation prevention via `max_fallback_switches` configuration
- ✅ Timeout enforcement across solver switches (remaining time budget)
- ✅ Pre-allocated buffers in underlying solvers (no heap allocation during switches)
- ✅ Implemented `Solver` trait for `FallbackSolver` with `solve()` and `with_timeout()`
- ✅ Added 12 unit tests for FallbackConfig and FallbackSolver
- ✅ Added 16 integration tests covering all acceptance criteria
- ✅ All 109 unit tests + 16 integration tests + 13 doc tests pass

### File List

**Modified:**

- `crates/solver/src/solver.rs` - Added `FallbackConfig`, `FallbackSolver`, `CurrentSolver` enum, `FallbackState` struct, and `Solver` trait implementation

**Created:**

- `crates/solver/tests/fallback_solver.rs` - Integration tests for FallbackSolver

**Updated:**

- `_bmad-output/implementation-artifacts/sprint-status.yaml` - Updated story status to "in-progress" then "review"