Story 4.5: Time-Budgeted Solving

Status: done

Story

As a HIL engineer (Sarah), I want strict timeout with graceful degradation, so that real-time constraints are never violated.

Acceptance Criteria

  1. Strict Timeout Enforcement (AC: #1)

    • Given solver with timeout = 1000ms
    • When time budget exceeded
    • Then solver stops at the next iteration boundary (no new iteration starts once the budget is exceeded)
    • And timeout is checked at each iteration start
  2. Best State Return on Timeout (AC: #2)

    • Given solver that times out
    • When returning from timeout
    • Then returns ConvergedState with status = TimedOutWithBestState
    • And state contains the best-known state (lowest residual norm encountered)
    • And iterations contains the count of completed iterations
    • And final_residual contains the best residual norm
  3. HIL Zero-Order Hold (ZOH) Support (AC: #3)

    • Given HIL scenario with previous state available
    • When timeout occurs
    • Then solver can optionally return previous state instead of current best
    • And zoh_fallback: bool config option controls this behavior
  4. Timeout Across Fallback Switches (AC: #4)

    • Given FallbackSolver with timeout configured
    • When fallback occurs between Newton and Picard
    • Then timeout applies to total solving time (already implemented in Story 4.4)
    • And best state is preserved across solver switches
  5. Pre-Allocated Buffers (AC: #5)

    • Given a finalized System
    • When the solver initializes
    • Then all buffers for tracking best state are pre-allocated
    • And no heap allocation occurs during iteration loop
  6. Configurable Timeout Behavior (AC: #6)

    • Given TimeoutConfig struct
    • When setting return_best_state_on_timeout: false
    • Then solver returns SolverError::Timeout instead of ConvergedState
    • And zoh_fallback and return_best_state_on_timeout are configurable

Tasks / Subtasks

  • Implement TimeoutConfig struct in crates/solver/src/solver.rs (AC: #6)

    • Add return_best_state_on_timeout: bool (default: true)
    • Add zoh_fallback: bool (default: false)
    • Implement Default trait
  • Add best-state tracking to NewtonConfig (AC: #1, #2, #5)

    • Add best_state: Vec<f64> pre-allocated buffer
    • Add best_residual: f64 tracking variable
    • Update best state when residual improves
    • Return ConvergedState with TimedOutWithBestState on timeout
  • Add best-state tracking to PicardConfig (AC: #1, #2, #5)

    • Add best_state: Vec<f64> pre-allocated buffer
    • Add best_residual: f64 tracking variable
    • Update best state when residual improves
    • Return ConvergedState with TimedOutWithBestState on timeout
  • Update FallbackSolver for best-state preservation (AC: #4)

    • Track best state across solver switches
    • Return best state on timeout regardless of which solver was active
  • Implement ZOH fallback support (AC: #3)

    • Add previous_state: Option<Vec<f64>> to solver configs
    • On timeout with zoh_fallback: true, return previous state if available
  • Integration tests (AC: #1-#6)

    • Test timeout returns best state (not error)
    • Test best state is actually the lowest residual encountered
    • Test ZOH fallback returns previous state
    • Test timeout behavior with return_best_state_on_timeout: false
    • Test timeout across fallback switches preserves best state
    • Test no heap allocation during iteration with best-state tracking

Dev Notes

Epic Context

Epic 4: Intelligent Solver Engine — Solve any system with < 1s guarantee, Newton-Raphson ↔ Sequential Substitution fallback.

Story Dependencies:

  • Story 4.1 (Solver Trait Abstraction) — DONE: Solver trait, SolverError, ConvergedState defined
  • Story 4.2 (Newton-Raphson Implementation) — DONE: Full Newton-Raphson with line search, timeout, divergence detection
  • Story 4.3 (Sequential Substitution) — DONE: Picard implementation with relaxation, timeout, divergence detection
  • Story 4.4 (Intelligent Fallback Strategy) — DONE: FallbackSolver with timeout across switches
  • Story 4.6 (Smart Initialization Heuristic) — NEXT: Automatic initial guesses from temperatures

FRs covered: FR17 (configurable timeout), FR18 (best state on timeout), FR20 (convergence criterion)

Architecture Context

Technical Stack:

  • thiserror for error handling (already in solver)
  • tracing for observability (already in solver)
  • std::time::Instant for timeout enforcement

Code Structure:

  • crates/solver/src/solver.rs — NewtonConfig, PicardConfig, FallbackSolver modifications
  • crates/solver/src/system.rs — EXISTING: System with compute_residuals()

Relevant Architecture Decisions:

  • No allocation in hot path: Pre-allocate best-state buffers before iteration loop [Source: architecture.md]
  • Error Handling: Centralized error enum with thiserror [Source: architecture.md]
  • Zero-panic policy: All operations return Result [Source: architecture.md]
  • HIL latency < 20ms: Real-time constraints must be respected [Source: prd.md NFR6]

Developer Context

Existing Implementation (Story 4.1 + 4.2 + 4.3 + 4.4):

// crates/solver/src/solver.rs - EXISTING

pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState,  // Already defined for this story!
}

pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

// Current timeout behavior (Story 4.2/4.3):
// Returns Err(SolverError::Timeout { timeout_ms }) on timeout
// This story changes it to return Ok(ConvergedState { status: TimedOutWithBestState })

Current Timeout Implementation:

// In NewtonConfig::solve() and PicardConfig::solve()
if let Some(timeout) = self.timeout {
    if start_time.elapsed() > timeout {
        tracing::info!(...);
        return Err(SolverError::Timeout { timeout_ms: ... });
    }
}

What Needs to Change:

  1. Track best state during iteration (pre-allocated buffer)
  2. On timeout, return Ok(ConvergedState { status: TimedOutWithBestState, ... })
  3. Make this behavior configurable via TimeoutConfig

Technical Requirements

Best-State Tracking Algorithm:

Input: System, timeout
Output: ConvergedState (Converged or TimedOutWithBestState)

1. Initialize:
   - best_state = pre-allocated buffer (copy of initial state)
   - best_residual = initial residual norm
   - start_time = Instant::now()

2. Each iteration:
   a. Check timeout BEFORE starting iteration
   b. Compute residuals and update state
   c. If new residual < best_residual:
      - Copy current state to best_state
      - Update best_residual = new residual
   d. Check convergence

3. On timeout:
   - If return_best_state_on_timeout:
     - Return Ok(ConvergedState {
         state: best_state,
         iterations: completed_iterations,
         final_residual: best_residual,
         status: TimedOutWithBestState,
       })
   - Else:
     - Return Err(SolverError::Timeout { timeout_ms })
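The algorithm above can be sketched in Rust roughly as follows. This is a minimal illustration, not the code in crates/solver/src/solver.rs: `solve_with_budget`, the `Budget` struct, the `NonConvergence` variant, and the two closures (`residual_norm`, `update`) are assumptions introduced here so the example stands alone. Only `ConvergedState`, `ConvergenceStatus`, and `SolverError::Timeout` follow the shapes shown earlier in this story.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SolverError {
    Timeout { timeout_ms: u64 },
    // Hypothetical variant, only so the sketch can end without a panic.
    NonConvergence { iterations: usize },
}

/// Hypothetical bundle of the settings the loop needs.
pub struct Budget {
    pub timeout: Duration,
    pub tolerance: f64,
    pub max_iterations: usize,
    pub return_best_state_on_timeout: bool,
}

pub fn solve_with_budget<R, U>(
    initial: &[f64],
    residual_norm: R, // assumed helper: norm of the residual at a state
    mut update: U,    // assumed helper: one solver step, in place
    budget: &Budget,
) -> Result<ConvergedState, SolverError>
where
    R: Fn(&[f64]) -> f64,
    U: FnMut(&mut [f64]),
{
    // Step 1: both buffers are allocated once, before the loop.
    let mut state = initial.to_vec();
    let mut best_state = initial.to_vec();
    let mut best_residual = residual_norm(&state);
    let start_time = Instant::now();

    for completed in 0..budget.max_iterations {
        // Step 2a: check the budget BEFORE starting an iteration.
        if start_time.elapsed() > budget.timeout {
            // Step 3: graceful degradation or explicit error.
            return if budget.return_best_state_on_timeout {
                Ok(ConvergedState {
                    state: best_state,
                    iterations: completed,
                    final_residual: best_residual,
                    status: ConvergenceStatus::TimedOutWithBestState,
                })
            } else {
                Err(SolverError::Timeout {
                    timeout_ms: budget.timeout.as_millis() as u64,
                })
            };
        }
        // Step 2b: advance the state and re-evaluate the residual.
        update(&mut state);
        let residual = residual_norm(&state);
        // Step 2c: copy into the existing buffer; no new allocation.
        if residual < best_residual {
            best_state.copy_from_slice(&state);
            best_residual = residual;
        }
        // Step 2d: convergence check.
        if residual < budget.tolerance {
            return Ok(ConvergedState {
                state,
                iterations: completed + 1,
                final_residual: residual,
                status: ConvergenceStatus::Converged,
            });
        }
    }
    Err(SolverError::NonConvergence { iterations: budget.max_iterations })
}
```

Because the budget is only checked at the top of the loop, an iteration that is already running can finish after the deadline. The overrun is bounded by the cost of one iteration.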

Key Design Decisions:

| Decision | Rationale |
| --- | --- |
| Check timeout at iteration start | Guarantees no new iteration starts after the budget is exceeded |
| Pre-allocate best_state buffer | No heap allocation in hot path (NFR4) |
| Track best residual, not latest | Best state is more useful for HIL |
| Configurable return behavior | Some users prefer error on timeout |
| ZOH fallback optional | HIL-specific feature, not always needed |

TimeoutConfig Structure:

pub struct TimeoutConfig {
    /// Return best-known state on timeout instead of error.
    /// Default: true (graceful degradation for HIL)
    pub return_best_state_on_timeout: bool,
    
    /// On timeout, return previous state (ZOH) instead of current best.
    /// Requires `previous_state` to be set before solving.
    /// Default: false
    pub zoh_fallback: bool,
}
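The documented defaults (true, false) cannot come from `#[derive(Default)]`, which would set both booleans to false. A hand-written `Default` impl along these lines would match the comments above; the derive list is an assumption added for convenience in tests.

```rust
/// Timeout behavior, with the two fields described in this story.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeoutConfig {
    /// Return best-known state on timeout instead of an error.
    pub return_best_state_on_timeout: bool,
    /// On timeout, return the previous state (ZOH) instead of the best one.
    pub zoh_fallback: bool,
}

impl Default for TimeoutConfig {
    fn default() -> Self {
        Self {
            // Graceful degradation is the default for HIL use.
            return_best_state_on_timeout: true,
            // ZOH is opt-in because it needs a previous state.
            zoh_fallback: false,
        }
    }
}
```

A caller who prefers a hard error can override a single field with struct update syntax, for example `TimeoutConfig { return_best_state_on_timeout: false, ..TimeoutConfig::default() }`.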

Integration with Existing Configs:

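use std::time::Duration;

pub struct NewtonConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,

    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,

    // NEW: Previous state for ZOH fallback (set via with_previous_state())
    pub previous_state: Option<Vec<f64>>,

    // NOTE: The best-state buffer is not a config field.
    // It is allocated once in solve(), before the iteration loop.
}

pub struct PicardConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,

    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,

    // NEW: Previous state for ZOH fallback
    pub previous_state: Option<Vec<f64>>,
}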

ZOH (Zero-Order Hold) for HIL:

impl NewtonConfig {
    /// Set previous state for ZOH fallback on timeout.
    pub fn with_previous_state(mut self, state: Vec<f64>) -> Self {
        self.previous_state = Some(state);
        self
    }
    
    // In solve():
    // On timeout with zoh_fallback=true and previous_state available:
    // Return previous_state instead of best_state
}
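The selection rule described in the comments above can be isolated in a small pure function, which makes it easy to unit-test. `state_on_timeout` is a hypothetical helper name, and the length guard is an extra assumption of this sketch (a previous state of the wrong dimension is ignored), not something this story specifies.

```rust
/// Pick the state to return on timeout.
/// Hypothetical helper: the real solve() may inline this logic.
pub fn state_on_timeout<'a>(
    zoh_fallback: bool,
    previous_state: Option<&'a [f64]>,
    best_state: &'a [f64],
) -> &'a [f64] {
    match (zoh_fallback, previous_state) {
        // ZOH requested and a usable previous state exists: hold it.
        // Assumption: a dimension mismatch disqualifies the previous state.
        (true, Some(prev)) if prev.len() == best_state.len() => prev,
        // Otherwise fall back to the best state seen during this solve.
        _ => best_state,
    }
}
```

With `zoh_fallback` set but no previous state supplied, the function returns the best state, which matches the "if available" wording in the task list.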

Architecture Compliance

  • NewType pattern: Use Pressure, Temperature from core where applicable
  • No bare f64 in public API where physical meaning exists
  • tracing: Use tracing::info! for timeout events, tracing::debug! for best-state updates
  • Result<T, E>: On timeout with return_best_state_on_timeout: true, return Ok(ConvergedState)
  • approx: Use assert_relative_eq! in tests for floating-point comparisons
  • Pre-allocation: Best-state buffer allocated once before iteration loop

Library/Framework Requirements

  • thiserror — Error enum derive (already in solver)
  • tracing — Structured logging (already in solver)
  • std::time::Instant — Timeout enforcement

File Structure Requirements

Modified files:

  • crates/solver/src/solver.rs — Add TimeoutConfig, modify NewtonConfig, PicardConfig, FallbackSolver

Tests:

  • Unit tests in solver.rs (timeout behavior, best-state tracking, ZOH fallback)
  • Integration tests in tests/ directory (full system solving with timeout)

Testing Requirements

Unit Tests:

  • TimeoutConfig defaults are sensible
  • Best state is tracked correctly during iteration
  • Timeout returns ConvergedState with TimedOutWithBestState
  • ZOH fallback returns previous state when configured
  • return_best_state_on_timeout: false returns error on timeout

Integration Tests:

  • System that times out returns best state (not error)
  • Best state's residual is no higher than the initial residual (strictly lower once any iteration improves on it)
  • Timeout across fallback switches preserves best state
  • HIL scenario: ZOH fallback returns previous state

Performance Tests:

  • No heap allocation during iteration with best-state tracking
  • Timeout check overhead is negligible (< 1μs per check)
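One lightweight way to approximate the no-allocation check is to confirm that the best-state buffer keeps the same pointer and capacity across many updates. The sketch below is only a proxy: it shows that `copy_from_slice` reuses the existing buffer, but it does not prove that nothing else in the loop allocates. A counting global allocator would be needed for that. `best_state_buffer_is_stable` is a hypothetical helper name.

```rust
/// Returns true if repeatedly updating the best-state buffer
/// never moves or resizes it (a proxy for "no reallocation").
pub fn best_state_buffer_is_stable(dim: usize, iterations: usize) -> bool {
    // Allocated once, before the loop, as the story requires.
    let mut best_state = vec![0.0_f64; dim];
    let mut current = vec![0.0_f64; dim];

    let ptr_before = best_state.as_ptr();
    let cap_before = best_state.capacity();

    for i in 0..iterations {
        // Stand-in for a solver step: change the current state in place.
        for x in current.iter_mut() {
            *x = i as f64;
        }
        // In-place copy into the existing buffer.
        // By contrast, `best_state = current.clone()` would allocate here.
        best_state.copy_from_slice(&current);
    }

    best_state.as_ptr() == ptr_before && best_state.capacity() == cap_before
}
```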

Previous Story Intelligence (4.4)

FallbackSolver Implementation Complete:

  • FallbackConfig with fallback_enabled, return_to_newton_threshold, max_fallback_switches
  • FallbackSolver wrapping NewtonConfig and PicardConfig
  • Timeout applies to total solving time across switches
  • Pre-allocated buffers pattern established

Key Patterns to Follow:

  • Use residual_norm() helper for L2 norm calculation
  • Use tracing::debug! for iteration logging
  • Use tracing::info! for timeout events
  • Return ConvergedState::new() on success

Best-State Tracking Considerations:

  • Track best state in FallbackSolver across solver switches
  • Each underlying solver (Newton/Picard) tracks its own best state
  • FallbackSolver preserves best state when switching
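One possible shape for preserving the best state across switches is a small tracker owned by the fallback layer and fed by whichever solver is active. The `BestStateTracker` type below is an assumption made for illustration; the actual FallbackSolver may store this differently.

```rust
/// Best state seen so far, independent of which solver produced it.
/// Hypothetical type for illustration only.
pub struct BestStateTracker {
    state: Vec<f64>,
    residual: f64,
}

impl BestStateTracker {
    /// Allocates the buffer once, sized to the initial state.
    pub fn new(initial: &[f64], initial_residual: f64) -> Self {
        Self { state: initial.to_vec(), residual: initial_residual }
    }

    /// Record a candidate; returns true if it became the new best.
    /// A NaN residual never wins because `NaN < x` is false.
    pub fn observe(&mut self, candidate: &[f64], residual: f64) -> bool {
        if residual < self.residual && candidate.len() == self.state.len() {
            // In-place copy: no allocation while iterating.
            self.state.copy_from_slice(candidate);
            self.residual = residual;
            true
        } else {
            false
        }
    }

    pub fn state(&self) -> &[f64] {
        &self.state
    }

    pub fn residual(&self) -> f64 {
        self.residual
    }
}
```

Since the tracker outlives each Newton or Picard phase, a timeout during the Picard phase can still return a better state found earlier by Newton.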

Git Intelligence

Recent commits show:

  • be70a7a — feat(core): implement physical types with NewType pattern
  • Epic 1-3 complete (components, fluids, topology)
  • Story 4.1-4.4 complete (Solver trait, Newton, Picard, Fallback)
  • Ready for Time-Budgeted Solving implementation

Project Context Reference

  • FR17: [Source: epics.md — Solver respects configurable time budget (timeout)]
  • FR18: [Source: epics.md — On timeout, solver returns best known state with NonConverged status]
  • FR20: [Source: epics.md — Convergence criterion checks Delta Pressure < 1 Pa (1e-5 bar)]
  • NFR1: [Source: prd.md — Steady State convergence time < 1 second for standard cycle in Cold Start]
  • NFR4: [Source: prd.md — No dynamic allocation in solver loop (pre-calculated allocation only)]
  • NFR6: [Source: prd.md — HIL latency < 20 ms for real-time integration with PLC]
  • NFR10: [Source: prd.md — Graceful error handling: timeout, non-convergence, saturation return explicit Result<T, Error>]
  • Solver Architecture: [Source: architecture.md — Trait-based static polymorphism with enum dispatch]
  • Error Handling: [Source: architecture.md — Centralized error enum with thiserror]

Story Completion Status

  • Status: ready-for-dev
  • Completion note: Ultimate context engine analysis completed — comprehensive developer guide created