Entropyk/_bmad-output/implementation-artifacts/4-5-time-budgeted-solving.md

# Story 4.5: Time-Budgeted Solving

Status: done

<!-- Note: Validation is optional. Run validate-create-story for quality check before dev-story. -->

## Story

As a HIL engineer (Sarah),
I want strict timeout with graceful degradation,
so that real-time constraints are never violated.

## Acceptance Criteria

1. **Strict Timeout Enforcement** (AC: #1)
   - Given solver with timeout = 1000ms
   - When time budget exceeded
   - Then solver stops at the next iteration boundary (no new iteration starts after the timeout)
   - And timeout is checked at each iteration start

2. **Best State Return on Timeout** (AC: #2)
   - Given solver that times out
   - When returning from timeout
   - Then returns `ConvergedState` with `status = TimedOutWithBestState`
   - And `state` contains the best-known state (lowest residual norm encountered)
   - And `iterations` contains the count of completed iterations
   - And `final_residual` contains the best residual norm

3. **HIL Zero-Order Hold (ZOH) Support** (AC: #3)
   - Given HIL scenario with previous state available
   - When timeout occurs
   - Then solver can optionally return previous state instead of current best
   - And `zoh_fallback: bool` config option controls this behavior

4. **Timeout Across Fallback Switches** (AC: #4)
   - Given `FallbackSolver` with timeout configured
   - When fallback occurs between Newton and Picard
   - Then timeout applies to total solving time (already implemented in Story 4.4)
   - And best state is preserved across solver switches

5. **Pre-Allocated Buffers** (AC: #5)
   - Given a finalized `System`
   - When the solver initializes
   - Then all buffers for tracking best state are pre-allocated
   - And no heap allocation occurs during iteration loop

6. **Configurable Timeout Behavior** (AC: #6)
   - Given `TimeoutConfig` struct
   - When setting `return_best_state_on_timeout: false`
   - Then solver returns `SolverError::Timeout` instead of `ConvergedState`
   - And `zoh_fallback` and `return_best_state_on_timeout` are configurable

## Tasks / Subtasks

- [ ] Implement `TimeoutConfig` struct in `crates/solver/src/solver.rs` (AC: #6)
  - [ ] Add `return_best_state_on_timeout: bool` (default: true)
  - [ ] Add `zoh_fallback: bool` (default: false)
  - [ ] Implement `Default` trait

- [ ] Add best-state tracking to `NewtonConfig::solve()` (AC: #1, #2, #5)
  - [ ] Pre-allocate a `best_state: Vec<f64>` buffer once, before the iteration loop
  - [ ] Add `best_residual: f64` tracking variable
  - [ ] Update best state when residual improves
  - [ ] Return `ConvergedState` with `TimedOutWithBestState` on timeout

- [ ] Add best-state tracking to `PicardConfig::solve()` (AC: #1, #2, #5)
  - [ ] Pre-allocate a `best_state: Vec<f64>` buffer once, before the iteration loop
  - [ ] Add `best_residual: f64` tracking variable
  - [ ] Update best state when residual improves
  - [ ] Return `ConvergedState` with `TimedOutWithBestState` on timeout

- [ ] Update `FallbackSolver` for best-state preservation (AC: #4)
  - [ ] Track best state across solver switches
  - [ ] Return best state on timeout regardless of which solver was active

- [ ] Implement ZOH fallback support (AC: #3)
  - [ ] Add `previous_state: Option<Vec<f64>>` to solver configs
  - [ ] On timeout with `zoh_fallback: true`, return previous state if available

- [ ] Integration tests (AC: #1-#6)
  - [ ] Test timeout returns best state (not error)
  - [ ] Test best state is actually the lowest residual encountered
  - [ ] Test ZOH fallback returns previous state
  - [ ] Test timeout behavior with `return_best_state_on_timeout: false`
  - [ ] Test timeout across fallback switches preserves best state
  - [ ] Test no heap allocation during iteration with best-state tracking

## Dev Notes

### Epic Context

**Epic 4: Intelligent Solver Engine** — Solve any system with < 1s guarantee, Newton-Raphson ↔ Sequential Substitution fallback.

**Story Dependencies:**
- **Story 4.1 (Solver Trait Abstraction)** — DONE: `Solver` trait, `SolverError`, `ConvergedState` defined
- **Story 4.2 (Newton-Raphson Implementation)** — DONE: Full Newton-Raphson with line search, timeout, divergence detection
- **Story 4.3 (Sequential Substitution)** — DONE: Picard implementation with relaxation, timeout, divergence detection
- **Story 4.4 (Intelligent Fallback Strategy)** — DONE: FallbackSolver with timeout across switches
- **Story 4.6 (Smart Initialization Heuristic)** — NEXT: Automatic initial guesses from temperatures

**FRs covered:** FR17 (configurable timeout), FR18 (best state on timeout), FR20 (convergence criterion)

### Architecture Context

**Technical Stack:**
- `thiserror` for error handling (already in solver)
- `tracing` for observability (already in solver)
- `std::time::Instant` for timeout enforcement

**Code Structure:**
- `crates/solver/src/solver.rs` — NewtonConfig, PicardConfig, FallbackSolver modifications
- `crates/solver/src/system.rs` — EXISTING: `System` with `compute_residuals()`

**Relevant Architecture Decisions:**
- **No allocation in hot path:** Pre-allocate best-state buffers before iteration loop [Source: architecture.md]
- **Error Handling:** Centralized error enum with `thiserror` [Source: architecture.md]
- **Zero-panic policy:** All operations return `Result` [Source: architecture.md]
- **HIL latency < 20ms:** Real-time constraints must be respected [Source: prd.md NFR6]

### Developer Context

**Existing Implementation (Story 4.1 + 4.2 + 4.3 + 4.4):**

```rust
// crates/solver/src/solver.rs - EXISTING

pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState,  // Already defined for this story!
}

pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

// Current timeout behavior (Story 4.2/4.3):
// Returns Err(SolverError::Timeout { timeout_ms }) on timeout
// This story changes it to return Ok(ConvergedState { status: TimedOutWithBestState })
```

**Current Timeout Implementation:**
```rust
// In NewtonConfig::solve() and PicardConfig::solve()
if let Some(timeout) = self.timeout {
    if start_time.elapsed() > timeout {
        tracing::info!(...);
        return Err(SolverError::Timeout { timeout_ms: ... });
    }
}
```

**What Needs to Change:**
1. Track best state during iteration (pre-allocated buffer)
2. On timeout, return `Ok(ConvergedState { status: TimedOutWithBestState, ... })`
3. Make this behavior configurable via `TimeoutConfig`

### Technical Requirements

**Best-State Tracking Algorithm:**

```
Input: System, timeout
Output: ConvergedState (Converged or TimedOutWithBestState)

1. Initialize:
   - best_state = pre-allocated buffer (copy of initial state)
   - best_residual = initial residual norm
   - start_time = Instant::now()

2. Each iteration:
   a. Check timeout BEFORE starting iteration
   b. Compute residuals and update state
   c. If new residual < best_residual:
      - Copy current state to best_state
      - Update best_residual = new residual
   d. Check convergence

3. On timeout:
   - If return_best_state_on_timeout:
     - Return Ok(ConvergedState {
         state: best_state,
         iterations: completed_iterations,
         final_residual: best_residual,
         status: TimedOutWithBestState,
       })
   - Else:
     - Return Err(SolverError::Timeout { timeout_ms })
```
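
The algorithm above can be sketched in Rust roughly as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: `solve_with_budget`, its `step` closure (which stands in for one Newton or Picard iteration and returns the new residual norm), and `SketchError` (a simplified stand-in for `SolverError`) are hypothetical names. `ConvergedState` and `ConvergenceStatus` are redeclared here only so the block is self-contained.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

/// Simplified stand-in for the crate's `SolverError` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum SketchError {
    Timeout { timeout_ms: u64 },
    MaxIterations,
}

/// Hypothetical driver. `step` updates `state` in place and returns the
/// residual norm of the updated state.
pub fn solve_with_budget<F>(
    state: &mut [f64],
    initial_residual: f64,
    mut step: F,
    timeout: Duration,
    tolerance: f64,
    max_iterations: usize,
    return_best_state_on_timeout: bool,
) -> Result<ConvergedState, SketchError>
where
    F: FnMut(&mut [f64]) -> f64,
{
    // 1. Initialize: the best-state buffer is allocated once, here.
    let mut best_state = state.to_vec();
    let mut best_residual = initial_residual;
    let start_time = Instant::now();

    for completed in 0..max_iterations {
        // 2a. Check the budget BEFORE starting the next iteration.
        if start_time.elapsed() > timeout {
            // 3. On timeout: best state or error, depending on the flag.
            return if return_best_state_on_timeout {
                Ok(ConvergedState {
                    state: best_state,
                    iterations: completed,
                    final_residual: best_residual,
                    status: ConvergenceStatus::TimedOutWithBestState,
                })
            } else {
                Err(SketchError::Timeout { timeout_ms: timeout.as_millis() as u64 })
            };
        }

        // 2b. One solver iteration.
        let residual = step(&mut *state);

        // 2c. Keep the lowest-residual state by copying into the existing buffer.
        if residual < best_residual {
            best_state.copy_from_slice(&*state);
            best_residual = residual;
        }

        // 2d. Convergence check (the copy below happens on exit, not per iteration).
        if residual < tolerance {
            return Ok(ConvergedState {
                state: state.to_vec(),
                iterations: completed + 1,
                final_residual: residual,
                status: ConvergenceStatus::Converged,
            });
        }
    }
    Err(SketchError::MaxIterations)
}
```

Because the budget is only checked between iterations, a single slow iteration can still overrun the deadline by up to its own duration. The sketch makes that trade-off visible rather than hiding it.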

**Key Design Decisions:**

| Decision | Rationale |
|----------|-----------|
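| Check timeout at iteration start | No new iteration starts once the budget is spent; overrun is bounded by one iteration's duration |
| Pre-allocate best_state buffer | No heap allocation in hot path (NFR4) |
| Track best residual, not latest | Residuals need not decrease monotonically, so the lowest-residual state is the most useful result for HIL |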
| Configurable return behavior | Some users prefer error on timeout |
| ZOH fallback optional | HIL-specific feature, not always needed |

**TimeoutConfig Structure:**

```rust
pub struct TimeoutConfig {
    /// Return best-known state on timeout instead of error.
    /// Default: true (graceful degradation for HIL)
    pub return_best_state_on_timeout: bool,

    /// On timeout, return previous state (ZOH) instead of current best.
    /// Requires `previous_state` to be set before solving.
    /// Default: false
    pub zoh_fallback: bool,
}
```
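
The documented defaults need a hand-written `Default` implementation. A possible sketch follows; the struct is repeated so the block compiles on its own, and `hil_zero_order_hold` is a hypothetical convenience constructor that is not part of the story.

```rust
/// Timeout behavior flags, mirroring the struct in this story.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeoutConfig {
    pub return_best_state_on_timeout: bool,
    pub zoh_fallback: bool,
}

impl Default for TimeoutConfig {
    // Hand-written on purpose: `#[derive(Default)]` would set both flags to
    // `false`, which contradicts the documented default of `true` for
    // `return_best_state_on_timeout`.
    fn default() -> Self {
        Self {
            return_best_state_on_timeout: true,
            zoh_fallback: false,
        }
    }
}

impl TimeoutConfig {
    /// Hypothetical helper: defaults plus ZOH fallback enabled.
    pub fn hil_zero_order_hold() -> Self {
        Self {
            zoh_fallback: true,
            ..Self::default()
        }
    }
}
```

Struct update syntax (`..Self::default()`) keeps call sites stable if more flags are added later.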

**Integration with Existing Configs:**

```rust
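use std::time::Duration;

pub struct NewtonConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,

    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,

    // NEW: Previous state for ZOH fallback (set via `with_previous_state`)
    pub previous_state: Option<Vec<f64>>,

    // NOTE: The best-state buffer is not a config field; it is
    // allocated once in solve(), before the iteration loop.
}

pub struct PicardConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,

    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,

    // NEW: Previous state for ZOH fallback
    pub previous_state: Option<Vec<f64>>,
}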
```

**ZOH (Zero-Order Hold) for HIL:**

```rust
impl NewtonConfig {
    /// Set previous state for ZOH fallback on timeout.
    pub fn with_previous_state(mut self, state: Vec<f64>) -> Self {
        self.previous_state = Some(state);
        self
    }

    // In solve():
    // On timeout with zoh_fallback=true and previous_state available:
    // Return previous_state instead of best_state
}
```
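
One way to keep the timeout branch in `solve()` small is to isolate the decision about which state to return. The sketch below is an assumption-laden illustration: `TimeoutChoice` and `choose_timeout_state` are hypothetical names, and the precedence rule (`return_best_state_on_timeout: false` wins over `zoh_fallback`) is a guess, since the story does not say how the two flags interact. When ZOH is enabled but no previous state was supplied, the sketch falls back to the best state, matching "return previous state if available".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeoutConfig {
    pub return_best_state_on_timeout: bool,
    pub zoh_fallback: bool,
}

/// What a timed-out solve should hand back (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TimeoutChoice<'a> {
    /// Caller should return the timeout error.
    Error,
    /// Caller should return the previous time step's state (zero-order hold).
    PreviousState(&'a [f64]),
    /// Caller should return the lowest-residual state seen in this solve.
    BestState(&'a [f64]),
}

pub fn choose_timeout_state<'a>(
    config: &TimeoutConfig,
    best_state: &'a [f64],
    previous_state: Option<&'a [f64]>,
) -> TimeoutChoice<'a> {
    // Assumption: the "error on timeout" setting takes precedence over ZOH.
    if !config.return_best_state_on_timeout {
        return TimeoutChoice::Error;
    }
    match previous_state {
        // ZOH requested and a previous state exists: hold the previous output.
        Some(prev) if config.zoh_fallback => TimeoutChoice::PreviousState(prev),
        // ZOH disabled, or no previous state available: use the best state.
        _ => TimeoutChoice::BestState(best_state),
    }
}
```

The helper borrows both candidate states, so no copy is made until the caller builds the final `ConvergedState`.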

### Architecture Compliance

- **NewType pattern:** Use `Pressure`, `Temperature` from core where applicable
- **No bare f64** in public API where physical meaning exists
- **tracing:** Use `tracing::info!` for timeout events, `tracing::debug!` for best-state updates
- **`Result<T, E>`:** On timeout with `return_best_state_on_timeout: true`, return `Ok(ConvergedState)`
- **approx:** Use `assert_relative_eq!` in tests for floating-point comparisons
- **Pre-allocation:** Best-state buffer allocated once before iteration loop

### Library/Framework Requirements

- **thiserror** — Error enum derive (already in solver)
- **tracing** — Structured logging (already in solver)
- **std::time::Instant** — Timeout enforcement

### File Structure Requirements

**Modified files:**
- `crates/solver/src/solver.rs` — Add `TimeoutConfig`, modify `NewtonConfig`, `PicardConfig`, `FallbackSolver`

**Tests:**
- Unit tests in `solver.rs` (timeout behavior, best-state tracking, ZOH fallback)
- Integration tests in `tests/` directory (full system solving with timeout)

### Testing Requirements

**Unit Tests:**
- TimeoutConfig defaults are sensible
- Best state is tracked correctly during iteration
- Timeout returns `ConvergedState` with `TimedOutWithBestState`
- ZOH fallback returns previous state when configured
- `return_best_state_on_timeout: false` returns error on timeout

**Integration Tests:**
- System that times out returns best state (not error)
- Best state has lower residual than initial state
- Timeout across fallback switches preserves best state
- HIL scenario: ZOH fallback returns previous state

**Performance Tests:**
- No heap allocation during iteration with best-state tracking
- Timeout check overhead is negligible (< 1μs per check)

### Previous Story Intelligence (4.4)

**FallbackSolver Implementation Complete:**
- `FallbackConfig` with `fallback_enabled`, `return_to_newton_threshold`, `max_fallback_switches`
- `FallbackSolver` wrapping `NewtonConfig` and `PicardConfig`
- Timeout applies to total solving time across switches
- Pre-allocated buffers pattern established

**Key Patterns to Follow:**
- Use `residual_norm()` helper for L2 norm calculation
- Use `tracing::debug!` for iteration logging
- Use `tracing::info!` for timeout events
- Return `ConvergedState::new()` on success

**Best-State Tracking Considerations:**
- Track best state in FallbackSolver across solver switches
- Each underlying solver (Newton/Picard) tracks its own best state
- FallbackSolver preserves the overall best state (lowest residual across both solvers) when switching
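
A possible shape for this, sketched under assumptions: a single tracker owned by the fallback driver and lent to whichever solver is active, so the best state survives a switch without being reallocated. `BestStateTracker` and `run_phase` are hypothetical names; `run_phase` replays a scripted list of iterates in place of a real Newton or Picard phase.

```rust
/// Hypothetical tracker shared across solver switches.
#[derive(Debug)]
pub struct BestStateTracker {
    state: Vec<f64>,
    residual: f64,
}

impl BestStateTracker {
    /// Allocates the buffer once, before any iteration runs.
    pub fn new(initial_state: &[f64], initial_residual: f64) -> Self {
        Self {
            state: initial_state.to_vec(),
            residual: initial_residual,
        }
    }

    /// Records `candidate` if it strictly improves on the best residual.
    /// Returns `true` when the best state was updated.
    pub fn observe(&mut self, candidate: &[f64], residual: f64) -> bool {
        // `<` is false for NaN, so a NaN residual never replaces the best.
        if residual < self.residual {
            // Copies into the existing buffer; panics if lengths differ.
            self.state.copy_from_slice(candidate);
            self.residual = residual;
            true
        } else {
            false
        }
    }

    pub fn state(&self) -> &[f64] {
        &self.state
    }

    pub fn residual(&self) -> f64 {
        self.residual
    }
}

/// Stand-in for one solver phase: feeds scripted (state, residual) pairs
/// to the shared tracker, as a real solver would after each iteration.
pub fn run_phase(tracker: &mut BestStateTracker, iterates: &[(Vec<f64>, f64)]) {
    for (state, residual) in iterates {
        tracker.observe(state, *residual);
    }
}
```

With this layout, the timeout path reads the tracker directly and does not need to know which solver was active when the budget ran out.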

### Git Intelligence

Recent commits show:
- `be70a7a` — feat(core): implement physical types with NewType pattern
- Epic 1-3 complete (components, fluids, topology)
- Story 4.1-4.4 complete (Solver trait, Newton, Picard, Fallback)
- Ready for Time-Budgeted Solving implementation

### Project Context Reference

- **FR17:** [Source: epics.md — Solver respects configurable time budget (timeout)]
- **FR18:** [Source: epics.md — On timeout, solver returns best known state with NonConverged status]
- **FR20:** [Source: epics.md — Convergence criterion checks Delta Pressure < 1 Pa (1e-5 bar)]
- **NFR1:** [Source: prd.md — Steady State convergence time < 1 second for standard cycle in Cold Start]
- **NFR4:** [Source: prd.md — No dynamic allocation in solver loop (pre-calculated allocation only)]
- **NFR6:** [Source: prd.md — HIL latency < 20 ms for real-time integration with PLC]
- **NFR10:** [Source: prd.md — Graceful error handling: timeout, non-convergence, saturation return explicit Result<T, Error>]
- **Solver Architecture:** [Source: architecture.md — Trait-based static polymorphism with enum dispatch]
- **Error Handling:** [Source: architecture.md — Centralized error enum with thiserror]

### Story Completion Status

- **Status:** ready-for-dev
- **Completion note:** Ultimate context engine analysis completed — comprehensive developer guide created