Entropyk/_bmad-output/implementation-artifacts/4-5-time-budgeted-solving.md

# Story 4.5: Time-Budgeted Solving
Status: done
<!-- Note: Validation is optional. Run validate-create-story for quality check before dev-story. -->
## Story
As a HIL engineer (Sarah),
I want strict timeout with graceful degradation,
so that real-time constraints are never violated.
## Acceptance Criteria
1. **Strict Timeout Enforcement** (AC: #1)
   - Given solver with timeout = 1000ms
   - When time budget exceeded
   - Then solver stops at the next iteration boundary (no new iteration starts after the timeout)
   - And timeout is checked at each iteration start
2. **Best State Return on Timeout** (AC: #2)
   - Given solver that times out
   - When returning from timeout
   - Then returns `ConvergedState` with `status = TimedOutWithBestState`
   - And `state` contains the best-known state (lowest residual norm encountered)
   - And `iterations` contains the count of completed iterations
   - And `final_residual` contains the best residual norm
3. **HIL Zero-Order Hold (ZOH) Support** (AC: #3)
   - Given HIL scenario with previous state available
   - When timeout occurs
   - Then solver can optionally return previous state instead of current best
   - And `zoh_fallback: bool` config option controls this behavior
4. **Timeout Across Fallback Switches** (AC: #4)
   - Given `FallbackSolver` with timeout configured
   - When fallback occurs between Newton and Picard
   - Then timeout applies to total solving time (already implemented in Story 4.4)
   - And best state is preserved across solver switches
5. **Pre-Allocated Buffers** (AC: #5)
   - Given a finalized `System`
   - When the solver initializes
   - Then all buffers for tracking best state are pre-allocated
   - And no heap allocation occurs during iteration loop
6. **Configurable Timeout Behavior** (AC: #6)
   - Given `TimeoutConfig` struct
   - When setting `return_best_state_on_timeout: false`
   - Then solver returns `SolverError::Timeout` instead of `ConvergedState`
   - And `zoh_fallback` and `return_best_state_on_timeout` are configurable
## Tasks / Subtasks
- [ ] Implement `TimeoutConfig` struct in `crates/solver/src/solver.rs` (AC: #6)
  - [ ] Add `return_best_state_on_timeout: bool` (default: true)
  - [ ] Add `zoh_fallback: bool` (default: false)
  - [ ] Implement `Default` trait
- [ ] Add best-state tracking to `NewtonConfig` (AC: #1, #2, #5)
  - [ ] Add `best_state: Vec<f64>` pre-allocated buffer
  - [ ] Add `best_residual: f64` tracking variable
  - [ ] Update best state when residual improves
  - [ ] Return `ConvergedState` with `TimedOutWithBestState` on timeout
- [ ] Add best-state tracking to `PicardConfig` (AC: #1, #2, #5)
  - [ ] Add `best_state: Vec<f64>` pre-allocated buffer
  - [ ] Add `best_residual: f64` tracking variable
  - [ ] Update best state when residual improves
  - [ ] Return `ConvergedState` with `TimedOutWithBestState` on timeout
- [ ] Update `FallbackSolver` for best-state preservation (AC: #4)
  - [ ] Track best state across solver switches
  - [ ] Return best state on timeout regardless of which solver was active
- [ ] Implement ZOH fallback support (AC: #3)
  - [ ] Add `previous_state: Option<Vec<f64>>` to solver configs
  - [ ] On timeout with `zoh_fallback: true`, return previous state if available
- [ ] Integration tests (AC: #1-#6)
  - [ ] Test timeout returns best state (not error)
  - [ ] Test best state is actually the lowest residual encountered
  - [ ] Test ZOH fallback returns previous state
  - [ ] Test timeout behavior with `return_best_state_on_timeout: false`
  - [ ] Test timeout across fallback switches preserves best state
  - [ ] Test no heap allocation during iteration with best-state tracking
## Dev Notes
### Epic Context
**Epic 4: Intelligent Solver Engine** - Solve any system with a < 1 s guarantee, using Newton-Raphson with a Sequential Substitution (Picard) fallback.
**Story Dependencies:**
- **Story 4.1 (Solver Trait Abstraction)** DONE: `Solver` trait, `SolverError`, `ConvergedState` defined
- **Story 4.2 (Newton-Raphson Implementation)** DONE: Full Newton-Raphson with line search, timeout, divergence detection
- **Story 4.3 (Sequential Substitution)** DONE: Picard implementation with relaxation, timeout, divergence detection
- **Story 4.4 (Intelligent Fallback Strategy)** DONE: FallbackSolver with timeout across switches
- **Story 4.6 (Smart Initialization Heuristic)** NEXT: Automatic initial guesses from temperatures
**FRs covered:** FR17 (configurable timeout), FR18 (best state on timeout), FR20 (convergence criterion)
### Architecture Context
**Technical Stack:**
- `thiserror` for error handling (already in solver)
- `tracing` for observability (already in solver)
- `std::time::Instant` for timeout enforcement
**Code Structure:**
- `crates/solver/src/solver.rs`: `NewtonConfig`, `PicardConfig`, `FallbackSolver` modifications
- `crates/solver/src/system.rs`: EXISTING, `System` with `compute_residuals()`
**Relevant Architecture Decisions:**
- **No allocation in hot path:** Pre-allocate best-state buffers before iteration loop [Source: architecture.md]
- **Error Handling:** Centralized error enum with `thiserror` [Source: architecture.md]
- **Zero-panic policy:** All operations return `Result` [Source: architecture.md]
- **HIL latency < 20ms:** Real-time constraints must be respected [Source: prd.md NFR6]
### Developer Context
**Existing Implementation (Story 4.1 + 4.2 + 4.3 + 4.4):**
```rust
// crates/solver/src/solver.rs - EXISTING
pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState, // Already defined for this story!
}

pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

// Current timeout behavior (Story 4.2/4.3):
// Returns Err(SolverError::Timeout { timeout_ms }) on timeout
// This story changes it to return Ok(ConvergedState { status: TimedOutWithBestState })
```
**Current Timeout Implementation:**
```rust
// In NewtonConfig::solve() and PicardConfig::solve()
if let Some(timeout) = self.timeout {
    if start_time.elapsed() > timeout {
        tracing::info!(...);
        return Err(SolverError::Timeout { timeout_ms: ... });
    }
}
```
**What Needs to Change:**
1. Track best state during iteration (pre-allocated buffer)
2. On timeout, return `Ok(ConvergedState { status: TimedOutWithBestState, ... })`
3. Make this behavior configurable via `TimeoutConfig`
### Technical Requirements
**Best-State Tracking Algorithm:**
```
Input: System, timeout
Output: ConvergedState (Converged or TimedOutWithBestState)

1. Initialize:
   - best_state = pre-allocated buffer (copy of initial state)
   - best_residual = initial residual norm
   - start_time = Instant::now()
2. Each iteration:
   a. Check timeout BEFORE starting iteration
   b. Compute residuals and update state
   c. If new residual < best_residual:
      - Copy current state to best_state
      - Update best_residual = new residual
   d. Check convergence
3. On timeout:
   - If return_best_state_on_timeout:
     - Return Ok(ConvergedState {
         state: best_state,
         iterations: completed_iterations,
         final_residual: best_residual,
         status: TimedOutWithBestState,
       })
   - Else:
     - Return Err(SolverError::Timeout { timeout_ms })
```
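The algorithm above could be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not the solver's actual code: `solve_with_budget`, the `step` closure (which stands in for one Newton or Picard update and returns the new residual norm), the `initial_residual` parameter, the `NonConvergence` variant, and the `u64` type of `timeout_ms` are all hypothetical and only serve the illustration.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvergenceStatus {
    Converged,
    TimedOutWithBestState,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ConvergedState {
    pub state: Vec<f64>,
    pub iterations: usize,
    pub final_residual: f64,
    pub status: ConvergenceStatus,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SolverError {
    Timeout { timeout_ms: u64 },
    // Hypothetical variant for the max-iterations exit of this sketch.
    NonConvergence { iterations: usize, final_residual: f64 },
}

/// Hypothetical driver: `step` updates the state in place and returns
/// the residual norm of the updated state.
pub fn solve_with_budget<F>(
    mut state: Vec<f64>,
    initial_residual: f64,
    timeout: Duration,
    tolerance: f64,
    max_iterations: usize,
    return_best_state_on_timeout: bool,
    mut step: F,
) -> Result<ConvergedState, SolverError>
where
    F: FnMut(&mut [f64]) -> f64,
{
    // 1. Initialize: the only allocation happens here, before the loop.
    let mut best_state = state.clone();
    let mut best_residual = initial_residual;
    let start_time = Instant::now();

    for iteration in 0..max_iterations {
        // 2a. Check the budget BEFORE starting the iteration.
        if start_time.elapsed() >= timeout {
            // 3. On timeout: best state or error, depending on configuration.
            return if return_best_state_on_timeout {
                Ok(ConvergedState {
                    state: best_state,
                    iterations: iteration, // completed iterations only
                    final_residual: best_residual,
                    status: ConvergenceStatus::TimedOutWithBestState,
                })
            } else {
                Err(SolverError::Timeout { timeout_ms: timeout.as_millis() as u64 })
            };
        }

        // 2b. One solver update (Newton or Picard in the real solver).
        let residual = step(state.as_mut_slice());

        // 2c. Keep the lowest-residual state; copy_from_slice reuses the buffer.
        if residual < best_residual {
            best_state.copy_from_slice(&state);
            best_residual = residual;
        }

        // 2d. Check convergence.
        if residual < tolerance {
            return Ok(ConvergedState {
                state,
                iterations: iteration + 1,
                final_residual: residual,
                status: ConvergenceStatus::Converged,
            });
        }
    }

    Err(SolverError::NonConvergence { iterations: max_iterations, final_residual: best_residual })
}
```

Because the budget is only checked at iteration boundaries, an iteration that is already running can finish after the deadline. The sketch bounds the overrun by the duration of one `step` call and does not interrupt it.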
**Key Design Decisions:**
| Decision | Rationale |
|----------|-----------|
| Check timeout at iteration start | No new iteration starts once the budget is exhausted; overrun is bounded by one iteration |
| Pre-allocate best_state buffer | No heap allocation in hot path (NFR4) |
| Track best residual, not latest | Best state is more useful for HIL |
| Configurable return behavior | Some users prefer error on timeout |
| ZOH fallback optional | HIL-specific feature, not always needed |
**TimeoutConfig Structure:**
```rust
pub struct TimeoutConfig {
    /// Return best-known state on timeout instead of error.
    /// Default: true (graceful degradation for HIL)
    pub return_best_state_on_timeout: bool,
    /// On timeout, return previous state (ZOH) instead of current best.
    /// Requires `previous_state` to be set before solving.
    /// Default: false
    pub zoh_fallback: bool,
}
```
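The `Default` implementation called for in the task list might look like the sketch below. The field names and default values come from the struct documentation above; the derived traits are an assumption added for convenience in tests.

```rust
/// Timeout behavior configuration (fields as described in this story).
#[derive(Debug, Clone, Copy, PartialEq, Eq)] // derives are an assumption of this sketch
pub struct TimeoutConfig {
    /// Return best-known state on timeout instead of error.
    pub return_best_state_on_timeout: bool,
    /// On timeout, return previous state (ZOH) instead of current best.
    pub zoh_fallback: bool,
}

impl Default for TimeoutConfig {
    fn default() -> Self {
        Self {
            // Graceful degradation is the default for HIL use.
            return_best_state_on_timeout: true,
            // ZOH is opt-in because it needs a previous state.
            zoh_fallback: false,
        }
    }
}
```

A caller that wants the strict error behavior could then write `TimeoutConfig { return_best_state_on_timeout: false, ..Default::default() }` and leave the other field at its default.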
**Integration with Existing Configs:**
```rust
pub struct NewtonConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,
    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,
    // NEW: Previous state for ZOH fallback (set via with_previous_state())
    pub previous_state: Option<Vec<f64>>,
    // NOTE: The pre-allocated buffer for best-state tracking is
    // allocated once in solve() and is not stored in the config.
}

pub struct PicardConfig {
    // ... existing fields ...
    pub timeout: Option<Duration>,
    // NEW: Timeout behavior configuration
    pub timeout_config: TimeoutConfig,
    // NEW: Previous state for ZOH fallback
    pub previous_state: Option<Vec<f64>>,
}
```
**ZOH (Zero-Order Hold) for HIL:**
```rust
impl NewtonConfig {
    /// Set previous state for ZOH fallback on timeout.
    pub fn with_previous_state(mut self, state: Vec<f64>) -> Self {
        self.previous_state = Some(state);
        self
    }

    // In solve():
    // On timeout with zoh_fallback=true and previous_state available:
    // Return previous_state instead of best_state
}
```
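The decision made on timeout can be isolated in a small pure function, which keeps it easy to unit-test. The sketch below is illustrative only: `select_timeout_state` is a hypothetical helper, and it assumes that `return_best_state_on_timeout: false` takes precedence over `zoh_fallback` (the story does not state how the two flags interact).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeoutConfig {
    pub return_best_state_on_timeout: bool,
    pub zoh_fallback: bool,
}

/// Hypothetical helper: choose which state to hand back on timeout.
///
/// Returns `None` when the caller should return `SolverError::Timeout`.
/// ASSUMPTION: the error setting wins over `zoh_fallback`.
pub fn select_timeout_state<'a>(
    config: &TimeoutConfig,
    previous_state: Option<&'a [f64]>,
    best_state: &'a [f64],
) -> Option<&'a [f64]> {
    if !config.return_best_state_on_timeout {
        return None;
    }
    match (config.zoh_fallback, previous_state) {
        // ZOH requested and a previous state is available: hold it.
        (true, Some(previous)) => Some(previous),
        // ZOH disabled, or no previous state was set: use the best state.
        _ => Some(best_state),
    }
}
```

Falling back to the best state when `zoh_fallback` is set but no previous state exists matches the "if available" wording in the task list. Which residual to report in `final_residual` for a held state is left open here.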
### Architecture Compliance
- **NewType pattern:** Use `Pressure`, `Temperature` from core where applicable
- **No bare f64** in public API where physical meaning exists
- **tracing:** Use `tracing::info!` for timeout events, `tracing::debug!` for best-state updates
- **Result<T, E>:** On timeout with `return_best_state_on_timeout: true`, return `Ok(ConvergedState)`
- **approx:** Use `assert_relative_eq!` in tests for floating-point comparisons
- **Pre-allocation:** Best-state buffer allocated once before iteration loop
### Library/Framework Requirements
- **thiserror** — Error enum derive (already in solver)
- **tracing** — Structured logging (already in solver)
- **std::time::Instant** — Timeout enforcement
### File Structure Requirements
**Modified files:**
- `crates/solver/src/solver.rs` — Add `TimeoutConfig`, modify `NewtonConfig`, `PicardConfig`, `FallbackSolver`
**Tests:**
- Unit tests in `solver.rs` (timeout behavior, best-state tracking, ZOH fallback)
- Integration tests in `tests/` directory (full system solving with timeout)
### Testing Requirements
**Unit Tests:**
- TimeoutConfig defaults are sensible
- Best state is tracked correctly during iteration
- Timeout returns `ConvergedState` with `TimedOutWithBestState`
- ZOH fallback returns previous state when configured
- `return_best_state_on_timeout: false` returns error on timeout
**Integration Tests:**
- System that times out returns best state (not error)
- Best state has lower residual than initial state
- Timeout across fallback switches preserves best state
- HIL scenario: ZOH fallback returns previous state
**Performance Tests:**
- No heap allocation during iteration with best-state tracking
- Timeout check overhead is negligible (< 1μs per check)
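One possible way to test the no-allocation requirement is a counting global allocator. The sketch below is an assumption about how such a test could be built, not the project's existing harness: `CountingAllocator` and `allocations_during` are hypothetical names, and the counter is thread-local so that allocations on other test threads do not disturb the measurement. Note that `System` here is `std::alloc::System`, not the solver's `System` type.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread allocation counter; const-initialized so that reading it
    // does not itself allocate.
    static ALLOCATIONS: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical test-only allocator that counts allocations per thread.
pub struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // try_with avoids a panic if the thread-local is being torn down.
        let _ = ALLOCATIONS.try_with(|count| count.set(count.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Run `f` and report how many heap allocations the current thread made.
pub fn allocations_during<R>(f: impl FnOnce() -> R) -> (usize, R) {
    let before = ALLOCATIONS.with(|count| count.get());
    let result = f();
    let after = ALLOCATIONS.with(|count| count.get());
    (after - before, result)
}
```

A test would allocate the best-state buffer first, then wrap only the iteration loop in `allocations_during` and assert that the count is zero. Since a binary can have only one `#[global_allocator]`, this would most likely live in a dedicated integration test file.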
### Previous Story Intelligence (4.4)
**FallbackSolver Implementation Complete:**
- `FallbackConfig` with `fallback_enabled`, `return_to_newton_threshold`, `max_fallback_switches`
- `FallbackSolver` wrapping `NewtonConfig` and `PicardConfig`
- Timeout applies to total solving time across switches
- Pre-allocated buffers pattern established
**Key Patterns to Follow:**
- Use `residual_norm()` helper for L2 norm calculation
- Use `tracing::debug!` for iteration logging
- Use `tracing::info!` for timeout events
- Return `ConvergedState::new()` on success
**Best-State Tracking Considerations:**
- Track best state in FallbackSolver across solver switches
- Each underlying solver (Newton/Picard) tracks its own best state
- FallbackSolver preserves best state when switching
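Preserving the best state across solver switches could be done with a single tracker that both solvers report into. The following is a sketch under assumptions: `BestStateTracker` and its methods are hypothetical names, and how `FallbackSolver` would pass the tracker to Newton and Picard is not specified in this story.

```rust
/// Hypothetical tracker owned by the fallback solver and shared by
/// whichever underlying solver is currently active.
#[derive(Debug, Clone)]
pub struct BestStateTracker {
    state: Vec<f64>,
    residual: f64,
}

impl BestStateTracker {
    /// Allocates the buffer once, before any iteration runs.
    pub fn new(initial_state: &[f64], initial_residual: f64) -> Self {
        Self { state: initial_state.to_vec(), residual: initial_residual }
    }

    /// Records `candidate` if it strictly improves on the best residual.
    /// Returns `true` when the best state was updated.
    ///
    /// A NaN residual never compares as smaller, so it is ignored. A length
    /// mismatch is rejected instead of panicking (zero-panic policy).
    pub fn observe(&mut self, candidate: &[f64], residual: f64) -> bool {
        let improved = residual < self.residual;
        if !improved || candidate.len() != self.state.len() {
            return false;
        }
        // Reuses the existing buffer: no heap allocation in the hot path.
        self.state.copy_from_slice(candidate);
        self.residual = residual;
        true
    }

    pub fn state(&self) -> &[f64] {
        &self.state
    }

    pub fn residual(&self) -> f64 {
        self.residual
    }
}
```

With this shape, a switch from Newton to Picard only changes who calls `observe`. The tracker keeps the lowest residual seen by either solver, so the state returned on timeout is independent of which solver was active.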
### Git Intelligence
Recent commits show:
- `be70a7a` feat(core): implement physical types with NewType pattern
- Epic 1-3 complete (components, fluids, topology)
- Story 4.1-4.4 complete (Solver trait, Newton, Picard, Fallback)
- Ready for Time-Budgeted Solving implementation
### Project Context Reference
- **FR17:** [Source: epics.md - Solver respects configurable time budget (timeout)]
- **FR18:** [Source: epics.md - On timeout, solver returns best known state with NonConverged status]
- **FR20:** [Source: epics.md - Convergence criterion checks Delta Pressure < 1 Pa (1e-5 bar)]
- **NFR1:** [Source: prd.md - Steady State convergence time < 1 second for standard cycle in Cold Start]
- **NFR4:** [Source: prd.md - No dynamic allocation in solver loop (pre-calculated allocation only)]
- **NFR6:** [Source: prd.md - HIL latency < 20 ms for real-time integration with PLC]
- **NFR10:** [Source: prd.md - Graceful error handling: timeout, non-convergence, saturation return explicit Result<T, Error>]
- **Solver Architecture:** [Source: architecture.md — Trait-based static polymorphism with enum dispatch]
- **Error Handling:** [Source: architecture.md — Centralized error enum with thiserror]
### Story Completion Status
- **Status:** ready-for-dev
- **Completion note:** Ultimate context engine analysis completed — comprehensive developer guide created