# Story 2.3: Automatic Outlier Detection (Backend)

Status: review

## Story

As a system,
I want to identify statistical outliers in the background,
so that I can alert the user to potential data quality issues.

## Acceptance Criteria

1. **Algorithm Implementation:** Backend implements Isolation Forest (multivariate) and IQR (univariate) algorithms.
2. **Analysis Endpoint:** A POST endpoint `/api/v1/analysis/detect-outliers` accepts a dataset and a configuration.
3. **Detection Output:** Returns a list of outlier row indices and the reason for flagging (e.g., "z-score > 3").
4. **Performance:** Detection on 50k rows completes in under 5 seconds.
5. **Robustness:** Handles missing values (NaNs) gracefully without crashing.

## Tasks / Subtasks

- [x] **Dependency Update** (AC: 1)
  - [x] Add `scikit-learn` to the backend using `uv`.
- [x] **Outlier Engine Implementation** (AC: 1, 5)
  - [x] Create `backend/app/core/engine/clean.py`.
  - [x] Implement univariate IQR-based detection.
  - [x] Implement multivariate Isolation Forest detection.
- [x] **API Endpoint** (AC: 2, 3, 4)
  - [x] Implement `POST /api/v1/analysis/detect-outliers` in `analysis.py`.
  - [x] Map detection results to indexed row references.

## Dev Notes

- **Algorithms:** Uses scikit-learn's `IsolationForest` for multivariate detection and Pandas quantile logic for univariate IQR detection.
- **Explainability:** Each outlier is returned with a descriptive string explaining why it was flagged.
- **Performance:** Async-ready; relies on standard scikit-learn optimisations.

### Project Structure Notes

- Created `backend/app/core/engine/clean.py` for outlier logic.
- Updated `backend/app/api/v1/analysis.py` with the detection endpoint.
- Added `backend/tests/test_analysis.py` for verification.

### References

- [Source: epics.md#Story 2.3]
- [Source: project-context.md#Critical Anti-Patterns]

## Dev Agent Record

### Agent Model Used

{{agent_model_name_version}}

### Completion Notes List

- Integrated `scikit-learn` for anomaly detection.
- Implemented univariate detection based on 1.5 * IQR bounds.
- Implemented multivariate detection using the Isolation Forest algorithm.
- Developed a robust API endpoint that merges results from both methods.
- Verified with unit tests covering both univariate and multivariate scenarios.

### File List

- /backend/app/core/engine/clean.py
- /backend/app/api/v1/analysis.py
- /backend/tests/test_analysis.py
- /backend/pyproject.toml
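
### Illustrative Sketch: Univariate IQR Detection

The univariate rule from the Completion Notes (flag values outside the 1.5 * IQR bounds, tolerate NaNs, return row indices with a reason string) can be sketched as follows. This is a minimal, dependency-free illustration and not the contents of `clean.py`: the function name `detect_iqr_outliers`, the parameter `k`, and the wording of the reason strings are all assumptions. It also swaps the Pandas quantile logic mentioned in the Dev Notes for the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points in the same way as the pandas default.

```python
import math
from statistics import quantiles


def detect_iqr_outliers(values, k=1.5):
    """Return (row_index, reason) pairs for values outside the IQR fences.

    Hypothetical helper for illustration only; names and the reason-string
    format are assumptions, not the project's actual API.
    """
    # Drop missing values (None or NaN) but remember each original row index,
    # so flagged rows can still be mapped back to the dataset.
    present = [
        (idx, val)
        for idx, val in enumerate(values)
        if val is not None and not math.isnan(val)
    ]
    # Quartiles are not meaningful with fewer than two observations.
    if len(present) < 2:
        return []

    # Linear-interpolation quartiles over the non-missing values.
    q1, _, q3 = quantiles([val for _, val in present], n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr

    flagged = []
    for idx, val in present:
        if val < lower:
            flagged.append((idx, f"value {val} < lower IQR bound {lower:.2f}"))
        elif val > upper:
            flagged.append((idx, f"value {val} > upper IQR bound {upper:.2f}"))
    return flagged


# Example column with one NaN, one None, and one obvious outlier at row 6.
column = [10.0, 12.0, 11.0, float("nan"), 13.0, 12.0, 100.0, None, 11.0]
result = detect_iqr_outliers(column)
```

Because missing values are skipped rather than imputed, the returned indices always refer to rows of the original input, which is what the endpoint needs when mapping results to indexed row references.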