Analysis/_bmad-output/implementation-artifacts/2-3-detection-automatique-des-outliers-backend.md
2026-01-11 22:56:02 +01:00

# Story 2.3: Automatic Outlier Detection (Backend)
Status: review
## Story
As a system,
I want to identify statistical outliers in the background,
so that I can alert the user to potential data quality issues.
## Acceptance Criteria
1. **Algorithm Implementation:** Backend implements Isolation Forest (multivariate) and IQR (univariate) algorithms.
2. **Analysis Endpoint:** A POST endpoint `/api/v1/analysis/detect-outliers` accepts a dataset and a detection configuration.
3. **Detection Output:** Returns a list of outlier row indices, each with the reason it was flagged (e.g., "value above Q3 + 1.5 × IQR").
4. **Performance:** Detection on 50k rows completes in under 5 seconds.
5. **Robustness:** Handles missing values (NaNs) gracefully without crashing.
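
As an illustration of criterion 3, the detection output could be shaped roughly as follows. This is a minimal sketch, not the actual response schema: the `OutlierFlag` class, its field names, and the `outliers` key are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OutlierFlag:
    """One flagged row (hypothetical shape, not taken from the codebase)."""
    row_index: int  # position of the row in the submitted dataset
    method: str     # "iqr" or "isolation_forest"
    reason: str     # human-readable explanation for the flag


def to_response(flags: list[OutlierFlag]) -> str:
    """Serialise the flags into a JSON body, ordered by row index."""
    ordered = sorted(flags, key=lambda f: f.row_index)
    return json.dumps({"outliers": [asdict(f) for f in ordered]})


flags = [
    OutlierFlag(7, "isolation_forest", "isolated by Isolation Forest"),
    OutlierFlag(2, "iqr", "price above Q3 + 1.5 * IQR"),
]
body = to_response(flags)
```

Keeping the method and the reason as separate fields would let a client filter by algorithm while still showing the explanation verbatim.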
## Tasks / Subtasks
- [x] **Dependency Update** (AC: 1)
  - [x] Add `scikit-learn` to the backend using `uv`.
- [x] **Outlier Engine Implementation** (AC: 1, 5)
  - [x] Create `backend/app/core/engine/clean.py`.
  - [x] Implement univariate IQR-based detection.
  - [x] Implement multivariate Isolation Forest detection.
- [x] **API Endpoint** (AC: 2, 3, 4)
  - [x] Implement `POST /api/v1/analysis/detect-outliers` in `analysis.py`.
  - [x] Map detection results to indexed row references.
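
The univariate IQR subtask can be sketched as below. This is an illustrative version under stated assumptions, not the code in `clean.py`: the function name `detect_iqr_outliers`, the `k` parameter, and the wording of the reason strings are hypothetical, and the DataFrame is assumed to use its default integer index. Missing values are dropped per column, so a NaN is never flagged and never raises.

```python
import pandas as pd


def detect_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> dict[int, list[str]]:
    """Map each flagged row index to the reasons it was flagged."""
    reasons: dict[int, list[str]] = {}
    for column in df.select_dtypes(include="number").columns:
        series = df[column].dropna()  # NaNs are skipped, never flagged
        if series.empty:
            continue
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        flagged = series[(series < lower) | (series > upper)]
        for idx, value in flagged.items():
            side = "below" if value < lower else "above"
            bound = lower if value < lower else upper
            reasons.setdefault(int(idx), []).append(
                f"{column}={value:g} is {side} the IQR bound {bound:g}"
            )
    return reasons


# Hypothetical sample: one missing value and one extreme value.
sample = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, None, 500.0]})
iqr_flags = detect_iqr_outliers(sample)
```

For the sample above, the quartiles of the five non-missing values are 11 and 13, so the bounds are 8 and 16 and only row 5 is flagged.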
## Dev Notes
- **Algorithms:** Used scikit-learn's `IsolationForest` for multivariate detection and pandas quantile logic for the univariate IQR check.
- **Explainability:** Each outlier is returned with a descriptive string explaining the reason for the flag.
- **Performance:** Async-ready; relies on scikit-learn's standard optimisations.
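
A minimal sketch of the multivariate path, assuming rows that contain a NaN are simply excluded from the model (the story does not say how the real engine treats them). The function name, the `contamination="auto"` default, and the fixed `random_state` are illustrative choices, not values taken from the implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def detect_multivariate_outliers(
    df: pd.DataFrame, contamination="auto", random_state: int = 0
) -> list[int]:
    """Return index labels of rows that Isolation Forest marks as outliers."""
    numeric = df.select_dtypes(include="number")
    complete = numeric.dropna()  # assumption: rows with any NaN are skipped
    if len(complete) < 2 or complete.shape[1] == 0:
        return []
    model = IsolationForest(contamination=contamination, random_state=random_state)
    labels = model.fit_predict(complete.to_numpy())  # -1 = outlier, 1 = inlier
    return [int(i) for i in complete.index[labels == -1]]


# Hypothetical data: a tight cluster, one distant point, one incomplete row.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
rows = np.vstack([cluster, [[50.0, 50.0]], [[np.nan, 1.0]]])
frame = pd.DataFrame(rows, columns=["x", "y"])
forest_flags = detect_multivariate_outliers(frame)
```

With `contamination="auto"` the model may also flag some points on the edge of the cluster, so a production setting would likely need this parameter tuned or exposed through the request configuration.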
### Project Structure Notes
- Created `backend/app/core/engine/clean.py` for outlier logic.
- Updated `backend/app/api/v1/analysis.py` with the detection endpoint.
- Added `backend/tests/test_analysis.py` for verification.
### References
- [Source: epics.md#Story 2.3]
- [Source: project-context.md#Critical Anti-Patterns]
## Dev Agent Record
### Agent Model Used
{{agent_model_name_version}}
### Completion Notes List
- Integrated `scikit-learn` for anomaly detection.
- Implemented univariate detection based on 1.5 * IQR bounds.
- Implemented multivariate detection using the Isolation Forest algorithm.
- Developed a robust API endpoint that merges results from both methods.
- Verified with unit tests covering both univariate and multivariate scenarios.
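
The merge of both methods' results noted above might look roughly like this. It is a sketch only: `merge_flags`, its input shapes, and the reason text for Isolation Forest hits are assumptions rather than the endpoint's actual code.

```python
def merge_flags(
    iqr_reasons: dict[int, list[str]], forest_rows: list[int]
) -> list[dict]:
    """Combine per-method results into one entry per flagged row."""
    # Copy the IQR reasons so the caller's dictionary is not mutated.
    merged = {idx: list(reasons) for idx, reasons in iqr_reasons.items()}
    for idx in forest_rows:
        merged.setdefault(idx, []).append(
            "isolated by Isolation Forest (multivariate)"
        )
    # One record per row, sorted so the output is stable across runs.
    return [{"row_index": idx, "reasons": merged[idx]} for idx in sorted(merged)]


result = merge_flags({5: ["price above upper IQR bound"]}, [2, 5])
```

A row caught by both methods keeps both reasons, which preserves the explainability requirement without duplicating the row in the response.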
### File List
- /backend/app/core/engine/clean.py
- /backend/app/api/v1/analysis.py
- /backend/tests/test_analysis.py
- /backend/pyproject.toml