# Story 2.3: Automatic Outlier Detection (Backend)

Status: review

## Story

As a system,
I want to identify statistical outliers in the background,
so that I can alert the user to potential data quality issues.

## Acceptance Criteria

1. **Algorithm Implementation:** Backend implements Isolation Forest (multivariate) and IQR (univariate) algorithms.
2. **Analysis Endpoint:** A POST endpoint `/api/v1/analysis/detect-outliers` accepts a dataset and a detection configuration.
3. **Detection Output:** Returns a list of outlier row indices, each with the reason it was flagged (e.g., "value above Q3 + 1.5 × IQR").
4. **Performance:** Detection on 50k rows completes in under 5 seconds.
5. **Robustness:** Handles missing values (NaNs) gracefully without crashing.
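
The criteria above leave the exact payload shape open. As a rough sketch only, the detection output could be modelled as below. The class names and the fields `row_index`, `method`, and `reason` are hypothetical and are not confirmed by this story.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class OutlierFlag:
    """One flagged row (hypothetical shape, not the actual API schema)."""
    row_index: int   # position of the row in the submitted dataset
    method: str      # assumed labels: "iqr" or "isolation_forest"
    reason: str      # human-readable explanation for the flag


@dataclass
class DetectOutliersResponse:
    """Assumed response body for the detect-outliers endpoint."""
    outliers: list[OutlierFlag] = field(default_factory=list)


example = DetectOutliersResponse(
    outliers=[
        OutlierFlag(row_index=17, method="iqr",
                    reason="price above Q3 + 1.5 * IQR"),
    ]
)

# asdict() recursively converts the dataclasses into JSON-friendly dicts.
payload = asdict(example)
```

Keeping the reason next to each row index lets a client show why a row was flagged without a second request.
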

## Tasks / Subtasks

- [x] **Dependency Update** (AC: 1)
  - [x] Add `scikit-learn` to the backend using `uv`.
- [x] **Outlier Engine Implementation** (AC: 1, 5)
  - [x] Create `backend/app/core/engine/clean.py`.
  - [x] Implement univariate IQR-based detection.
  - [x] Implement multivariate Isolation Forest detection.
- [x] **API Endpoint** (AC: 2, 3, 4)
  - [x] Implement `POST /api/v1/analysis/detect-outliers` in `analysis.py`.
  - [x] Map detection results to indexed row references.
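
The univariate subtask could look roughly like the sketch below. It is an illustration, not the contents of `clean.py`: the function name `detect_iqr_outliers`, the `k` parameter, and the flag dictionary keys are assumptions. It applies the 1.5 × IQR rule to each numeric column with pandas quantiles and ignores NaNs, so missing values neither crash the check nor get flagged.

```python
import pandas as pd


def detect_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> list[dict]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    flags = []
    for column in df.select_dtypes(include="number").columns:
        series = df[column].dropna()  # NaNs are skipped, never flagged
        if series.empty:
            continue
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outliers = series[(series < lower) | (series > upper)]
        for idx, value in outliers.items():
            side, bound = ("below", lower) if value < lower else ("above", upper)
            flags.append({
                # Assumes the default integer index, so labels are row positions.
                "row_index": int(idx),
                "column": column,
                "reason": f"{column}={value:g} is {side} the {k}*IQR bound ({bound:g})",
            })
    return flags


sample = pd.DataFrame({
    "price": [10.0, 11.0, 12.0, 11.5, None, 500.0],
    "label": list("abcdef"),  # non-numeric columns are ignored
})
iqr_flags = detect_iqr_outliers(sample)
```

In this sample only the row holding `500.0` falls outside the bounds; the row with the missing price is left alone.
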

## Dev Notes

- **Algorithms:** Multivariate detection uses scikit-learn's `IsolationForest`; univariate IQR detection uses pandas quantile logic.
- **Explainability:** Each outlier is returned with a descriptive string explaining why it was flagged.
- **Performance:** The detection logic is async-ready and relies on scikit-learn's standard optimisations.
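
To make the multivariate note concrete, here is a minimal sketch built on scikit-learn's `IsolationForest`. The function name, the default parameters, and the choice to drop rows containing NaNs (rather than impute them) are assumptions made for illustration. The story does not say how the real engine treats incomplete rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def detect_multivariate_outliers(df, contamination="auto", random_state=42):
    """Flag rows that IsolationForest labels as anomalies (-1)."""
    numeric = df.select_dtypes(include="number")
    complete = numeric.dropna()  # assumption: incomplete rows are skipped
    if len(complete) < 2:
        return []
    model = IsolationForest(contamination=contamination,
                            random_state=random_state)
    labels = model.fit_predict(complete)         # -1 = outlier, 1 = inlier
    scores = model.decision_function(complete)   # negative = anomalous
    return [
        {
            "row_index": int(idx),  # assumes an integer index
            "reason": f"Isolation Forest anomaly score {score:.3f} (below 0)",
        }
        for idx, label, score in zip(complete.index, labels, scores)
        if label == -1
    ]


rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x", "y"])
data.loc[200] = [100.0, 100.0]  # far from the cluster on both axes
data.loc[201] = [np.nan, 0.0]   # incomplete row, dropped before fitting
forest_flags = detect_multivariate_outliers(data)
```

Fixing `random_state` keeps the result reproducible, which helps when unit tests assert on specific flagged rows.
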

### Project Structure Notes

- Created `backend/app/core/engine/clean.py` for outlier logic.
- Updated `backend/app/api/v1/analysis.py` with the detection endpoint.
- Added `backend/tests/test_analysis.py` for verification.

### References

- [Source: epics.md#Story 2.3]
- [Source: project-context.md#Critical Anti-Patterns]

## Dev Agent Record

### Agent Model Used

{{agent_model_name_version}}

### Completion Notes List

- Integrated `scikit-learn` for anomaly detection.
- Implemented univariate detection based on 1.5 * IQR bounds.
- Implemented multivariate detection using the Isolation Forest algorithm.
- Developed a robust API endpoint that merges results from both methods.
- Verified with unit tests covering both univariate and multivariate scenarios.
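
The notes do not spell out how the endpoint merges the two result sets. One plausible approach, sketched here with a hypothetical `merge_outlier_flags` helper and assumed `row_index` / `reason` keys, is to group every reason under its row index so a row flagged by both methods appears once.

```python
from collections import defaultdict


def merge_outlier_flags(*flag_lists):
    """Combine flags from several detectors into one entry per row index."""
    reasons_by_row = defaultdict(list)
    for flags in flag_lists:
        for flag in flags:
            reasons_by_row[flag["row_index"]].append(flag["reason"])
    # Sort by row index so the response order is deterministic.
    return [
        {"row_index": idx, "reasons": reasons_by_row[idx]}
        for idx in sorted(reasons_by_row)
    ]


# Illustrative inputs only; real flags would come from the detection engine.
iqr_result = [{"row_index": 5, "reason": "price above Q3 + 1.5 * IQR"}]
forest_result = [
    {"row_index": 5, "reason": "Isolation Forest anomaly score below 0"},
    {"row_index": 2, "reason": "Isolation Forest anomaly score below 0"},
]
merged = merge_outlier_flags(iqr_result, forest_result)
```

Here row 5 ends up with two reasons and row 2 with one, and both rows appear exactly once in the merged list.
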

### File List

- /backend/app/core/engine/clean.py
- /backend/app/api/v1/analysis.py
- /backend/tests/test_analysis.py
- /backend/pyproject.toml