Story 2.3: Automatic Outlier Detection (Backend)
Status: review
Story
As a system, I want to identify statistical outliers in the background, so that I can alert the user to potential data quality issues.
Acceptance Criteria
- Algorithm Implementation: Backend implements Isolation Forest (multivariate) and IQR (univariate) algorithms.
- Analysis Endpoint: A POST endpoint /api/v1/analysis/detect-outliers accepts a dataset and configuration.
- Detection Output: Returns a list of outlier row indices and the reason for flagging (e.g., "z-score > 3").
- Performance: Detection on 50k rows completes in under 5 seconds.
- Robustness: Handles missing values (NaNs) gracefully without crashing.
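To make the Detection Output criterion concrete, here is a minimal sketch of one possible response shape. The names OutlierFlag, DetectionResponse, row_index, reason, and method are assumptions chosen for illustration; the story does not specify the actual schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class OutlierFlag:
    """One flagged row (hypothetical shape, not the project's schema)."""
    row_index: int  # zero-based position of the row in the submitted dataset
    reason: str     # human-readable explanation of the flag
    method: str     # assumed label: "iqr" or "isolation_forest"


@dataclass
class DetectionResponse:
    """Hypothetical response body for the detect-outliers endpoint."""
    outliers: list[OutlierFlag] = field(default_factory=list)

    @property
    def outlier_indices(self) -> list[int]:
        # Sorted, de-duplicated row indices across all flags.
        return sorted({flag.row_index for flag in self.outliers})


response = DetectionResponse(
    outliers=[
        OutlierFlag(7, "price above upper IQR bound", "iqr"),
        OutlierFlag(2, "isolated by Isolation Forest", "isolation_forest"),
        OutlierFlag(7, "isolated by Isolation Forest", "isolation_forest"),
    ]
)
payload = asdict(response)  # plain dict, ready for JSON serialisation
```

Keeping one entry per (row, reason) pair lets a row flagged by both methods carry both explanations, while outlier_indices still gives the flat list of row indices the criterion asks for.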
Tasks / Subtasks
- Dependency Update (AC: 1)
  - Add scikit-learn to the backend using uv.
- Outlier Engine Implementation (AC: 1, 5)
  - Create backend/app/core/engine/clean.py.
  - Implement univariate IQR-based detection.
  - Implement multivariate Isolation Forest detection.
- API Endpoint (AC: 2, 3, 4)
  - Implement POST /api/v1/analysis/detect-outliers in analysis.py.
  - Map detection results to indexed row references.
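The last subtask, mapping detection results to indexed row references, could look roughly like the sketch below. It assumes the references are zero-based row positions rather than DataFrame index labels; the helper name mask_to_row_indices and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_to_row_indices(mask: pd.Series) -> list[int]:
    """Convert a boolean outlier mask into zero-based row positions.

    Positions (not index labels) are returned so the references stay valid
    even when the DataFrame index is non-default, e.g. after filtering.
    """
    return np.flatnonzero(mask.to_numpy(dtype=bool)).tolist()


# Hypothetical data with a non-contiguous index, as left behind by a filter.
df = pd.DataFrame({"value": [10, 11, 500, 12]}, index=[3, 8, 15, 42])
mask = df["value"] > 100  # stand-in for a real detection result
row_indices = mask_to_row_indices(mask)
```

Here the flagged row has index label 15 but position 2. If the client addresses rows by label instead, the mapping would need to return mask.index[mask] rather than positions.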
Dev Notes
- Algorithms: Used Scikit-learn's IsolationForest for multivariate detection and Pandas quantile logic for univariate IQR detection.
- Explainability: Each outlier is returned with a descriptive string explaining the reason for the flag.
- Performance: Async-ready, using standard Scikit-learn optimisations.
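As an illustration of the quantile-based IQR logic and the descriptive reason strings mentioned above, here is a minimal sketch. The function name detect_iqr_outliers, the default multiplier k, and the wording of the reason strings are assumptions, not the contents of clean.py.

```python
import numpy as np
import pandas as pd


def detect_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> dict[int, list[str]]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column.

    Returns a mapping of zero-based row position -> list of reasons.
    NaNs are skipped by quantile() and compare as False, so they are
    never flagged and never raise.
    """
    reasons: dict[int, list[str]] = {}
    for column in df.select_dtypes(include="number").columns:
        series = df[column]
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        values = series.to_numpy(dtype=float, na_value=np.nan)
        # Vectorised comparison; only the flagged positions are looped over.
        for position in np.flatnonzero((values < lower) | (values > upper)):
            is_low = values[position] < lower
            side = "below lower" if is_low else "above upper"
            bound = lower if is_low else upper
            reasons.setdefault(int(position), []).append(
                f"{column}={values[position]:g} {side} IQR bound {bound:g}"
            )
    return reasons


df = pd.DataFrame({"price": [10.0, 12.0, 11.0, None, 13.0, 250.0]})
flags = detect_iqr_outliers(df)
# flags == {5: ["price=250 above upper IQR bound 16"]}
```

In this example Q1 is 11 and Q3 is 13, so the bounds are 8 and 16; only the value 250 falls outside them, and the missing value in row 3 is ignored.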
Project Structure Notes
- Created backend/app/core/engine/clean.py for outlier logic.
- Updated backend/app/api/v1/analysis.py with the detection endpoint.
- Added backend/tests/test_analysis.py for verification.
References
- [Source: epics.md#Story 2.3]
- [Source: project-context.md#Critical Anti-Patterns]
Dev Agent Record
Agent Model Used
{{agent_model_name_version}}
Completion Notes List
- Integrated scikit-learn for anomaly detection.
- Implemented univariate detection based on 1.5 * IQR bounds.
- Implemented multivariate detection using the Isolation Forest algorithm.
- Developed a robust API endpoint that merges results from both methods.
- Verified with unit tests covering both univariate and multivariate scenarios.
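The multivariate step and the merge of both methods' results might be sketched as follows. The function names, the median imputation of NaNs, the contamination value, and the reason string are all assumptions for illustration; the story does not say how clean.py handles these details.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def detect_isolation_forest_outliers(
    df: pd.DataFrame, contamination: float = 0.05, random_state: int = 0
) -> dict[int, list[str]]:
    """Flag rows that IsolationForest labels as anomalies (prediction == -1).

    NaNs are filled with the column median before fitting; this imputation
    strategy is an assumption, not necessarily what clean.py does.
    """
    numeric = df.select_dtypes(include="number")
    if numeric.empty:
        return {}
    filled = numeric.fillna(numeric.median()).fillna(0.0)  # all-NaN columns -> 0
    model = IsolationForest(contamination=contamination, random_state=random_state)
    labels = model.fit_predict(filled.to_numpy(dtype=float))
    return {
        int(pos): ["flagged by Isolation Forest (multivariate)"]
        for pos in np.flatnonzero(labels == -1)
    }


def merge_flags(*results: dict[int, list[str]]) -> dict[int, list[str]]:
    """Union the per-method results, concatenating reasons for shared rows."""
    merged: dict[int, list[str]] = {}
    for result in results:
        for position, reasons in result.items():
            merged.setdefault(position, []).extend(reasons)
    return dict(sorted(merged.items()))


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
data[17] = [25.0, -25.0]  # one obviously anomalous row
data[40, 0] = np.nan      # a missing value that must not crash detection
frame = pd.DataFrame(data, columns=["x", "y"])

forest_flags = detect_isolation_forest_outliers(frame)
iqr_flags = {17: ["x=25 above upper IQR bound"]}  # stand-in for the IQR result
merged = merge_flags(iqr_flags, forest_flags)
```

Merging by row position means a row caught by both methods appears once with both reasons attached, which keeps the output list compact while preserving explainability.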
File List
- /backend/app/core/engine/clean.py
- /backend/app/api/v1/analysis.py
- /backend/tests/test_analysis.py
- /backend/pyproject.toml