Analysis/_bmad-output/implementation-artifacts/2-3-detection-automatique-des-outliers-backend.md
2026-01-11 22:56:02 +01:00

Story 2.3: Détection Automatique des Outliers (Backend)

Status: review

Story

As a system, I want to identify statistical outliers in the background, so that I can alert the user to potential data quality issues.

Acceptance Criteria

  1. Algorithm Implementation: The backend implements Isolation Forest (multivariate) and IQR (univariate) outlier detection.
  2. Analysis Endpoint: A POST endpoint, /api/v1/analysis/detect-outliers, accepts a dataset and a detection configuration.
  3. Detection Output: The endpoint returns a list of outlier row indices, each with the reason it was flagged (e.g., "value outside 1.5 × IQR bounds").
  4. Performance: Detection on 50k rows completes in under 5 seconds.
  5. Robustness: Handles missing values (NaNs) gracefully without crashing.
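
To make criterion 3 concrete, here is a minimal sketch of what the detection output could look like. The class and field names (OutlierFlag, DetectOutliersResponse, row_index, method, reason, column) and the sample values are assumptions for illustration; the story does not specify the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class OutlierFlag:
    """One flagged row (hypothetical shape, not taken from the story)."""
    row_index: int                 # position of the row in the submitted dataset
    method: str                    # "iqr" or "isolation_forest"
    reason: str                    # human-readable explanation of the flag
    column: Optional[str] = None   # only set for univariate (per-column) flags


@dataclass
class DetectOutliersResponse:
    """Hypothetical response body: a flat list of flagged rows."""
    outliers: List[OutlierFlag] = field(default_factory=list)


# Illustrative values only; they do not come from a real dataset.
response = DetectOutliersResponse(
    outliers=[
        OutlierFlag(row_index=17, method="iqr", column="price",
                    reason="price=950 outside [8, 16] (1.5 * IQR rule)"),
        OutlierFlag(row_index=42, method="isolation_forest",
                    reason="isolated by Isolation Forest"),
    ]
)

payload = json.dumps(asdict(response), indent=2)
print(payload)
```

Keeping the reason as a plain string next to the row index lets a client show the explanation directly, without needing to know which algorithm produced it.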

Tasks / Subtasks

  • Dependency Update (AC: 1)
    • Add scikit-learn to the backend using uv.
  • Outlier Engine Implementation (AC: 1, 5)
    • Create backend/app/core/engine/clean.py.
    • Implement univariate IQR-based detection.
    • Implement multivariate Isolation Forest detection.
  • API Endpoint (AC: 2, 3, 4)
    • Implement POST /api/v1/analysis/detect-outliers in analysis.py.
    • Map detection results to indexed row references.
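
The last subtask, mapping detection results to indexed row references, could be sketched as follows. The helpers mask_to_flags and merge_flags are hypothetical names, and the choice of positional (0-based) row indices rather than DataFrame labels is an assumption; the story does not say which the endpoint uses.

```python
from collections import defaultdict

import pandas as pd


def mask_to_flags(mask: pd.Series, reason: str) -> dict:
    """Turn a boolean mask into {positional row index: [reason]}."""
    positions = mask.to_numpy(dtype=bool).nonzero()[0]
    return {int(pos): [reason] for pos in positions}


def merge_flags(*flag_sets: dict) -> list:
    """Merge per-method flags into one entry per row, sorted by row index."""
    merged = defaultdict(list)
    for flags in flag_sets:
        for row, reasons in flags.items():
            merged[row].extend(reasons)
    return [{"row_index": row, "reasons": merged[row]} for row in sorted(merged)]


# The index labels are deliberately not 0..n-1, to show that positions are used.
labels = pd.Index([10, 11, 12, 13])
iqr_mask = pd.Series([False, True, False, True], index=labels)
forest_mask = pd.Series([False, False, False, True], index=labels)

result = merge_flags(
    mask_to_flags(iqr_mask, "outside 1.5 * IQR bounds"),
    mask_to_flags(forest_mask, "isolated by Isolation Forest"),
)
print(result)
```

A row flagged by both methods appears once, carrying both reasons, so the caller does not need to de-duplicate.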

Dev Notes

  • Algorithms: Uses scikit-learn's IsolationForest for multivariate detection and pandas quantile logic for the univariate IQR bounds.
  • Explainability: Each outlier is returned with a descriptive string explaining why it was flagged.
  • Performance: The detection code is async-ready and relies on scikit-learn's standard optimisations.
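
The univariate side of these notes can be illustrated with a short sketch of IQR detection built on pandas quantiles. The function name detect_iqr_outliers, the k parameter, and the wording of the reason string are assumptions, not the contents of clean.py.

```python
import pandas as pd


def detect_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> list:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    flags = []
    for column in df.select_dtypes(include="number").columns:
        values = df[column]
        # Series.quantile skips NaNs, so missing values do not distort the bounds.
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Comparisons against NaN are False, so missing values are never flagged.
        mask = ((values < lower) | (values > upper)).fillna(False).to_numpy(dtype=bool)
        for pos in mask.nonzero()[0]:
            value = float(values.iloc[pos])
            flags.append({
                "row_index": int(pos),
                "column": column,
                "reason": f"{column}={value:g} outside [{lower:g}, {upper:g}] ({k:g} * IQR rule)",
            })
    return flags


df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, float("nan"), 13.0, 500.0],
    "label": ["a", "b", "c", "d", "e", "f"],  # non-numeric columns are ignored
})
flags = detect_iqr_outliers(df)
print(flags)
```

In this sample the only flagged row is the one holding 500.0; the NaN row is neither flagged nor a cause of failure, which is the behaviour criterion 5 asks for.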

Project Structure Notes

  • Created backend/app/core/engine/clean.py for outlier logic.
  • Updated backend/app/api/v1/analysis.py with the detection endpoint.
  • Added backend/tests/test_analysis.py for verification.
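
As a rough illustration of how the 50k-row performance criterion might be checked, the sketch below times IsolationForest on synthetic data. The helper run_benchmark, the column count, and the planted anomalies are assumptions; this is not the content of test_analysis.py, and elapsed time depends on hardware and library versions.

```python
import time

import numpy as np
from sklearn.ensemble import IsolationForest


def run_benchmark(n_rows: int = 50_000, n_cols: int = 5, n_planted: int = 25):
    """Fit and predict on synthetic data; return (labels, elapsed seconds)."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n_rows, n_cols))
    X[:n_planted] += 12.0  # shift a few rows far from the bulk of the data

    start = time.perf_counter()
    model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
    labels = model.fit_predict(X)  # -1 marks outliers, 1 marks inliers
    elapsed = time.perf_counter() - start
    return labels, elapsed


labels, elapsed = run_benchmark()
print(f"flagged {int((labels == -1).sum())} of {labels.size} rows in {elapsed:.2f}s")
```

A real test would compare elapsed against the 5-second budget; that comparison is omitted here because a fixed threshold can be flaky across machines.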

References

  • [Source: epics.md#Story 2.3]
  • [Source: project-context.md#Critical Anti-Patterns]

Dev Agent Record

Agent Model Used

{{agent_model_name_version}}

Completion Notes List

  • Integrated scikit-learn for anomaly detection.
  • Implemented univariate detection based on 1.5 * IQR bounds.
  • Implemented multivariate detection using the Isolation Forest algorithm.
  • Developed a robust API endpoint that merges results from both methods.
  • Verified with unit tests covering both univariate and multivariate scenarios.
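
The multivariate step could be sketched as below. The function name detect_multivariate_outliers and the median-imputation strategy for NaNs are assumptions; the story only states that missing values are handled gracefully, not how. Imputation is used here because IsolationForest has rejected NaN input in many scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def detect_multivariate_outliers(df: pd.DataFrame, random_state: int = 0) -> list:
    """Flag rows that Isolation Forest isolates, using numeric columns only."""
    numeric = df.select_dtypes(include="number").dropna(axis="columns", how="all")
    if numeric.empty:
        return []  # nothing numeric to analyse
    # Assumed strategy: fill missing values with each column's median.
    filled = numeric.fillna(numeric.median()).to_numpy()

    model = IsolationForest(contamination="auto", random_state=random_state)
    labels = model.fit_predict(filled)    # -1 marks outliers
    scores = model.score_samples(filled)  # lower means more anomalous
    return [
        {
            "row_index": int(pos),
            "reason": f"Isolation Forest anomaly score {scores[pos]:.3f}",
        }
        for pos in np.flatnonzero(labels == -1)
    ]


rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
df.iloc[0] = [15.0, 15.0, 15.0]  # planted multivariate outlier
df.iloc[5, 1] = np.nan           # a missing value that must not crash detection

flags = detect_multivariate_outliers(df)
print(flags[:3])
```

With contamination="auto", some ordinary rows are typically flagged alongside the planted one, so a production endpoint would likely expose the contamination setting through its configuration.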

File List

  • /backend/app/core/engine/clean.py
  • /backend/app/api/v1/analysis.py
  • /backend/tests/test_analysis.py
  • /backend/pyproject.toml