2026-01-11 22:56:02 +01:00

stepsCompleted: 1, 2, 3

inputDocuments:

  • _bmad-output/planning-artifacts/prd.md
  • _bmad-output/planning-artifacts/architecture.md
  • _bmad-output/planning-artifacts/ux-design-specification.md

Data_analysis - Epic Breakdown

Overview

This document provides the complete epic and story breakdown for Data_analysis. It decomposes the requirements from the PRD, the UX Design specification, and the Architecture document into implementable stories.

Requirements Inventory

Functional Requirements

  • FR1: Users can upload datasets in .xlsx, .xls, and .csv formats via drag-and-drop or file selection.
  • FR2: System automatically detects column data types (numeric, categorical, datetime) upon ingest.
  • FR3: Users can manually override detected data types if the inference is incorrect.
  • FR4: Users can rename columns directly in the interface to sanitize inputs.
  • FR5: Users can view loaded data in a paginated, virtualized grid capable of displaying 50,000+ rows.
  • FR6: Users can edit cell values directly (double-click to edit) with inputs validated against the column type.
  • FR7: Users can sort columns (asc/desc) and filter rows based on values/conditions (e.g., "> 100").
  • FR8: Users can perform Undo/Redo operations (Ctrl+Z/Ctrl+Y) on data edits within the current session.
  • FR9: Users can exclude specific rows from analysis without deleting them (soft delete/toggle).
  • FR10: System automatically identifies univariate outliers using IQR/Z-score and visualizes them in the grid/plots.
  • FR11: System automatically identifies multivariate outliers using Isolation Forest upon user request.
  • FR12: Users can accept or reject outlier exclusion proposals individually or in bulk.
  • FR13: Users can select a Target Variable (Y) to trigger an automated Feature Importance analysis.
  • FR14: System recommends the Top-N predictive features based on RFE (Recursive Feature Elimination) or Random Forest importance.
  • FR15: Users can configure a Linear Regression (Simple/Multiple) by selecting Dependent (Y) and Independent (X) variables.
  • FR16: Users can configure a Binary Logistic Regression for categorical target variables.
  • FR17: System generates a "Model Summary" including R-squared, Adjusted R-squared, F-statistic, and P-values for coefficients.
  • FR18: System generates standard diagnostic plots: Residuals vs Fitted, Q-Q Plot, and Scale-Location.
  • FR19: Users can view a Correlation Matrix (Heatmap) for selected numeric variables.
  • FR20: Users can view an interactive "Analysis Report" dashboard summarizing data health, methodology, and model results.
  • FR21: Users can export the full report as a branded PDF document.
  • FR22: System appends an "Audit Trail" to the report listing library versions, random seeds, and data exclusion steps for reproducibility.

Non-Functional Requirements

  • Performance: Grid latency < 200 ms for 50k rows. Analysis completes in < 15 s. Upload completes in < 3 s for a 5 MB file.
  • Security: Data ephemerality (purge after 1h). TLS 1.3 encryption. Input sanitization for files.
  • Reliability: Graceful degradation for bad data. Support 50 concurrent requests via async task queue.
  • Accessibility: Keyboard navigation for "Smart Grid". Screen reader support. WCAG 2.1 Level AA compliance.

Additional Requirements

Architecture:

  • Starter Template: Custom FastAPI-Next.js-Docker Boilerplate.
  • Data Serialization: Apache Arrow (IPC Stream) required for grid data.
  • State Management: Hybrid approach (TanStack Query for Server State + Zustand for Grid UI State).
  • Deployment: "Two-Service" model on Homelab via Docker Compose.
  • Naming Conventions: snake_case for Python/API, PascalCase for React components.
  • Testing: Pytest (Backend) and Vitest (Frontend).

UX Design:

  • Visual Style: "Lab & Tech" (Slate/Indigo/Mono) with Shadcn UI.
  • Responsive: Desktop Only (1366px+).
  • Core Interaction: "Guided Data Hygiene Loop" (Insight Panel).
  • Design System: TanStack Table for virtualization + Recharts for visualization.
  • Mode: Native Dark Mode support.

FR Coverage Map

  • FR1: Epic 1 - Data Ingestion
  • FR2: Epic 1 - Type Auto-detection
  • FR3: Epic 1 - Manual Type Override
  • FR4: Epic 1 - Column Renaming
  • FR5: Epic 1 - High-Performance Grid View
  • FR6: Epic 2 - Grid Cell Editing
  • FR7: Epic 1 - Grid Sort/Filter
  • FR8: Epic 2 - Edit Undo/Redo
  • FR9: Epic 2 - Row Exclusion Logic
  • FR10: Epic 2 - Univariate Outlier Detection
  • FR11: Epic 2 - Multivariate Outlier Detection
  • FR12: Epic 2 - Outlier Review UI (Insight Panel)
  • FR13: Epic 3 - Feature Importance Engine
  • FR14: Epic 3 - Smart Feature Recommendation
  • FR15: Epic 4 - Linear Regression Configuration
  • FR16: Epic 4 - Logistic Regression Configuration
  • FR17: Epic 4 - Model Summary & Metrics
  • FR18: Epic 4 - Diagnostic Plots
  • FR19: Epic 3 - Correlation Matrix Visualization
  • FR20: Epic 4 - Interactive Analysis Dashboard
  • FR21: Epic 4 - PDF Export
  • FR22: Epic 4 - Reproducibility Audit Trail

Epic List

Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a fluid grid." FRs covered: FR1, FR2, FR3, FR4, FR5, FR7.

Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers." FRs covered: FR6, FR8, FR9, FR10, FR11, FR12.

Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target." FRs covered: FR13, FR14, FR19.

Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report." FRs covered: FR15, FR16, FR17, FR18, FR20, FR21, FR22.


Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a fluid grid."

Story 1.1: Monorepo & Docker Initialization

As a developer, I want to initialize the project structure (Next.js + FastAPI + Docker), so that I have a functional and consistent development environment.

Acceptance Criteria:

  • Given a fresh project directory,
  • When I run docker-compose up,
  • Then both the Next.js frontend and the FastAPI backend are reachable on their respective ports,
  • And the shared Docker network allows communication between the services.

Story 1.2: Excel/CSV File Ingestion (Backend)

As Julien (Analyst), I want to upload an Excel or CSV file, so that the system can read my production data.

Acceptance Criteria:

  • Given a valid .xlsx file with multiple columns and 5,000 rows,
  • When I POST the file to the /upload endpoint,
  • Then the backend returns 200 OK with column metadata (names, detected types),
  • And the data is prepared as an Apache Arrow stream for high-performance delivery.

Story 1.3: Visualization in the Smart Grid (Frontend)

As Julien (Analyst), I want to see my uploaded data in an interactive high-speed grid, so that I can explore the raw data effortlessly.

Acceptance Criteria:

  • Given a dataset successfully loaded in the backend,
  • When I view the workspace page,
  • Then the TanStack Table renders the data using virtualization,
  • And scrolling through 50,000 rows remains fluid (< 200 ms latency).

Story 1.4: Type Management & Renaming (Data Hygiene)

As Julien (Analyst), I want to rename columns and correct data types, so that the data matches my business context before analysis.

Acceptance Criteria:

  • Given a column "Press_01" detected as 'text',
  • When I click the column header to rename it to "Pressure" and change its type to 'numeric',
  • Then the grid updates the visual formatting immediately,
  • And the backend validates that all values in the column can be cast to numeric.
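The backend check in the last criterion could look like the sketch below. It assumes pandas and treats empty cells as castable (they stay missing); the helper name find_uncastable is hypothetical.

```python
import pandas as pd


def find_uncastable(values: pd.Series) -> list:
    """Return the index labels whose non-empty values cannot be cast to a number."""
    converted = pd.to_numeric(values, errors="coerce")
    # A cell fails only if it held something and the conversion produced NaN.
    failed = converted.isna() & values.notna()
    return values.index[failed].tolist()


# Hypothetical column content: "n/a" blocks the override, the empty cell does not.
pressure = pd.Series(["101.3", "99.8", "n/a", None, "102"])
bad_rows = find_uncastable(pressure)  # [2]
```

Returning the offending row labels, rather than a plain pass/fail flag, lets the interface point the analyst at the cells that block the type change.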

Story 1.5: Basic Sorting & Filtering

As Julien (Analyst), I want to sort and filter my data in the grid, so that I can identify extreme values or specific subsets.

Acceptance Criteria:

  • Given a column "Temperature",
  • When I click 'Sort Descending',
  • Then the highest temperature values appear at the top of the grid instantly.
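For reference, here is a sketch of the sort and the condition filter from FR7 (for example "> 100"). The source does not say whether sorting and filtering run in the grid or on the server; this version assumes a pandas-based server-side helper, and the function names and the small condition grammar are hypothetical.

```python
import operator
import re

import pandas as pd

# Comparison symbols accepted in a filter condition such as "> 100".
_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}
_CONDITION = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def filter_rows(frame: pd.DataFrame, column: str, condition: str) -> pd.DataFrame:
    """Keep the rows whose value in `column` satisfies a condition like '> 100'."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"Unsupported condition: {condition!r}")
    symbol, threshold = match.group(1), float(match.group(2))
    return frame[_OPERATORS[symbol](frame[column], threshold)]


def sort_descending(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows ordered from the highest to the lowest value of `column`."""
    return frame.sort_values(column, ascending=False)


readings = pd.DataFrame({"Temperature": [95.0, 120.5, 100.0, 130.2]})
hot_rows = filter_rows(readings, "Temperature", "> 100")
ranked_rows = sort_descending(readings, "Temperature")
```

Parsing the condition with a strict pattern, instead of evaluating user text, keeps the filter input sanitized.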


Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers."

Story 2.1: Cell Editing & Validation

As Julien (Analyst), I want to edit cell values directly in the grid, so that I can manually correct obvious data entry errors.

Acceptance Criteria:

  • Given a data cell in the grid,
  • When I double-click the cell and enter a new value,
  • Then the value is updated in the local UI state (Zustand),
  • And the system validates the input against the column's data type (e.g., no text in numeric columns).

Story 2.2: Undo/Redo of Edits

As Julien (Analyst), I want to undo my last data edits, so that I can explore changes without fear of losing the original data.

Acceptance Criteria:

  • Given a cell value that was modified,
  • When I press Ctrl+Z (or click Undo),
  • Then the cell reverts to its previous value,
  • And Ctrl+Y (Redo) restores the edit.
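The behaviour above is commonly built on two stacks. The sketch below shows that idea only: in this project the logic would presumably live in the frontend grid state (Zustand), so the Python form and the names CellEdit and EditHistory are illustrative assumptions, not the planned implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CellEdit:
    """One recorded change: where it happened and the values before and after."""
    row: int
    column: str
    old_value: Any
    new_value: Any


@dataclass
class EditHistory:
    """Session-scoped undo/redo over a grid stored as {(row, column): value}."""
    grid: dict[tuple[int, str], Any]
    undo_stack: list[CellEdit] = field(default_factory=list)
    redo_stack: list[CellEdit] = field(default_factory=list)

    def edit(self, row: int, column: str, new_value: Any) -> None:
        key = (row, column)
        self.undo_stack.append(CellEdit(row, column, self.grid.get(key), new_value))
        self.grid[key] = new_value
        self.redo_stack.clear()  # a fresh edit invalidates the redo history

    def undo(self) -> bool:
        if not self.undo_stack:
            return False
        change = self.undo_stack.pop()
        self.grid[(change.row, change.column)] = change.old_value
        self.redo_stack.append(change)
        return True

    def redo(self) -> bool:
        if not self.redo_stack:
            return False
        change = self.redo_stack.pop()
        self.grid[(change.row, change.column)] = change.new_value
        self.undo_stack.append(change)
        return True


history = EditHistory(grid={(0, "Temperature"): 20.5})
history.edit(0, "Temperature", 25.0)
```

Clearing the redo stack on every new edit is the usual design choice: it keeps the history linear.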

Story 2.3: Automatic Outlier Detection (Backend)

As a system, I want to identify statistical outliers in the background, so that I can alert the user to potential data quality issues.

Acceptance Criteria:

  • Given a loaded dataset,
  • When the analysis engine runs,
  • Then it uses Isolation Forest (multivariate) and IQR (univariate) to tag suspicious rows,
  • And the outlier coordinates are returned to the frontend.
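Both detectors can be sketched in a few lines, assuming pandas and scikit-learn on the backend. The 1.5 × IQR fence is the common textbook default and the fixed seed is an assumption; neither value is specified in the source, and the function names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (univariate rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = q3 - q1
    return (values < q1 - k * spread) | (values > q3 + k * spread)


def isolation_forest_outliers(frame: pd.DataFrame, seed: int = 42) -> pd.Series:
    """Flag rows that Isolation Forest labels as anomalies (multivariate rule)."""
    model = IsolationForest(random_state=seed)  # fixed seed for reproducibility
    labels = model.fit_predict(frame)  # -1 marks an outlier, 1 an inlier
    return pd.Series(labels == -1, index=frame.index)


# Hypothetical readings: the last temperature is far from the rest.
readings = pd.DataFrame(
    {
        "Temperature": [20.1, 20.4, 19.8, 20.0, 20.3, 55.0],
        "Pressure": [1.01, 1.02, 0.99, 1.00, 1.03, 1.00],
    }
)
univariate_flags = iqr_outliers(readings["Temperature"])
multivariate_flags = isolation_forest_outliers(readings)
```

Both functions return a boolean mask aligned with the row index, which maps directly onto the "outlier coordinates" the frontend needs.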

Story 2.4: Insight Panel & Outlier Review (Frontend)

As Julien (Analyst), I want to review detected outliers in a side panel, so that I can understand why they are flagged before excluding them.

Acceptance Criteria:

  • Given flagged outliers exist,
  • When I click the warning icon in a column header,
  • Then the InsightPanel opens with a boxplot visualization and a "Why?" explanation,
  • And a bulk-action button (e.g., "Exclude all 34 outliers") is prominently displayed.

Story 2.5: Non-Destructive Data Exclusion

As Julien (Analyst), I want to toggle the inclusion of specific rows in the analysis, so that I can test different scenarios without deleting data.

Acceptance Criteria:

  • Given a flagged outlier row,
  • When I click "Exclude",
  • Then the row appears with 40% opacity in the grid,
  • And the row is ignored by all subsequent statistical calculations (e.g., R², regression).
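One way to model the soft delete is a boolean flag column that the analysis filters on. The sketch below assumes pandas; the column name "excluded" and both function names are hypothetical.

```python
import pandas as pd


def toggle_exclusion(frame: pd.DataFrame, row_labels: list) -> pd.DataFrame:
    """Flip the 'excluded' flag for the given rows; no data is deleted."""
    updated = frame.copy()
    updated.loc[row_labels, "excluded"] = ~updated.loc[row_labels, "excluded"]
    return updated


def analysis_view(frame: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that statistical calculations should see."""
    return frame.loc[~frame["excluded"]].drop(columns="excluded")


data = pd.DataFrame({"Temperature": [20.0, 21.0, 95.0], "excluded": False})
after_exclusion = toggle_exclusion(data, [2])
mean_without_outlier = analysis_view(after_exclusion)["Temperature"].mean()  # 20.5
after_restore = toggle_exclusion(after_exclusion, [2])
```

Because the row stays in the frame, toggling it back restores the original scenario, and the list of flagged rows can feed the report's exclusion log.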


Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target."

Story 3.1: Interactive Correlation Matrix

As Julien (Analyst), I want to see a visual correlation map of my numeric variables, so that I can quickly identify which factors are related.

Acceptance Criteria:

  • Given a dataset with multiple numeric columns,
  • When I navigate to the "Correlations" tab,
  • Then a heatmap is displayed using Pearson correlation coefficients,
  • And hovering over a cell shows the precise correlation value.
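The values behind the heatmap can be computed with a single pandas call. This is a sketch assuming pandas on the backend; the function name correlation_matrix and the sample columns are hypothetical.

```python
import pandas as pd


def correlation_matrix(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Pearson correlation between the selected numeric columns."""
    return frame[columns].corr(method="pearson")


# Hypothetical data: Pressure rises with Temperature, Defects falls with it.
process = pd.DataFrame(
    {
        "Temperature": [10.0, 20.0, 30.0, 40.0],
        "Pressure": [1.0, 2.0, 3.0, 4.0],
        "Defects": [8.0, 6.0, 4.0, 2.0],
        "Line": ["A", "B", "A", "B"],  # non-numeric, so it is not selected
    }
)
matrix = correlation_matrix(process, ["Temperature", "Pressure", "Defects"])
```

In this sample, Temperature and Pressure are perfectly linearly related, so their coefficient is 1 (up to floating-point rounding), while Temperature and Defects give -1.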

Story 3.2: Feature Importance Computation (Backend)

As a system, I want to compute the predictive power of features against a target variable, so that I can provide scientific recommendations to the user.

Acceptance Criteria:

  • Given a dataset and a selected Target Variable (Y),
  • When the RFE (Recursive Feature Elimination) algorithm runs,
  • Then the backend returns an ordered list of features with their importance scores.
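A sketch of the ranking step, assuming scikit-learn's RFE wrapped around a linear regression (the source names RFE but not the estimator). Note that RFE yields ranks (1 = kept longest) rather than continuous scores, so a real implementation may need a second measure, such as Random Forest importance from FR14, to supply the scores. The function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rank_features(features: pd.DataFrame, target: pd.Series) -> list[tuple[str, int]]:
    """Return (feature, rank) pairs ordered from most to least important."""
    selector = RFE(estimator=LinearRegression(), n_features_to_select=1)
    selector.fit(features, target)
    ranked = sorted(zip(features.columns, selector.ranking_), key=lambda pair: pair[1])
    return [(name, int(rank)) for name, rank in ranked]


# Synthetic data: the target depends strongly on Temperature, weakly on
# Pressure, and not at all on Humidity.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["Temperature", "Pressure", "Humidity"]
)
target = (
    3.0 * features["Temperature"]
    + 0.5 * features["Pressure"]
    + rng.normal(scale=0.1, size=200)
)
ranking = rank_features(features, target)
```

With a linear estimator, RFE compares coefficient magnitudes, so features should be on comparable scales (or standardized) before ranking.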

Story 3.3: Smart Variable Recommendation (Frontend)

As Julien (Analyst), I want the system to suggest which variables to include in my model, so that I don't pollute my analysis with irrelevant data ("noise").

Acceptance Criteria:

  • Given feature importance scores are calculated,
  • When I open the Model Configuration panel,
  • Then the top 5 predictive variables are pre-selected by default,
  • And a "Why?" explanation is available for each recommendation.


Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report."

Story 4.1: Regression Configuration

As Julien (Analyst), I want to configure the parameters of my regression model, so that I can tailor the analysis to my specific hypothesis.

Acceptance Criteria:

  • Given a cleaned dataset,
  • When I select "Linear Regression" and confirm the X/Y variables,
  • Then the system validates that the target variable (Y) is suitable for the chosen model type.

Story 4.2: Model Execution (Backend)

As a system, I want to execute the statistical model computation, so that I can provide accurate regression results.

Acceptance Criteria:

  • Given model parameters (X, Y, Algorithm),
  • When the "Run" action is triggered,
  • Then the backend computes R², Adjusted R², P-values, and coefficients using statsmodels,
  • And all results are returned as a JSON summary.

Story 4.3: Interactive Results Dashboard

As Julien (Analyst), I want to see the model results through interactive charts, so that I can easily diagnose the performance of my regression.

Acceptance Criteria:

  • Given computed model results,
  • When I view the "Results" page,
  • Then I see a "Real vs Predicted" scatter plot and a "Residuals" plot,
  • And key metrics (R², P-value) are displayed with colored status indicators (Success/Warning).

Story 4.4: PDF Report Generation (Audit Trail)

As Julien (Analyst), I want to export my findings as a professional PDF report, so that I can share and archive my validated analysis.

Acceptance Criteria:

  • Given a completed analysis session,
  • When I click "Export PDF",
  • Then a PDF is generated containing all charts, metrics, and a reproducibility section (library versions, seeds),
  • And the report lists all rows that were excluded during the session.
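The reproducibility section could be assembled from a small structure like the one below before it is rendered into the PDF. This sketch uses only the standard library; the field names, function names, and the package list are assumptions.

```python
import json
from datetime import datetime, timezone
from importlib import metadata


def library_versions(packages: list[str]) -> dict[str, str]:
    """Look up the installed version of each package, if it is available."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = "not installed"
    return versions


def build_audit_trail(
    packages: list[str], random_seed: int, excluded_rows: list[int]
) -> dict:
    """Collect what a reader needs to reproduce the analysis."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "library_versions": library_versions(packages),
        "random_seed": random_seed,
        "excluded_rows": sorted(excluded_rows),
    }


audit = build_audit_trail(
    packages=["pandas", "scikit-learn", "statsmodels"],
    random_seed=42,
    excluded_rows=[17, 3, 250],
)
audit_json = json.dumps(audit, indent=2)
```

Recording the seed and the excluded rows alongside the library versions means the same dataset can be re-run and compared against the exported report.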