---
stepsCompleted: [1, 2, 3]
inputDocuments: ['_bmad-output/planning-artifacts/prd.md', '_bmad-output/planning-artifacts/architecture.md', '_bmad-output/planning-artifacts/ux-design-specification.md']
---

# Data_analysis - Epic Breakdown

## Overview

This document provides the complete epic and story breakdown for Data_analysis, decomposing the requirements from the PRD, the UX Design (if it exists), and the Architecture into implementable stories.

## Requirements Inventory

### Functional Requirements

- **FR1:** Users can upload datasets in .xlsx, .xls, and .csv formats via drag-and-drop or file selection.
- **FR2:** System automatically detects column data types (numeric, categorical, datetime) upon ingest.
- **FR3:** Users can manually override detected data types if the inference is incorrect.
- **FR4:** Users can rename columns directly in the interface to sanitize inputs.
- **FR5:** Users can view loaded data in a paginated, virtualized grid capable of displaying 50,000+ rows.
- **FR6:** Users can edit cell values directly (double-click to edit) with inputs validated against the column type.
- **FR7:** Users can sort columns (asc/desc) and filter rows based on values/conditions (e.g., "> 100").
- **FR8:** Users can perform Undo/Redo operations (Ctrl+Z/Ctrl+Y) on data edits within the current session.
- **FR9:** Users can exclude specific rows from analysis without deleting them (soft delete/toggle).
- **FR10:** System automatically identifies univariate outliers using IQR/Z-score and visualizes them in the grid/plots.
- **FR11:** System automatically identifies multivariate outliers using Isolation Forest upon user request.
- **FR12:** Users can accept or reject outlier exclusion proposals individually or in bulk.
- **FR13:** Users can select a Target Variable (Y) to trigger an automated Feature Importance analysis.
- **FR14:** System recommends the Top-N predictive features based on RFE (Recursive Feature Elimination) or Random Forest importance.
- **FR15:** Users can configure a Linear Regression (Simple/Multiple) by selecting Dependent (Y) and Independent (X) variables.
- **FR16:** Users can configure a Binary Logistic Regression for categorical target variables.
- **FR17:** System generates a "Model Summary" including R-squared, Adjusted R-squared, F-statistic, and P-values for coefficients.
- **FR18:** System generates standard diagnostic plots: Residuals vs Fitted, Q-Q Plot, and Scale-Location.
- **FR19:** Users can view a Correlation Matrix (Heatmap) for selected numeric variables.
- **FR20:** Users can view an interactive "Analysis Report" dashboard summarizing data health, methodology, and model results.
- **FR21:** Users can export the full report as a branded PDF document.
- **FR22:** System appends an "Audit Trail" to the report listing library versions, random seeds, and data exclusion steps for reproducibility.

### Non-Functional Requirements

- **Performance:** Grid latency < 200ms for 50k rows. Analysis completion time < 15s. Upload time < 3s for a 5MB file.
- **Security:** Data ephemerality (purge after 1h). TLS 1.3 encryption. Input sanitization for files.
- **Reliability:** Graceful degradation for bad data. Support 50 concurrent requests via async task queue.
- **Accessibility:** Keyboard navigation for "Smart Grid". Screen reader support. WCAG 2.1 Level AA compliance.

### Additional Requirements

**Architecture:**
- **Starter Template:** Custom FastAPI-Next.js-Docker Boilerplate.
- **Data Serialization:** Apache Arrow (IPC Stream) required for grid data.
- **State Management:** Hybrid approach (TanStack Query for Server State + Zustand for Grid UI State).
- **Deployment:** "Two-Service" model on Homelab via Docker Compose.
- **Naming Conventions:** `snake_case` for Python/API, `PascalCase` for React components.
- **Testing:** Pytest (Backend) and Vitest (Frontend).

**UX Design:**
- **Visual Style:** "Lab & Tech" (Slate/Indigo/Mono) with Shadcn UI.
- **Responsive:** Desktop Only (1366px+).
- **Core Interaction:** "Guided Data Hygiene Loop" (Insight Panel).
- **Design System:** TanStack Table for virtualization + Recharts for visualization.
- **Mode:** Native Dark Mode support.

### FR Coverage Map

- **FR1:** Epic 1 - Data Ingestion
- **FR2:** Epic 1 - Type Auto-detection
- **FR3:** Epic 1 - Manual Type Override
- **FR4:** Epic 1 - Column Renaming
- **FR5:** Epic 1 - High-Performance Grid View
- **FR6:** Epic 2 - Grid Cell Editing
- **FR7:** Epic 1 - Grid Sort/Filter
- **FR8:** Epic 2 - Edit Undo/Redo
- **FR9:** Epic 2 - Row Exclusion Logic
- **FR10:** Epic 2 - Univariate Outlier Detection
- **FR11:** Epic 2 - Multivariate Outlier Detection
- **FR12:** Epic 2 - Outlier Review UI (Insight Panel)
- **FR13:** Epic 3 - Feature Importance Engine
- **FR14:** Epic 3 - Smart Feature Recommendation
- **FR15:** Epic 4 - Linear Regression Configuration
- **FR16:** Epic 4 - Logistic Regression Configuration
- **FR17:** Epic 4 - Model Summary & Metrics
- **FR18:** Epic 4 - Diagnostic Plots
- **FR19:** Epic 3 - Correlation Matrix Visualization
- **FR20:** Epic 4 - Interactive Analysis Dashboard
- **FR21:** Epic 4 - PDF Export
- **FR22:** Epic 4 - Reproducibility Audit Trail

## Epic List

### Epic 1: Foundation & Data Ingestion
"I can upload my Excel file and see my raw data in a smooth grid."
**FRs covered:** FR1, FR2, FR3, FR4, FR5, FR7.

### Epic 2: Interactive Cleaning (Hygiene Loop)
"I can clean my data by removing erroneous rows or outliers."
**FRs covered:** FR6, FR8, FR9, FR10, FR11, FR12.

### Epic 3: Intelligence & Selection (Smart Prep)
"The system tells me which variables matter for my target."
**FRs covered:** FR13, FR14, FR19.

### Epic 4: Modeling & Reporting
"I generate my regression model and export the PDF report."
**FRs covered:** FR15, FR16, FR17, FR18, FR20, FR21, FR22.

---

## Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a smooth grid."

### Story 1.1: Monorepo & Docker Initialization
As a developer,
I want to initialize the project structure (Next.js + FastAPI + Docker),
So that I have a functional and consistent development environment.

**Acceptance Criteria:**
**Given** A fresh project directory.
**When** I run `docker-compose up`.
**Then** Both the Next.js frontend and FastAPI backend are reachable on their respective ports.
**And** The shared Docker network allows communication between services.

### Story 1.2: Excel/CSV File Ingestion (Backend)
As Julien (Analyst),
I want to upload an Excel or CSV file,
So that the system can read my production data.

**Acceptance Criteria:**
**Given** A valid `.xlsx` file with multiple columns and 5,000 rows.
**When** I POST the file to the `/upload` endpoint.
**Then** The backend returns a 200 OK with column metadata (names, detected types).
**And** The data is prepared as an Apache Arrow stream for high-performance delivery.
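The column metadata in the criteria above depends on type detection (FR2). A minimal sketch of one way that detection could work, using only the Python standard library. The names `infer_column_type` and `column_metadata` are hypothetical, and the rule "every non-blank value must parse" is an assumption, not something the PRD specifies. Excel parsing and Arrow serialization are out of scope here.

```python
from datetime import datetime


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_iso_datetime(text):
    """Return True if the text parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Classify a column as 'numeric', 'datetime', or 'categorical'."""
    # Blank cells are ignored so a few missing values do not change the type.
    non_empty = [v.strip() for v in values if v and v.strip()]
    if non_empty and all(_is_number(v) for v in non_empty):
        return "numeric"
    if non_empty and all(_is_iso_datetime(v) for v in non_empty):
        return "datetime"
    # Anything else (including an all-blank column) falls back to categorical.
    return "categorical"


def column_metadata(header, rows):
    """Build the per-column metadata an upload response could carry."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [
        {"name": name, "detected_type": infer_column_type(column)}
        for name, column in zip(header, columns)
    ]


header = ["Temperature", "Line", "Timestamp"]
rows = [
    ["21.5", "A", "2024-01-05"],
    ["", "B", "2024-01-06"],
    ["19", "A", "2024-01-07"],
]
metadata = column_metadata(header, rows)
```

Because the inference can be wrong (for example, numeric IDs that are really categories), the manual override in FR3 remains necessary.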

### Story 1.3: Viewing Data in the Smart Grid (Frontend)
As Julien (Analyst),
I want to see my uploaded data in an interactive high-speed grid,
So that I can explore the raw data effortlessly.

**Acceptance Criteria:**
**Given** A dataset successfully loaded in the backend.
**When** I view the workspace page.
**Then** The TanStack Table renders the data using virtualization.
**And** Scrolling through 50,000 rows remains fluid (< 200ms latency).

### Story 1.4: Type Management & Renaming (Data Hygiene)
As Julien (Analyst),
I want to rename columns and correct data types,
So that the data matches my business context before analysis.

**Acceptance Criteria:**
**Given** A column "Press_01" detected as 'text'.
**When** I click the column header to rename it to "Pressure" and change type to 'numeric'.
**Then** The grid updates the visual formatting immediately.
**And** The backend validates that all values in the column can be cast to numeric.

### Story 1.5: Basic Sorting & Filtering
As Julien (Analyst),
I want to sort and filter my data in the grid,
So that I can identify extreme values or specific subsets.

**Acceptance Criteria:**
**Given** A column "Temperature".
**When** I click 'Sort Descending'.
**Then** The highest temperature values appear at the top of the grid instantly.
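FR7 also allows filtering on conditions such as "> 100". A small sketch of how such a condition string could be parsed and combined with a descending sort; it is written in Python for illustration only, since the real grid logic would live in the frontend. `parse_condition` and `sort_and_filter` are hypothetical names, and the supported operator set is an assumption.

```python
import operator
import re

# Comparison operators accepted in a filter condition (assumed set).
_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}
# Two-character operators come first so ">=" is not read as ">".
_CONDITION = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_condition(text):
    """Turn a string like '> 100' into a predicate over numeric values."""
    match = _CONDITION.match(text)
    if match is None:
        raise ValueError(f"Unsupported filter condition: {text!r}")
    compare = _OPERATORS[match.group(1)]
    threshold = float(match.group(2))
    return lambda value: compare(value, threshold)


def sort_and_filter(rows, column, condition=None, descending=False):
    """Keep rows matching the condition, then sort them by the column."""
    keep = parse_condition(condition) if condition else (lambda _value: True)
    selected = [row for row in rows if keep(row[column])]
    return sorted(selected, key=lambda row: row[column], reverse=descending)


rows = [{"Temperature": 95.0}, {"Temperature": 120.5}, {"Temperature": 101.0}]
hot_first = sort_and_filter(rows, "Temperature", "> 100", descending=True)
```

Rejecting anything the pattern does not match keeps free-text input from being evaluated as code.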

---

## Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers."

### Story 2.1: Cell Editing & Validation
As Julien (Analyst),
I want to edit cell values directly in the grid,
So that I can manually correct obvious data entry errors.

**Acceptance Criteria:**
**Given** A data cell in the grid.
**When** I double-click the cell and enter a new value.
**Then** The value is updated in the local UI state (Zustand).
**And** The system validates the input against the column's data type (e.g., no text in numeric columns).

### Story 2.2: Undo/Redo of Edits
As Julien (Analyst),
I want to undo my last data edits,
So that I can explore changes without fear of losing the original data.

**Acceptance Criteria:**
**Given** A cell value was modified.
**When** I press `Ctrl+Z` (or click Undo).
**Then** The cell reverts to its previous value.
**And** `Ctrl+Y` (Redo) restores the edit.
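The undo/redo behavior above is commonly modeled with two stacks. The following is a language-neutral sketch in Python, not the Zustand store itself; `EditHistory` and its methods are hypothetical names. It assumes that a new edit after an undo discards the redo history, which is the usual convention but is not stated in the PRD.

```python
class EditHistory:
    """Two-stack undo/redo over a grid stored as a list of dict rows."""

    def __init__(self, grid):
        self.grid = grid
        self._undo = []  # edits that can be reverted
        self._redo = []  # reverted edits that can be re-applied

    def edit(self, row, column, new_value):
        """Apply an edit and remember the previous value."""
        old_value = self.grid[row][column]
        self.grid[row][column] = new_value
        self._undo.append((row, column, old_value, new_value))
        self._redo.clear()  # assumption: a fresh edit invalidates redo

    def undo(self):
        """Revert the most recent edit; return False if there is none."""
        if not self._undo:
            return False
        row, column, old_value, new_value = self._undo.pop()
        self.grid[row][column] = old_value
        self._redo.append((row, column, old_value, new_value))
        return True

    def redo(self):
        """Re-apply the most recently undone edit; False if there is none."""
        if not self._redo:
            return False
        row, column, old_value, new_value = self._redo.pop()
        self.grid[row][column] = new_value
        self._undo.append((row, column, old_value, new_value))
        return True


grid = [{"Pressure": 10.0}]
history = EditHistory(grid)
history.edit(0, "Pressure", 12.5)
history.undo()   # grid[0]["Pressure"] is 10.0 again
history.redo()   # grid[0]["Pressure"] is 12.5 again
```

Storing only the changed cell per entry keeps the history small even for a 50,000-row grid.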

### Story 2.3: Automatic Outlier Detection (Backend)
As a system,
I want to identify statistical outliers in the background,
So that I can alert the user to potential data quality issues.

**Acceptance Criteria:**
**Given** A dataset is loaded.
**When** The analysis engine runs.
**Then** It uses Isolation Forest (multivariate) and IQR (univariate) to tag suspicious rows.
**And** Outlier coordinates are returned to the frontend.
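For the univariate part, a minimal sketch of IQR-based tagging using only the standard library. The 1.5 × IQR fence is the conventional Tukey default and is an assumption here, as are the names `iqr_outlier_indices` and `tag_outliers`. The multivariate Isolation Forest step is not covered; it would typically rely on a library such as scikit-learn.

```python
import statistics


def iqr_outlier_indices(values, k=1.5):
    """Return the row indices whose value lies outside the IQR fences."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def tag_outliers(columns):
    """Map each numeric column name to the row indices flagged in it."""
    return {name: iqr_outlier_indices(vals) for name, vals in columns.items()}


columns = {
    "Pressure": [10, 12, 11, 13, 12, 11, 95],
    "Temperature": [20, 21, 20, 22, 21, 20, 21],
}
coordinates = tag_outliers(columns)
```

The result pairs a column name with row indices, which is one possible shape for the "outlier coordinates" returned to the frontend.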

### Story 2.4: Insight Panel & Outlier Review (Frontend)
As Julien (Analyst),
I want to review detected outliers in a side panel,
So that I can understand why they are flagged before excluding them.

**Acceptance Criteria:**
**Given** Flagged outliers exist.
**When** I click the warning icon in a column header.
**Then** The `InsightPanel` opens with a boxplot visualization and a "Why?" explanation.
**And** A button "Exclude all N outliers" (e.g., "Exclude all 34 outliers") is prominently displayed.

### Story 2.5: Non-Destructive Data Exclusion
As Julien (Analyst),
I want to toggle the inclusion of specific rows in the analysis,
So that I can test different scenarios without deleting data.

**Acceptance Criteria:**
**Given** A flagged outlier row.
**When** I click "Exclude".
**Then** The row appears with 40% opacity in the grid.
**And** The row is ignored by all subsequent statistical calculations (R², Regression).
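One way to keep exclusion non-destructive is to store a set of excluded row indices next to the untouched data and apply it only when computing statistics. A minimal sketch under that assumption; `AnalysisDataset` and its methods are hypothetical names.

```python
import statistics


class AnalysisDataset:
    """Rows are never deleted; exclusion is a reversible mask."""

    def __init__(self, rows):
        self.rows = rows
        self.excluded = set()  # indices of rows hidden from analysis

    def toggle_exclusion(self, index):
        """Exclude an included row, or re-include an excluded one."""
        if index in self.excluded:
            self.excluded.discard(index)
        else:
            self.excluded.add(index)

    def active_rows(self):
        """Rows that statistical computations are allowed to see."""
        return [r for i, r in enumerate(self.rows) if i not in self.excluded]

    def mean(self, column):
        """Example statistic computed on active rows only."""
        return statistics.fmean([r[column] for r in self.active_rows()])


dataset = AnalysisDataset(
    [{"Pressure": 10.0}, {"Pressure": 12.0}, {"Pressure": 98.0}]
)
mean_all = dataset.mean("Pressure")      # uses all three rows
dataset.toggle_exclusion(2)              # soft-exclude the suspicious row
mean_clean = dataset.mean("Pressure")    # uses the first two rows only
```

The same `excluded` set could feed both the 40% opacity styling in the grid and the list of excluded rows in the final report.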

---

## Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target."

### Story 3.1: Interactive Correlation Matrix
As Julien (Analyst),
I want to see a visual correlation map of my numeric variables,
So that I can quickly identify which factors are related.

**Acceptance Criteria:**
**Given** A dataset with multiple numeric columns.
**When** I navigate to the "Correlations" tab.
**Then** A heatmap is displayed using Pearson correlation coefficients.
**And** Hovering over a cell shows the precise correlation value.
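The heatmap values are pairwise Pearson coefficients. A minimal standard-library sketch of the computation behind such a matrix; `pearson` and `correlation_matrix` are hypothetical names, and a production backend would more likely call a data-frame library. Constant columns are rejected because the coefficient is undefined when a variable has zero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names} for a in names
    }


columns = {
    "Temperature": [1, 2, 3, 4],
    "Pressure": [2, 4, 6, 8],   # rises with Temperature
    "Defects": [8, 6, 4, 2],    # falls as Temperature rises
}
matrix = correlation_matrix(columns)
```

Each `matrix[a][b]` entry is the value a hover tooltip would display for the corresponding heatmap cell.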

### Story 3.2: Feature Importance Computation (Backend)
As a system,
I want to compute the predictive power of features against a target variable,
So that I can provide scientific recommendations to the user.

**Acceptance Criteria:**
**Given** A dataset and a selected Target Variable (Y).
**When** The RFE (Recursive Feature Elimination) algorithm runs.
**Then** The backend returns an ordered list of features with their importance scores.

### Story 3.3: Smart Variable Recommendation (Frontend)
As Julien (Analyst),
I want the system to suggest which variables to include in my model,
So that I don't pollute my analysis with irrelevant data ("noise").

**Acceptance Criteria:**
**Given** Feature importance scores are calculated.
**When** I open the Model Configuration panel.
**Then** The top 5 predictive variables are pre-selected by default.
**And** An explanation "Why?" is available for each recommendation.

---

## Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report."

### Story 4.1: Regression Configuration
As Julien (Analyst),
I want to configure the parameters of my regression model,
So that I can tailor the analysis to my specific hypothesis.

**Acceptance Criteria:**
**Given** A cleaned dataset.
**When** I select "Linear Regression" and confirm X/Y variables.
**Then** The system validates that the target variable (Y) is suitable for the chosen model type.

### Story 4.2: Model Execution (Backend)
As a system,
I want to execute the statistical model computation,
So that I can provide accurate regression results.

**Acceptance Criteria:**
**Given** Model parameters (X, Y, Algorithm).
**When** The "Run" action is triggered.
**Then** The backend computes R², Adjusted R², P-values, and coefficients using `statsmodels`.
**And** All results are returned as a JSON summary.
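To make the metrics concrete, here is a standard-library sketch of a simple (one-predictor) least-squares fit with R² and Adjusted R², serialized as JSON. This is not the `statsmodels` call the story requires; it only illustrates the formulas. P-values are omitted because they need a t-distribution, which the standard library does not provide. The function name and the JSON keys are hypothetical.

```python
import json


def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Assumes at least 3 observations and non-constant x and y.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - SS_res / SS_tot
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    # Adjusted R^2 penalizes the number of predictors p (here p = 1).
    p = 1
    adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

    return {
        "coefficients": {"intercept": intercept, "slope": slope},
        "r_squared": r_squared,
        "adjusted_r_squared": adjusted_r_squared,
        "n_observations": n,
    }


summary = fit_simple_linear_regression(
    [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
)
payload = json.dumps(summary)  # JSON summary for the frontend
```

Excluded rows (Story 2.5) would need to be filtered out before `x` and `y` are passed in.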

### Story 4.3: Interactive Results Dashboard
As Julien (Analyst),
I want to see the model results through interactive charts,
So that I can easily diagnose the performance of my regression.

**Acceptance Criteria:**
**Given** Computed model results.
**When** I view the "Results" page.
**Then** I see a "Real vs Predicted" scatter plot and a "Residuals" plot.
**And** Key metrics (R², P-value) are displayed with colored status indicators (Success/Warning).

### Story 4.4: PDF Report Generation (Audit Trail)
As Julien (Analyst),
I want to export my findings as a professional PDF report,
So that I can share and archive my validated analysis.

**Acceptance Criteria:**
**Given** A completed analysis session.
**When** I click "Export PDF".
**Then** A PDF is generated containing all charts, metrics, and a reproducibility section (lib versions, seeds).
**And** The report lists all rows that were excluded during the session.
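The reproducibility section could be assembled from a small structure like the one below. This is a sketch under assumptions: `library_versions` and `build_audit_trail` are hypothetical names, and the field layout is illustrative rather than a defined report schema. PDF rendering itself is not shown.

```python
import json
import platform
from importlib import metadata


def library_versions(names):
    """Look up installed versions for the given distribution names."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def build_audit_trail(library_names, random_seed, excluded_rows):
    """Collect what is needed to reproduce an analysis session."""
    return {
        "python_version": platform.python_version(),
        "library_versions": library_versions(library_names),
        "random_seed": random_seed,
        "excluded_rows": sorted(excluded_rows),  # stable order for the report
    }


trail = build_audit_trail(
    library_names=["statsmodels", "scikit-learn"],  # assumed dependencies
    random_seed=42,
    excluded_rows={17, 3, 250},
)
report_section = json.dumps(trail, indent=2)
```

Recording the seed matters mainly for the stochastic steps, such as Isolation Forest and Random Forest importance.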