---
stepsCompleted: [1, 2, 3]
inputDocuments: ['_bmad-output/planning-artifacts/prd.md', '_bmad-output/planning-artifacts/architecture.md', '_bmad-output/planning-artifacts/ux-design-specification.md']
---

# Data_analysis - Epic Breakdown

## Overview

This document provides the complete epic and story breakdown for Data_analysis, decomposing the requirements from the PRD, the Architecture document, and the UX Design specification into implementable stories.

## Requirements Inventory

### Functional Requirements

- **FR1:** Users can upload datasets in .xlsx, .xls, and .csv formats via drag-and-drop or file selection.
- **FR2:** System automatically detects column data types (numeric, categorical, datetime) upon ingest.
- **FR3:** Users can manually override detected data types if the inference is incorrect.
- **FR4:** Users can rename columns directly in the interface to sanitize inputs.
- **FR5:** Users can view loaded data in a paginated, virtualized grid capable of displaying 50,000+ rows.
- **FR6:** Users can edit cell values directly (double-click to edit) with inputs validated against the column type.
- **FR7:** Users can sort columns (asc/desc) and filter rows based on values/conditions (e.g., "> 100").
- **FR8:** Users can perform Undo/Redo operations (Ctrl+Z/Ctrl+Y) on data edits within the current session.
- **FR9:** Users can exclude specific rows from analysis without deleting them (soft delete/toggle).
- **FR10:** System automatically identifies univariate outliers using IQR/Z-score and visualizes them in the grid/plots.
- **FR11:** System automatically identifies multivariate outliers using Isolation Forest upon user request.
- **FR12:** Users can accept or reject outlier exclusion proposals individually or in bulk.
- **FR13:** Users can select a Target Variable (Y) to trigger an automated Feature Importance analysis.
- **FR14:** System recommends the Top-N predictive features based on RFE (Recursive Feature Elimination) or Random Forest importance.
- **FR15:** Users can configure a Linear Regression (Simple/Multiple) by selecting Dependent (Y) and Independent (X) variables.
- **FR16:** Users can configure a Binary Logistic Regression for categorical target variables.
- **FR17:** System generates a "Model Summary" including R-squared, Adjusted R-squared, F-statistic, and P-values for coefficients.
- **FR18:** System generates standard diagnostic plots: Residuals vs Fitted, Q-Q Plot, and Scale-Location.
- **FR19:** Users can view a Correlation Matrix (Heatmap) for selected numeric variables.
- **FR20:** Users can view an interactive "Analysis Report" dashboard summarizing data health, methodology, and model results.
- **FR21:** Users can export the full report as a branded PDF document.
- **FR22:** System appends an "Audit Trail" to the report listing library versions, random seeds, and data exclusion steps for reproducibility.

### Non-Functional Requirements

- **Performance:** Grid latency < 200ms for 50k rows. Analysis completion time < 15s. Upload time < 3s for a 5MB file.
- **Security:** Data ephemerality (purge after 1h). TLS 1.3 encryption. Input sanitization for files.
- **Reliability:** Graceful degradation for bad data. Support 50 concurrent requests via async task queue.
- **Accessibility:** Keyboard navigation for "Smart Grid". Screen reader support. WCAG 2.1 Level AA compliance.
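
The data ephemerality requirement (purge after 1h) could be approached along the following lines. This is a minimal in-memory sketch, not the project's storage layer: `EphemeralStore`, its method names, and the injectable `clock` parameter are hypothetical and exist only to make the expiry rule concrete and testable.

```python
import time


class EphemeralStore:
    """Hypothetical in-memory store that forgets entries after a TTL."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._items = {}             # key -> (expires_at, value)

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)

    def purge_expired(self):
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._items.items() if expires_at <= now]
        for key in expired:
            del self._items[key]
        return len(expired)

    def get(self, key):
        self.purge_expired()         # never serve data past its TTL
        entry = self._items.get(key)
        return None if entry is None else entry[1]
```

A production version would also have to delete any files written to disk; this sketch only covers the in-memory expiry rule.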

### Additional Requirements

**Architecture:**
- **Starter Template:** Custom FastAPI-Next.js-Docker Boilerplate.
- **Data Serialization:** Apache Arrow (IPC Stream) required for grid data.
- **State Management:** Hybrid approach (TanStack Query for Server State + Zustand for Grid UI State).
- **Deployment:** "Two-Service" model on Homelab via Docker Compose.
- **Naming Conventions:** `snake_case` for Python/API, `PascalCase` for React components.
- **Testing:** Pytest (Backend) and Vitest (Frontend).

**UX Design:**
- **Visual Style:** "Lab & Tech" (Slate/Indigo/Mono) with Shadcn UI.
- **Responsive:** Desktop Only (1366px+).
- **Core Interaction:** "Guided Data Hygiene Loop" (Insight Panel).
- **Design System:** TanStack Table for virtualization + Recharts for visualization.
- **Mode:** Native Dark Mode support.

### FR Coverage Map

- **FR1:** Epic 1 - Data Ingestion
- **FR2:** Epic 1 - Type Auto-detection
- **FR3:** Epic 1 - Manual Type Override
- **FR4:** Epic 1 - Column Renaming
- **FR5:** Epic 1 - High-Performance Grid View
- **FR6:** Epic 2 - Grid Cell Editing
- **FR7:** Epic 1 - Grid Sort/Filter
- **FR8:** Epic 2 - Edit Undo/Redo
- **FR9:** Epic 2 - Row Exclusion Logic
- **FR10:** Epic 2 - Univariate Outlier Detection
- **FR11:** Epic 2 - Multivariate Outlier Detection
- **FR12:** Epic 2 - Outlier Review UI (Insight Panel)
- **FR13:** Epic 3 - Feature Importance Engine
- **FR14:** Epic 3 - Smart Feature Recommendation
- **FR15:** Epic 4 - Linear Regression Configuration
- **FR16:** Epic 4 - Logistic Regression Configuration
- **FR17:** Epic 4 - Model Summary & Metrics
- **FR18:** Epic 4 - Diagnostic Plots
- **FR19:** Epic 3 - Correlation Matrix Visualization
- **FR20:** Epic 4 - Interactive Analysis Dashboard
- **FR21:** Epic 4 - PDF Export
- **FR22:** Epic 4 - Reproducibility Audit Trail

## Epic List

### Epic 1: Foundation & Data Ingestion
"I can upload my Excel file and see my raw data in a smooth grid."
**FRs covered:** FR1, FR2, FR3, FR4, FR5, FR7.

### Epic 2: Interactive Cleaning (Hygiene Loop)
"I can clean my data by removing erroneous rows or outliers."
**FRs covered:** FR6, FR8, FR9, FR10, FR11, FR12.

### Epic 3: Intelligence & Selection (Smart Prep)
"The system tells me which variables matter for my target."
**FRs covered:** FR13, FR14, FR19.

### Epic 4: Modeling & Reporting
"I generate my regression model and export the PDF report."
**FRs covered:** FR15, FR16, FR17, FR18, FR20, FR21, FR22.

---

## Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a smooth grid."

### Story 1.1: Monorepo & Docker Initialization
As a developer,
I want to initialize the project structure (Next.js + FastAPI + Docker),
So that I have a functional and consistent development environment.

**Acceptance Criteria:**
**Given** A fresh project directory.
**When** I run `docker-compose up`.
**Then** Both the Next.js frontend and FastAPI backend are reachable on their respective ports.
**And** The shared Docker network allows communication between services.

### Story 1.2: Excel/CSV File Ingestion (Backend)
As Julien (Analyst),
I want to upload an Excel or CSV file,
So that the system can read my production data.

**Acceptance Criteria:**
**Given** A valid `.xlsx` file with multiple columns and 5,000 rows.
**When** I POST the file to the `/upload` endpoint.
**Then** The backend returns a 200 OK with column metadata (names, detected types).
**And** The data is prepared as an Apache Arrow stream for high-performance delivery.
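
The type detection that feeds the column metadata could work roughly as follows. This is a standard-library sketch under stated assumptions: `infer_column_type` is a hypothetical helper, cells arrive as raw strings, and only ISO 8601 dates are recognized. The actual backend may rely on a dataframe library's own inference instead.

```python
from datetime import datetime


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_iso_datetime(text):
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Guess 'numeric', 'datetime', or 'categorical' from raw string cells."""
    # Assumption: empty or missing cells do not influence the detected type.
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "categorical"
    if all(_is_number(v) for v in non_empty):
        return "numeric"
    if all(_is_iso_datetime(v) for v in non_empty):
        return "datetime"
    return "categorical"
```

Checking numeric before datetime keeps plain numbers from being read as compact dates; a column with even one unparseable cell falls back to categorical, which is what makes the manual override in FR3 necessary.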

### Story 1.3: Smart Grid Visualization (Frontend)
As Julien (Analyst),
I want to see my uploaded data in an interactive high-speed grid,
So that I can explore the raw data effortlessly.

**Acceptance Criteria:**
**Given** A dataset successfully loaded in the backend.
**When** I view the workspace page.
**Then** The TanStack Table renders the data using virtualization.
**And** Scrolling through 50,000 rows remains fluid (< 200ms latency).

### Story 1.4: Type Management & Renaming (Data Hygiene)
As Julien (Analyst),
I want to rename columns and correct data types,
So that the data matches my business context before analysis.

**Acceptance Criteria:**
**Given** A column "Press_01" detected as 'text'.
**When** I click the column header to rename it to "Pressure" and change type to 'numeric'.
**Then** The grid updates the visual formatting immediately.
**And** The backend validates that all values in the column can be cast to numeric.
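
The backend check in the last criterion might look like the sketch below. `find_uncastable_values` is a hypothetical name, and treating empty cells as missing rather than invalid is an assumption, not something the PRD specifies.

```python
def find_uncastable_values(values):
    """Return (row_index, raw_value) pairs that cannot be cast to float."""
    failures = []
    for index, raw in enumerate(values):
        text = "" if raw is None else str(raw).strip()
        if text == "":
            continue  # assumption: empty cells are missing values, not errors
        try:
            float(text)
        except ValueError:
            failures.append((index, raw))
    return failures
```

Returning the offending rows, rather than a bare pass/fail flag, would let the grid point the user at the exact cells blocking the type change.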

### Story 1.5: Basic Sorting & Filtering
As Julien (Analyst),
I want to sort and filter my data in the grid,
So that I can identify extreme values or specific subsets.

**Acceptance Criteria:**
**Given** A column "Temperature".
**When** I click 'Sort Descending'.
**Then** The highest temperature values appear at the top of the grid within the 200ms grid latency budget.
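
FR7 also covers condition filters such as "> 100". The sketch below shows one way to parse such a condition and combine it with a descending sort. The function names and the supported operator set are assumptions, and in the real application this logic would more likely live in the grid's table layer.

```python
import operator
import re

_OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}
# Two-character operators come first so ">=" is not read as ">".
_CONDITION = re.compile(r"^\s*(>=|<=|!=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_condition(text):
    """Turn a string like '> 100' into a predicate over numbers."""
    match = _CONDITION.match(text)
    if match is None:
        raise ValueError(f"Unsupported filter condition: {text!r}")
    compare = _OPERATORS[match.group(1)]
    threshold = float(match.group(2))
    return lambda value: compare(value, threshold)


def filter_and_sort(rows, column, condition, descending=True):
    """Keep rows matching the condition, then sort them by the same column."""
    keep = parse_condition(condition)
    matching = [row for row in rows if keep(row[column])]
    return sorted(matching, key=lambda row: row[column], reverse=descending)
```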

---

## Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers."

### Story 2.1: Cell Editing & Validation
As Julien (Analyst),
I want to edit cell values directly in the grid,
So that I can manually correct obvious data entry errors.

**Acceptance Criteria:**
**Given** A data cell in the grid.
**When** I double-click the cell and enter a new value.
**Then** The value is updated in the local UI state (Zustand).
**And** The system validates the input against the column's data type (e.g., no text in numeric columns).

### Story 2.2: Undo/Redo of Edits
As Julien (Analyst),
I want to undo my last data edits,
So that I can explore changes without fear of losing the original data.

**Acceptance Criteria:**
**Given** A cell value was modified.
**When** I press `Ctrl+Z` (or click Undo).
**Then** The cell reverts to its previous value.
**And** `Ctrl+Y` (Redo) restores the edit.
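
The undo/redo behavior is commonly built on two stacks. The sketch below illustrates that pattern in Python for consistency with the other examples. `EditHistory` is hypothetical, and the real implementation would presumably live in the frontend grid store.

```python
class EditHistory:
    """Two-stack undo/redo over a grid stored as {(row, column): value}."""

    def __init__(self, grid):
        self.grid = grid
        self._undo = []   # entries: (cell, old_value, new_value)
        self._redo = []

    def edit(self, cell, new_value):
        self._undo.append((cell, self.grid.get(cell), new_value))
        self._redo.clear()            # a fresh edit invalidates the redo chain
        self.grid[cell] = new_value

    def undo(self):
        if not self._undo:
            return False
        cell, old_value, new_value = self._undo.pop()
        self.grid[cell] = old_value
        self._redo.append((cell, old_value, new_value))
        return True

    def redo(self):
        if not self._redo:
            return False
        cell, old_value, new_value = self._redo.pop()
        self.grid[cell] = new_value
        self._undo.append((cell, old_value, new_value))
        return True
```

Clearing the redo stack on every new edit is the conventional choice: once the user diverges from the undone history, replaying it would overwrite their latest change.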

### Story 2.3: Automatic Outlier Detection (Backend)
As a system,
I want to identify statistical outliers in the background,
So that I can alert the user to potential data quality issues.

**Acceptance Criteria:**
**Given** A dataset is loaded.
**When** The analysis engine runs.
**Then** It uses IQR (univariate) and, when the user requests it, Isolation Forest (multivariate) to tag suspicious rows.
**And** Outlier coordinates are returned to the frontend.
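
The univariate part of this story can be illustrated with the usual IQR fence rule. This sketch uses only the standard library; the function name, the `k=1.5` default, and the "inclusive" quantile method are assumptions, and the multivariate Isolation Forest step is not shown.

```python
import statistics


def iqr_outlier_indices(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < lower or v > upper]
```

For example, with `[12, 100, 10, 14, 11, 15, 13]` the quartiles are 11.5 and 14.5, so the fences are 7 and 19 and only the value 100 (index 1) is flagged.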

### Story 2.4: Insight Panel & Outlier Review (Frontend)
As Julien (Analyst),
I want to review detected outliers in a side panel,
So that I can understand why they are flagged before excluding them.

**Acceptance Criteria:**
**Given** Flagged outliers exist.
**When** I click the warning icon in a column header.
**Then** The `InsightPanel` opens with a boxplot visualization and a "Why?" explanation.
**And** A button "Exclude all 34 outliers" is prominently displayed.

### Story 2.5: Non-Destructive Data Exclusion
As Julien (Analyst),
I want to toggle the inclusion of specific rows in the analysis,
So that I can test different scenarios without deleting data.

**Acceptance Criteria:**
**Given** A flagged outlier row.
**When** I click "Exclude".
**Then** The row appears with 40% opacity in the grid.
**And** The row is ignored by all subsequent statistical calculations (R², Regression).
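
One simple way to model the soft-delete toggle is to keep the rows untouched and track excluded indices separately, as in this hypothetical `AnalysisDataset` sketch. The class and method names are illustrative only.

```python
class AnalysisDataset:
    """Rows are never deleted; exclusion is a reversible flag per row index."""

    def __init__(self, rows):
        self.rows = rows
        self.excluded = set()

    def toggle_exclusion(self, index):
        if index in self.excluded:
            self.excluded.remove(index)
        else:
            self.excluded.add(index)

    def active_values(self, column):
        """Values that statistical computations are allowed to see."""
        return [
            row[column]
            for index, row in enumerate(self.rows)
            if index not in self.excluded
        ]
```

Because every calculation reads through `active_values`, re-including a row is just a second toggle, and the set of excluded indices is directly available for the audit trail.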

---

## Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target."

### Story 3.1: Interactive Correlation Matrix
As Julien (Analyst),
I want to see a visual correlation map of my numeric variables,
So that I can quickly identify which factors are related.

**Acceptance Criteria:**
**Given** A dataset with multiple numeric columns.
**When** I navigate to the "Correlations" tab.
**Then** A heatmap is displayed using Pearson correlation coefficients.
**And** Hovering over a cell shows the precise correlation value.
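
The values behind the heatmap are pairwise Pearson coefficients. The sketch below computes them from scratch to make the definition explicit. The function names are hypothetical, and a real backend would more plausibly call a dataframe library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and raises ZeroDivisionError.
    return covariance / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map {name: values} to a nested dict of pairwise correlations."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```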

### Story 3.2: Feature Importance Computation (Backend)
As a system,
I want to compute the predictive power of features against a target variable,
So that I can provide scientific recommendations to the user.

**Acceptance Criteria:**
**Given** A dataset and a selected Target Variable (Y).
**When** The RFE (Recursive Feature Elimination) algorithm runs.
**Then** The backend returns an ordered list of features with their importance scores.

### Story 3.3: Smart Variable Recommendation (Frontend)
As Julien (Analyst),
I want the system to suggest which variables to include in my model,
So that I don't pollute my analysis with irrelevant data ("noise").

**Acceptance Criteria:**
**Given** Feature importance scores are calculated.
**When** I open the Model Configuration panel.
**Then** The top 5 predictive variables are pre-selected by default.
**And** An explanation "Why?" is available for each recommendation.

---

## Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report."

### Story 4.1: Regression Configuration
As Julien (Analyst),
I want to configure the parameters of my regression model,
So that I can tailor the analysis to my specific hypothesis.

**Acceptance Criteria:**
**Given** A cleaned dataset.
**When** I select "Linear Regression" and confirm X/Y variables.
**Then** The system validates that the target variable (Y) is suitable for the chosen model type.
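
The suitability check could be as simple as the sketch below, which assumes two model types named `"linear"` and `"logistic"`. The function name, the model-type strings, and the messages are all hypothetical.

```python
def validate_target(values, model_type):
    """Return (is_valid, message) for a target column and a model type."""
    present = [v for v in values if v is not None]
    if model_type == "linear":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            return True, "ok"
        return False, "Linear regression needs a numeric target."
    if model_type == "logistic":
        classes = set(present)
        if len(classes) == 2:
            return True, "ok"
        return False, (
            f"Binary logistic regression needs exactly 2 classes, found {len(classes)}."
        )
    raise ValueError(f"Unknown model type: {model_type!r}")
```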

### Story 4.2: Model Execution (Backend)
As a system,
I want to execute the statistical model computation,
So that I can provide accurate regression results.

**Acceptance Criteria:**
**Given** Model parameters (X, Y, Algorithm).
**When** The "Run" action is triggered.
**Then** The backend computes R², Adjusted R², F-statistic, P-values, and coefficients using `statsmodels`.
**And** All results are returned as a JSON summary.
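
For reference, the two headline metrics follow the standard definitions sketched below. This only illustrates the formulas on already-computed predictions. It is not how `statsmodels` is invoked, and the function names are hypothetical.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_observations, n_predictors):
    """Penalize R² for the number of predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)
```

Adjusted R² is the figure to watch when comparing models with different numbers of X variables, since plain R² never decreases when a predictor is added.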

### Story 4.3: Interactive Results Dashboard
As Julien (Analyst),
I want to see the model results through interactive charts,
So that I can easily diagnose the performance of my regression.

**Acceptance Criteria:**
**Given** Computed model results.
**When** I view the "Results" page.
**Then** I see a "Real vs Predicted" scatter plot and a "Residuals" plot.
**And** Key metrics (R², P-value) are displayed with colored status indicators (Success/Warning).

### Story 4.4: PDF Report Generation (Audit Trail)
As Julien (Analyst),
I want to export my findings as a professional PDF report,
So that I can share and archive my validated analysis.

**Acceptance Criteria:**
**Given** A completed analysis session.
**When** I click "Export PDF".
**Then** A PDF is generated containing all charts, metrics, and a reproducibility section (lib versions, seeds).
**And** The report lists all rows that were excluded during the session.
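
The reproducibility section could be assembled from a small dictionary like the one below before it is rendered into the PDF. `build_audit_trail` and its field names are hypothetical, and the list of libraries to report is left to the caller.

```python
import platform
from importlib import metadata


def build_audit_trail(libraries, random_seed, excluded_rows):
    """Collect the facts needed to reproduce an analysis session."""
    versions = {}
    for name in libraries:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python_version": platform.python_version(),
        "library_versions": versions,
        "random_seed": random_seed,
        "excluded_rows": sorted(excluded_rows),
    }
```

Reading versions from installed package metadata at export time, rather than hard-coding them, keeps the report accurate after dependency upgrades.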