---
stepsCompleted: [1, 2, 3]
inputDocuments: ['_bmad-output/planning-artifacts/prd.md', '_bmad-output/planning-artifacts/architecture.md', '_bmad-output/planning-artifacts/ux-design-specification.md']
---

# Data_analysis - Epic Breakdown

## Overview

This document provides the complete epic and story breakdown for Data_analysis, decomposing the requirements from the PRD, UX Design if it exists, and Architecture requirements into implementable stories.

## Requirements Inventory

### Functional Requirements

- **FR1:** Users can upload datasets in .xlsx, .xls, and .csv formats via drag-and-drop or file selection.
- **FR2:** System automatically detects column data types (numeric, categorical, datetime) upon ingest.
- **FR3:** Users can manually override detected data types if the inference is incorrect.
- **FR4:** Users can rename columns directly in the interface to sanitize inputs.
- **FR5:** Users can view loaded data in a paginated, virtualized grid capable of displaying 50,000+ rows.
- **FR6:** Users can edit cell values directly (double-click to edit) with inputs validated against the column type.
- **FR7:** Users can sort columns (asc/desc) and filter rows based on values/conditions (e.g., "> 100").
- **FR8:** Users can perform Undo/Redo operations (Ctrl+Z/Ctrl+Y) on data edits within the current session.
- **FR9:** Users can exclude specific rows from analysis without deleting them (soft delete/toggle).
- **FR10:** System automatically identifies univariate outliers using IQR/Z-score and visualizes them in the grid/plots.
- **FR11:** System automatically identifies multivariate outliers using Isolation Forest upon user request.
- **FR12:** Users can accept or reject outlier exclusion proposals individually or in bulk.
- **FR13:** Users can select a Target Variable (Y) to trigger an automated Feature Importance analysis.
- **FR14:** System recommends the Top-N predictive features based on RFE (Recursive Feature Elimination) or Random Forest importance.
- **FR15:** Users can configure a Linear Regression (Simple/Multiple) by selecting Dependent (Y) and Independent (X) variables.
- **FR16:** Users can configure a Binary Logistic Regression for categorical target variables.
- **FR17:** System generates a "Model Summary" including R-squared, Adjusted R-squared, F-statistic, and P-values for coefficients.
- **FR18:** System generates standard diagnostic plots: Residuals vs Fitted, Q-Q Plot, and Scale-Location.
- **FR19:** Users can view a Correlation Matrix (Heatmap) for selected numeric variables.
- **FR20:** Users can view an interactive "Analysis Report" dashboard summarizing data health, methodology, and model results.
- **FR21:** Users can export the full report as a branded PDF document.
- **FR22:** System appends an "Audit Trail" to the report listing library versions, random seeds, and data exclusion steps for reproducibility.

### Non-Functional Requirements

- **Performance:** Grid latency < 200ms for 50k rows. Analysis run time < 15s. Upload time < 3s for a 5MB file.
- **Security:** Data ephemerality (purge after 1h). TLS 1.3 encryption. Input sanitization for files.
- **Reliability:** Graceful degradation for bad data. Support 50 concurrent requests via async task queue.
- **Accessibility:** Keyboard navigation for "Smart Grid". Screen reader support. WCAG 2.1 Level AA compliance.

### Additional Requirements

**Architecture:**

- **Starter Template:** Custom FastAPI-Next.js-Docker Boilerplate.
- **Data Serialization:** Apache Arrow (IPC Stream) required for grid data.
- **State Management:** Hybrid approach (TanStack Query for Server State + Zustand for Grid UI State).
- **Deployment:** "Two-Service" model on Homelab via Docker Compose.
- **Naming Conventions:** `snake_case` for Python/API, `PascalCase` for React components.
- **Testing:** Pytest (Backend) and Vitest (Frontend).

**UX Design:**

- **Visual Style:** "Lab & Tech" (Slate/Indigo/Mono) with Shadcn UI.
- **Responsive:** Desktop Only (1366px+).
- **Core Interaction:** "Guided Data Hygiene Loop" (Insight Panel).
- **Design System:** TanStack Table for virtualization + Recharts for visualization.
- **Mode:** Native Dark Mode support.

### FR Coverage Map

- **FR1:** Epic 1 - Data Ingestion
- **FR2:** Epic 1 - Type Auto-detection
- **FR3:** Epic 1 - Manual Type Override
- **FR4:** Epic 1 - Column Renaming
- **FR5:** Epic 1 - High-Performance Grid View
- **FR6:** Epic 2 - Grid Cell Editing
- **FR7:** Epic 1 - Grid Sort/Filter
- **FR8:** Epic 2 - Edit Undo/Redo
- **FR9:** Epic 2 - Row Exclusion Logic
- **FR10:** Epic 2 - Univariate Outlier Detection
- **FR11:** Epic 2 - Multivariate Outlier Detection
- **FR12:** Epic 2 - Outlier Review UI (Insight Panel)
- **FR13:** Epic 3 - Feature Importance Engine
- **FR14:** Epic 3 - Smart Feature Recommendation
- **FR15:** Epic 4 - Linear Regression Configuration
- **FR16:** Epic 4 - Logistic Regression Configuration
- **FR17:** Epic 4 - Model Summary & Metrics
- **FR18:** Epic 4 - Diagnostic Plots
- **FR19:** Epic 3 - Correlation Matrix Visualization
- **FR20:** Epic 4 - Interactive Analysis Dashboard
- **FR21:** Epic 4 - PDF Export
- **FR22:** Epic 4 - Reproducibility Audit Trail

## Epic List

### Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a smooth grid."

**FRs covered:** FR1, FR2, FR3, FR4, FR5, FR7.

### Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers."

**FRs covered:** FR6, FR8, FR9, FR10, FR11, FR12.

### Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target."

**FRs covered:** FR13, FR14, FR19.

### Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report."

**FRs covered:** FR15, FR16, FR17, FR18, FR20, FR21, FR22.


---

## Epic 1: Foundation & Data Ingestion

"I can upload my Excel file and see my raw data in a smooth grid."

### Story 1.1: Monorepo & Docker Initialization

As a developer,
I want to initialize the project structure (Next.js + FastAPI + Docker),
So that I have a functional and consistent development environment.

**Acceptance Criteria:**

**Given** A fresh project directory.
**When** I run `docker-compose up`.
**Then** Both the Next.js frontend and FastAPI backend are reachable on their respective ports.
**And** The shared Docker network allows communication between services.

### Story 1.2: Excel/CSV File Ingestion (Backend)

As Julien (Analyst),
I want to upload an Excel or CSV file,
So that the system can read my production data.

**Acceptance Criteria:**

**Given** A valid `.xlsx` file with multiple columns and 5,000 rows.
**When** I POST the file to the `/upload` endpoint.
**Then** The backend returns a 200 OK with column metadata (names, detected types).
**And** The data is prepared as an Apache Arrow stream for high-performance delivery.

### Story 1.3: Visualization in the Smart Grid (Frontend)

As Julien (Analyst),
I want to see my uploaded data in an interactive high-speed grid,
So that I can explore the raw data effortlessly.

**Acceptance Criteria:**

**Given** A dataset successfully loaded in the backend.
**When** I view the workspace page.
**Then** The TanStack Table renders the data using virtualization.
**And** Scrolling through 50,000 rows remains fluid (< 200ms latency).

### Story 1.4: Type Management & Renaming (Data Hygiene)

As Julien (Analyst),
I want to rename columns and correct data types,
So that the data matches my business context before analysis.

**Acceptance Criteria:**

**Given** A column "Press_01" detected as 'text'.
**When** I click the column header to rename it to "Pressure" and change type to 'numeric'.
**Then** The grid updates the visual formatting immediately.
**And** The backend validates that all values in the column can be cast to numeric.

### Story 1.5: Basic Sorting & Filtering

As Julien (Analyst),
I want to sort and filter my data in the grid,
So that I can identify extreme values or specific subsets.

**Acceptance Criteria:**

**Given** A column "Temperature".
**When** I click 'Sort Descending'.
**Then** The highest temperature values appear at the top of the grid instantly.

---

## Epic 2: Interactive Cleaning (Hygiene Loop)

"I can clean my data by removing erroneous rows or outliers."

### Story 2.1: Cell Editing & Validation

As Julien (Analyst),
I want to edit cell values directly in the grid,
So that I can manually correct obvious data entry errors.

**Acceptance Criteria:**

**Given** A data cell in the grid.
**When** I double-click the cell and enter a new value.
**Then** The value is updated in the local UI state (Zustand).
**And** The system validates the input against the column's data type (e.g., no text in numeric columns).

### Story 2.2: Undo/Redo of Edits

As Julien (Analyst),
I want to undo my last data edits,
So that I can explore changes without fear of losing the original data.

**Acceptance Criteria:**

**Given** A cell value was modified.
**When** I press `Ctrl+Z` (or click Undo).
**Then** The cell reverts to its previous value.
**And** `Ctrl+Y` (Redo) restores the edit.

### Story 2.3: Automatic Outlier Detection (Backend)

As a system,
I want to identify statistical outliers in the background,
So that I can alert the user to potential data quality issues.

**Acceptance Criteria:**

**Given** A dataset is loaded.
**When** The analysis engine runs.
**Then** It uses Isolation Forest (multivariate) and IQR (univariate) to tag suspicious rows.
**And** Outlier coordinates are returned to the frontend.
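
To make the univariate half of Story 2.3 concrete, here is a minimal sketch of IQR-based tagging that returns outlier coordinates as `(row_index, column_name)` pairs. It uses only the Python standard library. The function name, the input shape (a dict of column lists), the 1.5 multiplier, and the sample data are assumptions for illustration, not decisions taken from the PRD or architecture. The multivariate Isolation Forest pass is not covered here; it would more likely rely on a library such as scikit-learn.

```python
from statistics import quantiles


def iqr_outlier_coordinates(columns, k=1.5):
    """Return (row_index, column_name) pairs for values outside the IQR fences.

    `columns` maps a column name to a list of numeric values (None = missing).
    `k` is the fence multiplier; 1.5 is a common convention, assumed here.
    """
    coordinates = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < 4:
            continue  # too few points to estimate quartiles meaningfully
        q1, _, q3 = quantiles(present, n=4, method="inclusive")
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        for row, value in enumerate(values):
            if value is not None and not (lower <= value <= upper):
                coordinates.append((row, name))
    return coordinates


# Hypothetical sample: one obviously suspicious temperature reading in row 6.
sample = {
    "Temperature": [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 55.0, 20.3],
    "Pressure": [1.01, 1.02, 1.00, 1.03, 1.01, 1.02, 1.01, 1.00],
}
flagged = iqr_outlier_coordinates(sample)
print(flagged)
```

Returning coordinates rather than mutating the data keeps the detection step compatible with the non-destructive exclusion described in Story 2.5.
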
### Story 2.4: Insight Panel & Outlier Review (Frontend)

As Julien (Analyst),
I want to review detected outliers in a side panel,
So that I can understand why they are flagged before excluding them.

**Acceptance Criteria:**

**Given** Flagged outliers exist.
**When** I click the warning icon in a column header.
**Then** The `InsightPanel` opens with a boxplot visualization and a "Why?" explanation.
**And** A button "Exclude all 34 outliers" is prominently displayed.

### Story 2.5: Non-Destructive Data Exclusion

As Julien (Analyst),
I want to toggle the inclusion of specific rows in the analysis,
So that I can test different scenarios without deleting data.

**Acceptance Criteria:**

**Given** A flagged outlier row.
**When** I click "Exclude".
**Then** The row appears with 40% opacity in the grid.
**And** The row is ignored by all subsequent statistical calculations (R², Regression).

---

## Epic 3: Intelligence & Selection (Smart Prep)

"The system tells me which variables matter for my target."

### Story 3.1: Interactive Correlation Matrix

As Julien (Analyst),
I want to see a visual correlation map of my numeric variables,
So that I can quickly identify which factors are related.

**Acceptance Criteria:**

**Given** A dataset with multiple numeric columns.
**When** I navigate to the "Correlations" tab.
**Then** A heatmap is displayed using Pearson correlation coefficients.
**And** Hovering over a cell shows the precise correlation value.

### Story 3.2: Feature Importance Computation (Backend)

As a system,
I want to compute the predictive power of features against a target variable,
So that I can provide scientific recommendations to the user.

**Acceptance Criteria:**

**Given** A dataset and a selected Target Variable (Y).
**When** The RFE (Recursive Feature Elimination) algorithm runs.
**Then** The backend returns an ordered list of features with their importance scores.
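
The sketch below illustrates only the shape of the result Story 3.2 asks for: an ordered list of features with scores. It deliberately swaps in a much simpler technique, ranking features by the absolute Pearson correlation with the target (the same coefficient the Story 3.1 heatmap displays), and is not an implementation of RFE. A real backend would more likely delegate RFE to a library such as scikit-learn. All function names, column names, and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, score it as 0
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Return [(feature_name, score)] sorted by descending |r| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: "Pressure" tracks the target exactly, "Batch" is constant.
target = [10.0, 12.0, 14.0, 16.0, 18.0]
features = {
    "Humidity": [3.0, 1.0, 4.0, 1.0, 5.0],
    "Pressure": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Batch": [7.0, 7.0, 7.0, 7.0, 7.0],
}
ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Unlike RFE, this ranking scores each feature in isolation and therefore ignores redundancy between features, which is one reason the acceptance criteria name RFE rather than a plain correlation ranking.
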
### Story 3.3: Smart Variable Recommendation (Frontend)

As Julien (Analyst),
I want the system to suggest which variables to include in my model,
So that I don't pollute my analysis with irrelevant data ("noise").

**Acceptance Criteria:**

**Given** Feature importance scores are calculated.
**When** I open the Model Configuration panel.
**Then** The top 5 predictive variables are pre-selected by default.
**And** An explanation "Why?" is available for each recommendation.

---

## Epic 4: Modeling & Reporting

"I generate my regression model and export the PDF report."

### Story 4.1: Regression Configuration

As Julien (Analyst),
I want to configure the parameters of my regression model,
So that I can tailor the analysis to my specific hypothesis.

**Acceptance Criteria:**

**Given** A cleaned dataset.
**When** I select "Linear Regression" and confirm X/Y variables.
**Then** The system validates that the target variable (Y) is suitable for the chosen model type.

### Story 4.2: Model Execution (Backend)

As a system,
I want to execute the statistical model computation,
So that I can provide accurate regression results.

**Acceptance Criteria:**

**Given** Model parameters (X, Y, Algorithm).
**When** The "Run" action is triggered.
**Then** The backend computes R², Adjusted R², P-values, and coefficients using `statsmodels`.
**And** All results are returned as a JSON summary.

### Story 4.3: Interactive Results Dashboard

As Julien (Analyst),
I want to see the model results through interactive charts,
So that I can easily diagnose the performance of my regression.

**Acceptance Criteria:**

**Given** Computed model results.
**When** I view the "Results" page.
**Then** I see a "Real vs Predicted" scatter plot and a "Residuals" plot.
**And** Key metrics (R², P-value) are displayed with colored status indicators (Success/Warning).
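
As a reference for the metrics computed in Story 4.2 and displayed in Story 4.3, here is a small sketch that derives R² and Adjusted R² from actual and predicted values and maps a metric to a Success/Warning status. In the planned backend these numbers would come from `statsmodels`; the sketch recomputes them by hand only to make the definitions explicit. The function names, the sample values, and the 0.7 status threshold are assumptions; the source documents do not define a threshold.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R² for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def status(value, threshold=0.7):
    """Map a metric to a dashboard status; the threshold is a placeholder."""
    return "Success" if value >= threshold else "Warning"


# Hypothetical "Real vs Predicted" pairs from a one-predictor model.
actual = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [2.5, 3.5, 6.0, 8.5, 9.5]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
residuals = [y - p for y, p in zip(actual, predicted)]  # input for the Residuals plot
print(round(r2, 4), round(adj_r2, 4), status(r2))
```

The same `actual`, `predicted`, and `residuals` lists are what the two charts in Story 4.3 would plot, so the JSON summary from Story 4.2 would need to carry them or values derived from them.
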
### Story 4.4: PDF Report Generation (Audit Trail)

As Julien (Analyst),
I want to export my findings as a professional PDF report,
So that I can share and archive my validated analysis.

**Acceptance Criteria:**

**Given** A completed analysis session.
**When** I click "Export PDF".
**Then** A PDF is generated containing all charts, metrics, and a reproducibility section (lib versions, seeds).
**And** The report lists all rows that were excluded during the session.
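
One possible shape for the reproducibility section, sketched with the Python standard library only: a dictionary holding library versions, the random seed, and the excluded rows, ready to be rendered into the PDF. The field names, the seed value, the row indices, and the list of libraries are assumptions; only `statsmodels` is named in the stories above, and `pandas` and `scikit-learn` are included here as plausible examples.

```python
import json
import platform
from importlib.metadata import PackageNotFoundError, version


def library_versions(names):
    """Look up installed versions; missing packages are reported, not fatal."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


def build_audit_trail(libraries, random_seed, excluded_rows):
    """Assemble the reproducibility data for the report (hypothetical schema)."""
    return {
        "python_version": platform.python_version(),
        "library_versions": library_versions(libraries),
        "random_seed": random_seed,
        "excluded_rows": sorted(excluded_rows),
    }


trail = build_audit_trail(
    libraries=["pandas", "scikit-learn", "statsmodels"],  # assumed dependency list
    random_seed=42,                                       # hypothetical seed
    excluded_rows={17, 3, 250},                           # hypothetical row indices
)
print(json.dumps(trail, indent=2))
```

Sorting the excluded rows gives the report a stable ordering, so two exports of the same session produce identical audit sections.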