
Product Requirements Document - Data_analysis

Author: Sepehr | Date: 2026-01-10

Executive Summary

Data_analysis aims to democratize advanced statistical analysis by combining the robustness of Python's scientific ecosystem with the accessibility of a modern web interface. It serves as a web-based, modern alternative to Minitab, specifically optimized for regression analysis workflows. The platform empowers users—from data novices to analysts—to upload datasets (Excel/CSV), intuitively clean and manipulate data in an Excel-like grid, and perform sophisticated regression modeling with automated guidance.

What Makes This Special

  • Guided Analytical Workflow: Unlike traditional tools that present a toolbox, Data_analysis guides the user through a best-practice workflow: Data Upload -> Auto-Outlier Detection -> Smart Feature Selection -> Regression Modeling -> Explainable Results.
  • Hybrid Interface: Merges the familiarity of spreadsheet editing (direct cell manipulation, copy-paste) with the power of a computational notebook, eliminating the need to switch between Excel and statistical software.
  • Modern Tech Stack: Built on a robust Python backend (FastAPI/Django) for heavy statistical lifting (Pandas, Scikit-learn, Statsmodels) and a high-performance Next.js 16 frontend with Tailwind CSS and Shadcn UI, ensuring a fast, responsive, and visually appealing experience.
  • Automated Insights: Proactively identifies data quality issues (outliers) and relevant predictors (feature selection) using advanced algorithms (e.g., Isolation Forest, Recursive Feature Elimination), visualizing why certain features are selected before the user commits to a model.
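
The two automated steps named above can be sketched with scikit-learn; this is a minimal illustration of the chain (synthetic data, illustrative parameters), not the project's implementation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two columns carry signal; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Step 1: flag multivariate outliers (fixed random_state for reproducibility).
iso = IsolationForest(random_state=42, contamination=0.05)
inlier_mask = iso.fit_predict(X) == 1  # fit_predict returns -1 for outliers

# Step 2: rank features on the cleaned data and keep the top 2.
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X[inlier_mask], y[inlier_mask])
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)  # [0, 1] — the two informative columns
```

In the product, the same `support_` and ranking information would feed the "why these features" visualization before the user commits to a model.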

Key Workflows & Scenarios

  1. "Auto-Optimization" (Core Scenario):
    • File upload -> Automatic outlier detection (Isolation Forest) with proposed corrections.
    • Automatic feature selection (RFE/Lasso) to identify the key variables.
    • Final robust regression for a model that is less sensitive to noise.
  2. Modern Minitab Classics:
    • Simple & Multiple Linear Regression: Interactive interface to visualize the regression line, residuals, and metrics (R²), with automatic alerts when model assumptions are violated.
    • Binary Logistic Regression: For Yes/No predictions, with confusion matrices and interactive ROC curves.
  3. Model Comparison (Benchmark):
    • Parallel runs of several algorithms (Linear Regression, Random Forest, XGBoost) to recommend the best performer.
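
The benchmark "tournament" could be sketched with cross-validated scoring; here `GradientBoostingRegressor` stands in for XGBoost (which lives in a separate package), and the candidate list is illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=300)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gboost": GradientBoostingRegressor(random_state=0),
}
# 5-fold cross-validated R² per model; the highest mean wins the tournament.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # on this linear synthetic data, plain linear regression wins
```

Reporting the per-model cross-validation scores alongside the recommendation keeps the tournament transparent rather than a black box.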

Project Classification

  • Technical Type: web_app
  • Domain: scientific
  • Complexity: medium
  • Project Context: Greenfield - new project

Success Criteria

User Success

  • Efficiency: Complete a full regression cycle (Upload -> Cleaning -> Modeling) in under 3 minutes for typical datasets.
  • Cognitive Load Reduction: Users feel guided and confident without needing deep statistical expertise; "Explainable AI" visuals clarify outlier removal and feature selection.
  • Data Mastery: Users gain comprehensive insights into their data quality and variable importance through automated profiling.

Business Success

  • Operational Speed: Significant reduction in man-hours spent on manual data cleaning and repetitive modeling tasks.
  • Platform Adoption: High user retention rate by providing a "least head-scratching" experience compared to Minitab or raw Excel.

Technical Success

  • High-Performance Grid: Excel-like interface handles 50k+ rows with sub-second latency for filtering and sorting.
  • Algorithmic Integrity: Reliable Python backend providing accurate statistical outputs consistent with industry-standard libraries (Scikit-learn, Statsmodels).

Product Scope

MVP - Minimum Viable Product

  • Direct Data Import: Robust Excel/CSV upload.
  • Smart Data Grid: Seamless cell editing, filtering, and sorting in a modern web UI.
  • Automated Data Preparation: Integrated Outlier Detection (visual) and Feature Selection algorithms.
  • Regression Core: High-quality Linear (Simple/Multiple) regression output with clear diagnostics.

Growth Features (Post-MVP)

  • Binary Logistic Regression: Support for classification-based predictive modeling.
  • Model Benchmark: Automated "tournament" mode to find the best performing algorithm.
  • Advanced Reporting: Exportable dashboard summaries for stakeholder presentations.

Vision (Future)

  • Time Series Forecasting: Expansion into temporal data prediction.
  • Native Integration: Two-way sync with Excel/Cloud Storage providers.

User Journeys

Journey 1: Julien, the Quality Engineer - Racing the Clock Julien is under pressure. A production line has just reported an abnormal drift on electronic components. It is 11:00, and he must present a regression analysis to his director at 14:00 to decide whether production should be halted. He has exported a complex Excel file with 40 variables (temperature, humidity, pressure, etc.). Normally he would lose an hour just cleaning the data and figuring out which variables are relevant.

He opens Data_analysis. He drags his Excel file into the interface. His data immediately appears in a familiar grid. The system flashes a message: "34 outliers detected in column 'Pression_Zone_B'". Julien clicks, sees the red points on a chart, and excludes them in one click. He then launches "Smart Feature Selection". The system explains: "The variables Température_C and Vitesse_Tapis explain 85% of the drift". By 11:15, Julien already has a validated regression model and a clear visualization. He feels confident going into his meeting: he is not just presenting numbers, he is presenting a validated solution.

Journey 2: Sarah, the IT Administrator - Stress-Free Management Sarah must ensure that the tools the engineers use are secure and do not saturate server resources. She logs into her Data_analysis dashboard. At a glance, she sees the number of analyses in progress and the memory consumed by the Python backend. She creates a new account for an intern in a few seconds. For her, success means nobody calling to say "the software crashed" because a file was too big.

Journey 3: Marc, the Production Director - The Quick Decision Marc receives a link from Julien. He is not a statistician. He opens the link on his tablet. He sees not lines of code but an interactive report. He looks at the regression chart, reads the simplified explanation ("Belt speed is the key factor"), and clicks the PDF button to archive it. He was able to decide to adjust the belt speed within 2 minutes, saving the afternoon's production run.

Journey Requirements Summary

For Julien (Analyst):

  • Fast Data Ingestion: Drag-and-drop Excel/CSV support.
  • Visual Data Cleaning: Automated outlier detection with interactive exclusion.
  • Explainable ML: Feature selection that explains "why" (percentage of variance explained).
  • Validation: Clear regression metrics (R², P-values) presented simply.

For Sarah (Admin):

  • System Health: Dashboard for monitoring backend resources (Python server load).
  • Access Control: Simple user management (RBAC).
  • Stability: Robust error handling for large files to prevent system crashes.

For Marc (Consumer):

  • Accessibility: Mobile/Tablet responsive view for reports.
  • Simplicity: "Read-only" mode with simplified insights (no code/formulas).
  • Portability: One-click PDF export for archiving/sharing.

Domain-Specific Requirements

Scientific Validation & Reproducibility

Data_analysis must adhere to strict scientific rigor to be a credible Minitab alternative. Users rely on these results for quality control and critical decision-making.

Key Domain Concerns

  • Reproducibility: Ensuring identical inputs yield identical outputs, regardless of when or where the analysis is run.
  • Methodological Transparency: Avoiding "black box" algorithms; users must understand how an outlier was detected.
  • Computational Integrity: Handling floating-point precision and large matrix operations without degradation.

Compliance Requirements

  • Audit Trail: Every generated report must include an appendix listing:
    • Software Version & Library Versions (Pandas/Scikit-learn versions).
    • Random Seed used for stochastic processes (Isolation Forest, train/test split).
    • Sequence of applied filters (e.g., "Row 45 excluded due to Z-score > 3").
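
A minimal sketch of assembling that appendix, assuming a hypothetical `build_audit_trail` helper (the field names are illustrative):

```python
import json
import sys
import pandas as pd
import sklearn

# Hypothetical helper: collect everything the report appendix needs
# to reproduce a run (library versions, random seed, applied filters).
def build_audit_trail(seed: int, filters: list[str]) -> dict:
    return {
        "python": sys.version.split()[0],
        "pandas": pd.__version__,
        "scikit-learn": sklearn.__version__,
        "random_seed": seed,
        "filters": filters,
    }

trail = build_audit_trail(42, ["Row 45 excluded due to Z-score > 3"])
print(json.dumps(trail, indent=2))
```

Serializing the trail as JSON also makes it machine-readable, so a later run can replay exactly the same exclusions.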

Industry Standards & Best Practices

  • Statistical Standards: Use statsmodels for classical regression (p-values, confidence intervals) to match traditional statistical expectations, and scikit-learn for predictive tasks.
  • Visual Standards: Error bars, confidence bands, and residual plots must follow standard scientific visualization conventions (e.g., Q-Q plots for normality).

Required Expertise & Validation

  • Validation Methodology:
    • Unit Tests for Math: Verify regression outputs against known standard datasets (e.g., Anscombe's quartet).
    • Drift Detection: Alert users if data distribution significantly deviates from assumptions (e.g., normality check for linear regression).

Implementation Considerations

  • Asynchronous Processing: Heavy computations (Feature Selection on >10k rows) must be offloaded to a background worker (Celery/Redis) to maintain UI responsiveness.
  • Fixed Random Seeds: All stochastic algorithms must use a fixed random state by default to ensure consistency, with an option for the user to change it.
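
One way to enforce the seed default is a factory wrapper; `DEFAULT_SEED` and the function name here are hypothetical conventions, not the project's API:

```python
from sklearn.ensemble import IsolationForest

DEFAULT_SEED = 42  # hypothetical app-wide default

def make_outlier_detector(random_state: int = DEFAULT_SEED) -> IsolationForest:
    """Stochastic estimators get a fixed seed by default, so repeated runs
    on the same data flag the same outliers; callers may override it."""
    return IsolationForest(random_state=random_state, contamination="auto")

det_a = make_outlier_detector()
det_b = make_outlier_detector()
assert det_a.get_params()["random_state"] == det_b.get_params()["random_state"] == 42
```

The chosen seed is then recorded in the report's audit trail, closing the reproducibility loop.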

Innovation & Novel Patterns

Detected Innovation Areas

  • Hybrid "Spreadsheet-Notebook" Interface:

    • Concept: Combines the low barrier to entry of a spreadsheet (Excel) with the computational power and reproducibility of a code notebook (Jupyter), without requiring the user to write code.
    • Differentiation: Traditional tools are either "click-heavy" (Minitab) or "code-heavy" (Python/R). Data_analysis sits in the "sweet spot" of No-Code Data Science with full transparency.
  • Guided "GPS" Workflow:

    • Concept: Instead of a passive toolbox, the system actively guides the analysis. It doesn't just ask "What model do you want?", it suggests "Your data has outliers, let's fix them first" and "These 3 variables are the most predictive."
    • Differentiation: Moves from "User-Driven Analysis" to "Assisted Analysis", reducing the risk of statistical errors by non-experts.
  • Explainable AI (XAI) for Quality:

    • Concept: Using advanced algorithms (Isolation Forest) not just to remove bad data, but to explain why it's bad visually.
    • Differentiation: Makes complex ML concepts accessible to domain experts (e.g., Quality Engineers) who understand the context but not necessarily the algorithm.

Market Context & Competitive Landscape

  • Legacy Players: Minitab, SPSS (Powerful but expensive, dated UI, steep learning curve).
  • Modern Data Tools: Tableau, PowerBI (Great for visualization, weak for advanced statistical regression).
  • Code-Based: Jupyter, Streamlit (Powerful but requires coding skills).
  • Opportunity: Data_analysis fills the gap for a Modern, Web-Based, Statistical Power Tool for non-coders.

Validation Approach

  • User Testing: Compare time-to-insight between Data_analysis and Minitab for a standard regression task.
  • Side-by-Side Benchmark: Run the same dataset through Minitab and Data_analysis to validate numerical accuracy (ensure results match to 4 decimal places).

Risk Mitigation

  • "Black Box" Trust: Users might not trust automated suggestions.
    • Mitigation: Always provide a "Show Details" view with raw statistical metrics (p-values) to prove the "why".
  • Performance: Python backend might lag on large Excel files.
    • Mitigation: Implement an asynchronous task queue (Celery) and progressive loading for the frontend grid.

Web App Specific Requirements

Project-Type Overview

As a scientific web application, Data_analysis prioritizes data integrity and high-performance interactivity. The technical architecture must support heavy client-side state management (for the grid) while leveraging robust backend statistical processing.

Technical Architecture Considerations

  • Rendering Strategy:
    • Shell & Reports: Next.js Server Components for optimized performance and SEO (if public).
    • Data Grid: React Client Components to manage complex state transitions, cell editing, and local filtering with sub-second latency.
  • Data Persistence:
    • Session-based Workspace: Users work on a "Project" basis; files are uploaded to temporary storage for analysis, with an option to persist to a database (PostgreSQL) for long-term tracking.
  • Browser Strategy: Support for modern "Evergreen" browsers (Chrome, Edge, Firefox, Safari). High-performance features like Web Workers may be used for local data transformations.

Functional Requirements (Web-Specific)

  • Excel-like Interactions: Support for keyboard shortcuts (Ctrl+C/V, Undo/Redo), drag-to-fill (Growth), and multi-cell selection.
  • Responsive Analysis: The interface must adapt for "Marc's Journey" (Manager/Consumer) on tablets, while ensuring "Julien's Journey" (Analyst) is optimized for high-resolution desktop displays.
  • Accessibility: Adherence to WCAG 2.1 principles for the UI shell, with specific focus on keyboard-only navigation for the data entry grid.

Implementation Considerations

  • Security: JWT-based authentication for Sarah's (Admin) user management. All data uploads must be scanned for malicious macros/content.
  • Stateless Backend: The Python API (FastAPI) will remain largely stateless, receiving data via secure requests and returning analytical results/visualizations in JSON/Base64 format.

Project Scoping & Phased Development

MVP Strategy & Philosophy

MVP Approach: Experience MVP - Stateless & Fast. Core Value: Deliver a "Zero-Setup" analytical tool where users get results instantly without creating accounts or managing projects. Focus on the quality of the interaction and the analysis report.

MVP Feature Set (Phase 1)

Core User Journeys Supported:

  • Julien (Analyst): Full flow from upload to regression report.
  • Marc (Manager): Reading the generated PDF report.
  • (Deferred) Sarah (Admin): No admin dashboard needed yet as the system is stateless/public.

Must-Have Capabilities:

  • Input: Drag & Drop Excel/CSV parser (Pandas).
  • Interaction: Interactive Data Grid (Read/Write) for quick cleaning/filtering.
  • Analysis Core:
    • Automated Outlier Detection (Isolation Forest).
    • Automated Feature Selection (RFE).
    • Models: Linear Regression (Simple/Multiple), Logistic Regression, Correlation Matrix.
  • Output: Interactive Web Report + One-click PDF Export.

Post-MVP Features

Phase 2 (Growth - "Project Mode"):

  • User Accounts & Project Persistence (PostgreSQL).
  • Admin Dashboard for resource monitoring.
  • Advanced Models: Time Series, ANOVA.

Phase 3 (Expansion - "Enterprise"):

  • Collaboration (Real-time editing).
  • Direct connectors (SQL Database, Salesforce).
  • On-premise deployment options (Docker).

Risk Mitigation Strategy

Technical Risks:

  • Grid Performance: Using a robust React Data Grid library (TanStack Table or AG Grid Community) to handle DOM virtualization for 50k rows.
  • Stateless Memory: Limiting file upload size (e.g., 50MB) to prevent RAM saturation since we aren't using a DB yet.

Market Risks:

  • Trust: Ensuring the PDF report looks professional enough to be accepted in a formal meeting (Marc's journey).

Functional Requirements

Data Ingestion & Management

  • FR1: Users can upload datasets in .xlsx, .xls, and .csv formats via drag-and-drop or file selection.
  • FR2: System automatically detects column data types (numeric, categorical, datetime) upon ingest.
  • FR3: Users can manually override detected data types if the inference is incorrect.
  • FR4: Users can rename columns directly in the interface to sanitize inputs.
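
A sketch of the FR2 type inference (the coarse category labels and fallback order are illustrative; FR3's manual override would simply replace an entry in the returned mapping):

```python
import pandas as pd

def infer_column_types(df: pd.DataFrame) -> dict[str, str]:
    """Infer a coarse type per column: numeric, datetime, or categorical."""
    types = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            types[col] = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(s):
            types[col] = "datetime"
        else:
            # Fallback: try datetime parsing; otherwise treat as categorical.
            parsed = pd.to_datetime(s, errors="coerce")
            types[col] = "datetime" if parsed.notna().all() else "categorical"
    return types

df = pd.DataFrame({"temp": [20.5, 21.0], "batch": ["A", "B"],
                   "ts": ["2026-01-10", "2026-01-11"]})
print(infer_column_types(df))
# {'temp': 'numeric', 'batch': 'categorical', 'ts': 'datetime'}
```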

Interactive Data Grid (Workspace)

  • FR5: Users can view loaded data in a paginated, virtualized grid capable of displaying 50,000+ rows.
  • FR6: Users can edit cell values directly (double-click to edit) with inputs validated against the column type.
  • FR7: Users can sort columns (asc/desc) and filter rows based on values/conditions (e.g., "> 100").
  • FR8: Users can perform Undo/Redo operations (Ctrl+Z/Ctrl+Y) on data edits within the current session.
  • FR9: Users can exclude specific rows from analysis without deleting them (soft delete/toggle).

Automated Data Preparation (Smart Prep)

  • FR10: System automatically identifies univariate outliers using IQR/Z-score and visualizes them in the grid/plots.
  • FR11: System automatically identifies multivariate outliers using Isolation Forest upon user request.
  • FR12: Users can accept or reject outlier exclusion proposals individually or in bulk.
  • FR13: Users can select a Target Variable (Y) to trigger an automated Feature Importance analysis.
  • FR14: System recommends the Top-N predictive features based on RFE (Recursive Feature Elimination) or Random Forest importance.
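
The univariate half of FR10 can be sketched with the classic 1.5×IQR fence (the multiplier `k` is the conventional default, exposed here as a parameter):

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask: True where the value falls outside
    the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

s = pd.Series([10, 11, 9, 10, 12, 11, 10, 95])  # 95 is the planted outlier
mask = iqr_outlier_mask(s)
print(s[mask].tolist())  # [95]
```

Returning a mask rather than dropping rows fits FR9/FR12: the grid can highlight flagged cells and let the user accept or reject each exclusion.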

Statistical Modeling (Core Analytics)

  • FR15: Users can configure a Linear Regression (Simple/Multiple) by selecting Dependent (Y) and Independent (X) variables.
  • FR16: Users can configure a Binary Logistic Regression for categorical target variables.
  • FR17: System generates a "Model Summary" including R-squared, Adjusted R-squared, F-statistic, and P-values for coefficients.
  • FR18: System generates standard diagnostic plots: Residuals vs Fitted, Q-Q Plot, and Scale-Location.
  • FR19: Users can view a Correlation Matrix (Heatmap) for selected numeric variables.

Reporting & Reproducibility

  • FR20: Users can view an interactive "Analysis Report" dashboard summarizing data health, methodology, and model results.
  • FR21: Users can export the full report as a branded PDF document.
  • FR22: System appends an "Audit Trail" to the report listing library versions, random seeds, and data exclusion steps for reproducibility.

Non-Functional Requirements

Performance

  • Grid Latency: The interactive data grid must render 50,000 rows with filtering/sorting response times under 200ms (Client-Side Virtualization).
  • Analysis Throughput: Automated analysis (Outlier Detection + Feature Selection) on standard datasets (<10MB) must complete in under 15 seconds.
  • Upload Speed: Parsing and validation of a 5MB Excel file should complete in under 3 seconds.

Security & Privacy

  • Data Ephemerality: All user datasets uploaded to the temporary workspace must be permanently purged from the server memory/storage after 1 hour of inactivity or immediately upon session termination.
  • Transport Security: All data transmission between Client and Server must be encrypted via TLS 1.3.
  • Input Sanitization: The file parser must strictly validate MIME types and file signatures to prevent malicious code execution (e.g., Excel Macros).
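
Signature validation could start with a magic-byte check like the sketch below. The signature table is illustrative; a real validator should also reject macro-enabled formats (.xlsm) and inspect the zip contents of .xlsx archives:

```python
MAGIC = {
    b"PK\x03\x04": "xlsx",       # .xlsx files are zip archives
    b"\xd0\xcf\x11\xe0": "xls",  # legacy OLE2 container
}

def sniff_format(header: bytes) -> str:
    """Classify an upload from its leading bytes, rejecting unknown binaries."""
    for sig, kind in MAGIC.items():
        if header.startswith(sig):
            return kind
    # CSV has no signature: accept only if the header decodes as plain text.
    try:
        header.decode("utf-8")
        return "csv"
    except UnicodeDecodeError:
        raise ValueError("unrecognized or potentially malicious file")

print(sniff_format(b"PK\x03\x04rest-of-zip"))  # xlsx
print(sniff_format(b"col_a,col_b\n1,2\n"))     # csv
```

Crucially, this check runs on the raw bytes, so it cannot be fooled by a renamed file extension or a spoofed Content-Type header.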

Reliability & Stability

  • Graceful Degradation: The system must handle "bad data" (NaNs, infinite values) by providing clear error messages rather than crashing the Python backend (500 Internal Error).
  • Concurrency: The backend must support at least 50 concurrent analysis requests without performance degradation, using an asynchronous task queue (Celery).
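
The graceful-degradation requirement amounts to a pre-flight check before model fitting; a sketch (function name and message wording are illustrative):

```python
import numpy as np
import pandas as pd

def check_analyzable(df: pd.DataFrame) -> list[str]:
    """Report NaN/infinite values per numeric column as human-readable
    messages instead of letting them crash the model-fitting code."""
    problems = []
    for col in df.select_dtypes("number").columns:
        n_nan = int(df[col].isna().sum())
        n_inf = int(np.isinf(df[col]).sum())
        if n_nan:
            problems.append(f"'{col}': {n_nan} missing value(s)")
        if n_inf:
            problems.append(f"'{col}': {n_inf} infinite value(s)")
    return problems

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, np.inf, 6.0]})
print(check_analyzable(df))
# ["'x': 1 missing value(s)", "'y': 1 infinite value(s)"]
```

If the returned list is non-empty, the API responds with a 422 and these messages rather than a 500, and the grid can highlight the offending cells.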

Accessibility

  • Keyboard Navigation: The data grid must be fully navigable via keyboard (Arrows, Tab, Enter) to support "Power User" workflows efficiently.