| stepsCompleted | inputDocuments | workflowType | project_name | user_name | date |
|---|---|---|---|---|---|
| | | architecture | Data_analysis | Sepehr | 2026-01-10 |
Architecture Decision Document
This document is built collaboratively through step-by-step discovery. Sections are appended as we work through each architectural decision together.
Project Context Analysis
Requirements Overview
Functional Requirements: The system requires a data processing pipeline that ingests Excel and CSV files, performs automated statistical analysis (outlier detection, Recursive Feature Elimination (RFE)), and renders interactive visualizations. The frontend must support a high-performance, editable grid ("Smart Grid") that mimics spreadsheet behavior.
Non-Functional Requirements:
- Performance: Sub-second response times for grid interactions on datasets up to 50k rows.
- Stateless Architecture: Phase 1 requires no persistent user data storage; sessions are ephemeral.
- Scientific Rigor: Reproducibility of results is paramount, requiring strict versioning of libraries and random seeds.
- Security: Secure file handling and transport (TLS 1.3) are mandatory.
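The reproducibility requirement can be made concrete by recording the seed and library versions next to each analysis run. The following is a minimal sketch using only the standard library; `build_run_manifest`, `seeded_sample`, and the manifest keys are hypothetical names chosen for illustration, not part of the project.

```python
import platform
import random
from importlib import metadata


def build_run_manifest(seed: int, packages: list[str]) -> dict:
    """Record what is needed to rerun an analysis (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "package_versions": versions,
    }


def seeded_sample(seed: int, n: int) -> list[float]:
    """Draw n pseudo-random floats from a generator local to this run."""
    rng = random.Random(seed)  # local generator: no shared global state
    return [rng.random() for _ in range(n)]
```

Two calls to `seeded_sample` with the same seed return identical values. In the scientific stack, the same seed would typically be passed to estimators through scikit-learn's `random_state` parameter.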
Scale & Complexity:
- Primary Domain: Scientific Web Application (Full-stack).
- Complexity Level: Medium. The complexity lies in the bridge between the interactive frontend and the computational backend, ensuring synchronization and performance.
- Estimated Architectural Components: ~5 Core Components (Frontend Shell, Data Grid, Visualization Engine, API Gateway, Computational Worker).
Technical Constraints & Dependencies
- Backend: Python is mandatory for the scientific stack (Pandas, Scikit-learn, Statsmodels).
- Frontend: Next.js 16 with React Server Components (for shell) and Client Components (for grid).
- UI Library: Shadcn UI + TanStack Table (headless) + Recharts.
- Deployment: Must support containerized deployment (Docker) for reproducibility.
Cross-Cutting Concerns Identified
- Data Serialization: Efficient transfer of large datasets (JSON/Arrow) between Python backend and React frontend.
- State Management: Synchronizing the client-side grid state with the server-side analysis context.
- Error Handling: Unifying error reporting from the Python backend to the React UI (e.g., "Singular Matrix" error).
Starter Template Evaluation
Primary Technology Domain
Scientific Data Application (Full-stack Next.js + FastAPI) optimized for self-hosting.
Selected Starter: Custom FastAPI-Next.js-Docker Boilerplate
Rationale for Selection: Explicitly chosen to support a "Two-Service" deployment model on a Homelab infrastructure. This ensures process isolation between the analytical Python engine and the React UI.
Architectural Decisions Provided by Starter:
- Language & Runtime: Python 3.12 (Backend managed by uv) and Node.js 20 (Frontend).
- Styling Solution: Tailwind CSS with Shadcn UI.
- Testing: Pytest (Backend) and Vitest (Frontend).
- Code Organization: Clean Monorepo with separated service directories.
Deployment Strategy (Homelab):
- Frontend Service: Next.js in Standalone mode (Docker).
- Backend Service: FastAPI with Uvicorn (Docker).
- Communication: Internal Docker network for API requests to minimize latency.
Core Architectural Decisions
Decision Priority Analysis
Critical Decisions (Block Implementation):
- Data Serialization Protocol: Apache Arrow (IPC Stream) is mandatory for performance.
- State Management Strategy: Hybrid (TanStack Query for Async + Zustand for UI State).
- Container Strategy: Docker Compose with isolated networks for Homelab deployment.
Data Architecture
- Format: Apache Arrow (IPC Stream) for grid data; JSON for control plane.
- Validation: Pydantic (v2) for all JSON payloads.
- Persistence: None (Stateless) for Phase 1. Python's `tempfile` module provides transient storage during analysis.
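As an illustration of transient storage with `tempfile`, the sketch below writes an upload into a temporary directory that is removed as soon as processing ends. The function name `process_upload` and the line-counting "analysis" are placeholders assumed for the example, not the project's actual pipeline.

```python
import tempfile
from pathlib import Path


def process_upload(raw_bytes: bytes, filename: str) -> int:
    """Write an upload to a temp dir, 'analyze' it, and leave nothing behind."""
    # The directory and everything in it are deleted when the block exits,
    # even if the analysis raises.
    with tempfile.TemporaryDirectory(prefix="analysis_") as tmp_dir:
        # Keep only the final path component to avoid path traversal.
        target = Path(tmp_dir) / Path(filename).name
        target.write_bytes(raw_bytes)
        # Placeholder for the real analysis: count non-empty lines.
        text = target.read_text(encoding="utf-8")
        return sum(1 for line in text.splitlines() if line.strip())
```

Because cleanup is tied to the `with` block, no file survives the request, which matches the stateless Phase 1 constraint.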
API & Communication Patterns
- Protocol: REST API (FastAPI) with `StreamingResponse` for data export.
- Serialization: `pyarrow.ipc.new_stream` on backend -> `tableFromIPC` on frontend.
- CORS: Strictly configured to allow only the Homelab domain (e.g., `data.home`).
Frontend Architecture
- State Manager:
  - Zustand (v5): For high-frequency grid state (selection, edits).
  - TanStack Query (v5): For analytical job status and data fetching.
- Component Architecture: "Smart Grid" pattern where the Grid component subscribes directly to the Zustand store to avoid re-rendering the entire page.
Infrastructure & Deployment
- Containerization: Multi-stage Docker builds to keep images light (distroless/python and node-alpine).
- Orchestration: Docker Compose file defining `frontend`, `backend`, and a shared `network`.
Implementation Patterns & Consistency Rules
Pattern Categories Defined
Critical Conflict Points Identified: 5 major areas where AI agents must align to prevent implementation divergence.
Naming Patterns
- Backend (Python): Strict `snake_case` for modules, functions, and variables (PEP 8).
- Frontend (TSX): `PascalCase` for Components (`SmartGrid.tsx`), `camelCase` for hooks and utilities.
- API / JSON: `snake_case` for all keys to maintain 1:1 mapping with Pandas DataFrame columns and Pydantic models.
Structure Patterns
- Project Organization: Co-located logic. Features are grouped in folders: `/features/data-grid`, `/features/analysis-engine`.
- Test Location: Centralized `/tests` directory at the service root (e.g., `backend/tests/`, `frontend/tests/`) to simplify Docker test runs.
Format Patterns
- API Response Wrapper:
  - Success: `{ "status": "success", "data": ..., "metadata": {...} }`.
  - Error: `{ "status": "error", "message": "User-friendly message", "code": "TECHNICAL_ERROR_CODE" }`.
- Date Format: ISO 8601 strings (`YYYY-MM-DDTHH:mm:ssZ`) in UTC.
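The wrapper and date conventions above can be sketched with a few plain helpers. The function names and the `SINGULAR_MATRIX` code in the usage note are hypothetical; in the project itself these shapes would presumably be expressed as Pydantic v2 models rather than bare dictionaries.

```python
from datetime import datetime, timezone
from typing import Any


def utc_iso(dt: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before formatting")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def success_response(data: Any, **metadata: Any) -> dict:
    """Build the success envelope; keyword arguments become metadata."""
    return {"status": "success", "data": data, "metadata": metadata}


def error_response(message: str, code: str) -> dict:
    """Build the error envelope with a user-facing message and a stable code."""
    return {"status": "error", "message": message, "code": code}
```

For example, a failed regression might be reported as `error_response("The design matrix is singular.", "SINGULAR_MATRIX")`, giving the React UI a readable message while keeping a machine-readable code for branching.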
Process Patterns
- Loading States: Standardized `isLoading` and `isProcessing` flags in Zustand/TanStack Query.
- Validation:
  - Backend: Pydantic v2.
  - Frontend: Zod (synchronized with Pydantic via OpenAPI generator).
Enforcement Guidelines
All AI Agents MUST:
- Check for existing Pydantic models before creating new ones.
- Use the `logger` utility instead of `print()` or `console.log`.
- Add JSDoc/Docstrings to every exported function.
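The document does not define the backend `logger` utility, so the following is only one possible shape for it, built on the standard `logging` module. The name `get_logger` and the log format are assumptions made for the example.

```python
import logging
import sys


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a configured logger; safe to call repeatedly (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.propagate = False  # do not double-log through the root logger
    logger.setLevel(level)
    return logger
```

A module would then call `log = get_logger(__name__)` once and use `log.info(...)` where it might otherwise reach for `print()`.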