# Document Translation API - Architecture Overview
## System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                     │
│                          (main.py)                          │
└─────────────────────┬───────────────────────────────────────┘
                      ├──> File Upload Endpoint (/translate)
                      │    ├─> File Validation
                      │    ├─> File Type Detection
                      │    └─> Route to Appropriate Translator
                      ├──> Batch Translation (/translate-batch)
                      └──> Utility Endpoints
                           ├─> /health
                           ├─> /languages
                           └─> /download/{filename}
┌─────────────────────────────────────────────────────────────┐
│                      Translation Layer                      │
└─────────────────────┬───────────────────────────────────────┘
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
      Excel          Word       PowerPoint
   Translator    Translator    Translator
     (.xlsx)       (.docx)       (.pptx)
        │             │             │
        └─────────────┼─────────────┘
┌─────────────────────────────────────────────────────────────┐
│               Translation Service Abstraction               │
│                     (Pluggable Backend)                     │
└─────────────────────┬───────────────────────────────────────┘
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
     Google         DeepL    LibreTranslate
    Translate     (API Key)   (Self-hosted)
```
## Component Breakdown
### 1. API Layer (`main.py`)
- **FastAPI Application**: RESTful API endpoints
- **File Upload Handling**: Multipart form data processing
- **Request Validation**: Pydantic models for type safety
- **Error Handling**: Custom exception handlers
- **CORS Configuration**: Cross-origin resource sharing
### 2. Translation Coordinators
#### Excel Translator (`translators/excel_translator.py`)
```
Input: .xlsx file
Process:
1. Load workbook with openpyxl (formulas preserved; VBA macros exist only in .xlsm files, which are outside the .xlsx input)
2. Iterate through all worksheets
3. For each cell:
- Detect type (text, formula, number)
- If text: translate
- If formula: extract and translate strings
- Preserve: formatting, colors, borders, merges
4. Translate sheet names
5. Maintain image positions
Output: Translated .xlsx with identical structure
```
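The formula branch of step 3 can be sketched with a regular expression that touches only double-quoted string literals and leaves function names, cell references, and operators alone. This is a minimal sketch, not the code in `excel_translator.py`: the name `translate_formula_strings` and the `translate` callable are hypothetical, and the pattern assumes Excel's convention that a quote inside a literal is written as two quotes.

```python
import re
from typing import Callable

# Matches an Excel string literal; "" inside a literal is an escaped quote.
_STRING_LITERAL = re.compile(r'"((?:[^"]|"")*)"')


def translate_formula_strings(formula: str, translate: Callable[[str], str]) -> str:
    """Translate only the quoted string literals inside a formula."""
    if not formula.startswith("="):
        return formula  # not a formula; the caller handles plain text

    def _replace(match: re.Match) -> str:
        text = match.group(1).replace('""', '"')  # unescape doubled quotes
        if not text.strip():
            return match.group(0)  # keep empty or whitespace-only literals
        translated = translate(text).replace('"', '""')  # re-escape quotes
        return f'"{translated}"'

    return _STRING_LITERAL.sub(_replace, formula)
```

With `str.upper` standing in for a real translation call, `=IF(A1>0,"Yes","No")` becomes `=IF(A1>0,"YES","NO")`, while `IF` and `A1` are untouched.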
#### Word Translator (`translators/word_translator.py`)
```
Input: .docx file
Process:
1. Load document with python-docx
2. Traverse document tree:
- Paragraphs → Runs (preserve formatting per run)
- Tables → Cells → Paragraphs
- Headers/Footers (all section types)
3. Translate text while preserving:
- Font family, size, color
- Bold, italic, underline
- Lists (numbered/bulleted)
- Styles (Heading 1, Normal, etc.)
4. Images remain embedded via relationships
Output: Translated .docx with preserved layout
```
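The run-level part of step 2 can be sketched without a real document. python-docx `Run` objects expose a writable `text` attribute, so reassigning only that attribute leaves the run's formatting alone. The helper name `translate_runs` and the `SimpleNamespace` stand-ins are assumptions made for illustration; they are not taken from `word_translator.py`.

```python
from types import SimpleNamespace
from typing import Callable, Iterable


def translate_runs(runs: Iterable, translate: Callable[[str], str]) -> int:
    """Translate each run's text in place and return how many runs changed.

    Only `run.text` is reassigned, so formatting stored on the run object
    (bold, italic, font, color) is left as it was.
    """
    changed = 0
    for run in runs:
        if not run.text or not run.text.strip():
            continue  # skip empty and whitespace-only runs
        run.text = translate(run.text)
        changed += 1
    return changed


# Stand-ins for document runs; a real caller would pass `paragraph.runs`.
demo_runs = [
    SimpleNamespace(text="Hello ", bold=True),
    SimpleNamespace(text="  ", bold=False),
    SimpleNamespace(text="world", bold=False),
]
demo_count = translate_runs(demo_runs, str.upper)
```

One trade-off of this approach: when a sentence is split across several runs, the provider receives fragments rather than the whole sentence, which can lower translation quality.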
#### PowerPoint Translator (`translators/pptx_translator.py`)
```
Input: .pptx file
Process:
1. Load presentation with python-pptx
2. For each slide:
- Shapes → Text Frames → Paragraphs → Runs
- Tables → Cells → Text Frames
- Groups → Nested Shapes
- Speaker Notes
3. Preserve:
- Slide layouts
- Animations (timing, effects)
- Transitions
- Image positions and layering
- Shape properties (size, position, rotation)
Output: Translated .pptx with identical design
```
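The "Groups → Nested Shapes" traversal in step 2 is naturally recursive. The sketch below is duck-typed: it relies on python-pptx group shapes exposing their members through a `shapes` attribute and on text-bearing shapes reporting `has_text_frame`. The function name `iter_text_frames` and the `SimpleNamespace` stand-ins are hypothetical, and tables and speaker notes would need their own branches.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text_frames(shapes: Iterable) -> Iterator:
    """Yield every text frame under `shapes`, descending into groups."""
    for shape in shapes:
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Group shape: recurse into its nested shapes.
            yield from iter_text_frames(children)
        elif getattr(shape, "has_text_frame", False):
            yield shape.text_frame


# Stand-ins for slide shapes; strings play the role of text frames.
demo_shapes = [
    SimpleNamespace(has_text_frame=True, text_frame="title"),
    SimpleNamespace(has_text_frame=False),  # e.g. a picture
    SimpleNamespace(shapes=[
        SimpleNamespace(has_text_frame=True, text_frame="nested"),
        SimpleNamespace(shapes=[
            SimpleNamespace(has_text_frame=True, text_frame="deep"),
        ]),
    ]),
]
demo_frames = list(iter_text_frames(demo_shapes))
```

A real caller would pass `slide.shapes` and then walk each frame's paragraphs and runs.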
### 3. Translation Service Layer
**Abstract Interface**: `TranslationProvider`
- Allows swapping translation backends without changing translators
- Configurable via environment variables
**Implementations**:
1. **Google Translator** (Default, Free)
- Uses deep-translator library
- No API key required
- Rate limited
2. **DeepL** (Premium, API Key Required)
- Higher quality translations
- Better context understanding
- Requires paid API key
3. **LibreTranslate** (Self-hosted)
- Open-source alternative
- Full control and privacy
- Requires local installation
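A minimal sketch of how such a pluggable interface might look. Only the name `TranslationProvider` comes from this document; the `translate` signature, the `EchoProvider` placeholder, the `PROVIDERS` registry, and the `TRANSLATION_PROVIDER` environment variable are assumptions made for illustration.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional


class TranslationProvider(ABC):
    """Interface every translation backend implements."""

    @abstractmethod
    def translate(self, text: str, source: str, target: str) -> str:
        """Return `text` translated from `source` to `target`."""


class EchoProvider(TranslationProvider):
    """Placeholder backend that tags text instead of translating it."""

    def translate(self, text: str, source: str, target: str) -> str:
        return f"[{target}] {text}"


# Hypothetical registry; real entries would wrap Google, DeepL, LibreTranslate.
PROVIDERS = {"echo": EchoProvider}


def get_provider(name: Optional[str] = None) -> TranslationProvider:
    """Build the provider named by `name` or by TRANSLATION_PROVIDER."""
    key = (name or os.environ.get("TRANSLATION_PROVIDER", "echo")).lower()
    try:
        return PROVIDERS[key]()
    except KeyError:
        raise ValueError(f"Unknown translation provider: {key!r}") from None
```

Because translators depend only on the abstract class, adding a backend means writing one subclass and one registry entry.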
### 4. Utility Layer
#### File Handler (`utils/file_handler.py`)
- File validation (size, type)
- Unique filename generation (UUID-based)
- Safe file operations
- Cleanup management
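The validation and naming duties above can be sketched with the standard library alone. The function names and the 10 MiB limit are assumptions, not values taken from `utils/file_handler.py`; the `{unique_id}_{filename}` shape follows the upload path shown under Data Flow.

```python
import uuid
from pathlib import Path

ALLOWED_EXTENSIONS = {".xlsx", ".docx", ".pptx"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed 10 MiB limit, not from the source


def validate_upload(filename: str, size: int) -> str:
    """Return the lowercase extension, or raise ValueError if rejected."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or 'none'}")
    if size > MAX_FILE_SIZE:
        raise ValueError(f"File exceeds {MAX_FILE_SIZE} bytes")
    return extension


def unique_filename(filename: str) -> str:
    """Prefix the bare file name with a UUID so uploads never collide."""
    safe_name = Path(filename).name  # drops directory components
    return f"{uuid.uuid4().hex}_{safe_name}"
```

Checking the extension is only a first filter; it does not prove the file really is an Office document.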
#### Exception Handling (`utils/exceptions.py`)
- Custom exception types
- HTTP status code mapping
- User-friendly error messages
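One way to tie exception types to HTTP status codes is to store the code on the class. The class names, the chosen codes, and the `to_response` helper below are illustrative assumptions, not the contents of `utils/exceptions.py`.

```python
class TranslationAPIError(Exception):
    """Base class carrying the HTTP status code to report."""

    status_code = 500

    def __init__(self, detail: str):
        super().__init__(detail)
        self.detail = detail


class UnsupportedFileTypeError(TranslationAPIError):
    status_code = 415  # Unsupported Media Type


class FileTooLargeError(TranslationAPIError):
    status_code = 413  # Content Too Large


class TranslationFailedError(TranslationAPIError):
    status_code = 502  # the upstream translation backend failed


def to_response(exc: Exception) -> tuple[int, dict]:
    """Map an exception to (status code, JSON-serializable body)."""
    if isinstance(exc, TranslationAPIError):
        return exc.status_code, {"error": type(exc).__name__, "detail": exc.detail}
    # Unknown errors get a generic message so internals are not leaked.
    return 500, {"error": "InternalServerError", "detail": "Unexpected error"}
```

A framework-level exception handler could call `to_response` and build the HTTP reply from the returned pair.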
### 5. Configuration (`config.py`)
- Environment variable loading
- Directory management
- Service configuration
- Validation rules
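A sketch of how these responsibilities could fit in one small settings object. Every variable name and default here (`UPLOAD_DIR`, `OUTPUT_DIR`, `MAX_FILE_SIZE_MB`, `TRANSLATION_PROVIDER`) is an assumption chosen for illustration; the actual `config.py` may use different names.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    upload_dir: Path
    output_dir: Path
    max_file_size_mb: int
    translation_provider: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Read settings from `env` (defaults to the process environment)."""
        env = os.environ if env is None else env
        max_size = int(env.get("MAX_FILE_SIZE_MB", "10"))
        if max_size <= 0:
            raise ValueError("MAX_FILE_SIZE_MB must be positive")
        return cls(
            upload_dir=Path(env.get("UPLOAD_DIR", "uploads")),
            output_dir=Path(env.get("OUTPUT_DIR", "outputs")),
            max_file_size_mb=max_size,
            translation_provider=env.get("TRANSLATION_PROVIDER", "google").lower(),
        )

    def ensure_directories(self) -> None:
        """Create the upload and output directories if they are missing."""
        for directory in (self.upload_dir, self.output_dir):
            directory.mkdir(parents=True, exist_ok=True)
```

Passing a plain dictionary to `from_env` keeps the loader easy to test without touching the real environment.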
## Data Flow
### Single Document Translation
```
1. Client uploads file via POST /translate
└─> File + target_language + source_language
2. API validates request
├─> Check file extension
├─> Verify file size
└─> Validate language codes
3. Save to temporary storage
└─> uploads/{unique_id}_{filename}
4. Route to appropriate translator
├─> .xlsx → ExcelTranslator
├─> .docx → WordTranslator
└─> .pptx → PowerPointTranslator
5. Translator processes document
├─> Parse structure
├─> Extract text elements
├─> Call translation service for each text
├─> Apply translations while preserving formatting
└─> Save to outputs/{unique_id}_translated_{filename}
6. Return translated file
└─> FileResponse with download headers
7. Cleanup (optional)
└─> Delete uploaded file
```
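Step 4 of the flow amounts to a lookup from file extension to translator class. The sketch below uses empty stand-in classes with the names from the diagram; `BaseTranslator` and `select_translator` are hypothetical and the real classes live in the `translators/` package.

```python
from pathlib import Path


class BaseTranslator:
    """Stand-in for the real translator classes; does no translation."""


class ExcelTranslator(BaseTranslator):
    """Placeholder for translators/excel_translator.py."""


class WordTranslator(BaseTranslator):
    """Placeholder for translators/word_translator.py."""


class PowerPointTranslator(BaseTranslator):
    """Placeholder for translators/pptx_translator.py."""


TRANSLATORS = {
    ".xlsx": ExcelTranslator,
    ".docx": WordTranslator,
    ".pptx": PowerPointTranslator,
}


def select_translator(filename: str) -> BaseTranslator:
    """Instantiate the translator registered for the file's extension."""
    extension = Path(filename).suffix.lower()
    try:
        return TRANSLATORS[extension]()
    except KeyError:
        raise ValueError(f"No translator for extension {extension!r}") from None
```

Keeping the mapping in one dictionary means a new document type needs only a new class and one extra entry.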
## Formatting Preservation Strategies
### Excel
- **Cell Properties**: Copied before translation
- **Merged Cells**: Detected via `worksheet.merged_cells.ranges`
- **Formulas**: Regex parsing to extract strings
- **Images**: Anchored to cells, preserved via relationships
- **Charts**: Remain linked to data ranges
### Word
- **Run-level Translation**: Preserves inline formatting
- **Style Inheritance**: Paragraph styles maintained
- **Tables**: Structure preserved, cells translated individually
- **Images**: Embedded via relationships, not modified
- **Headers/Footers**: Treated as separate sections
### PowerPoint
- **Shape Hierarchy**: Recursive traversal
- **Text Frames**: Paragraph and run-level translation
- **Layouts**: Template references preserved
- **Animations**: Stored separately, not affected
- **Media**: File references remain intact
## Scalability Considerations
### Horizontal Scaling
- Stateless design (no session storage)
- Files stored on disk (can move to S3/Azure Blob)
- Load balancer compatible
### Performance Optimization
- **Async I/O**: FastAPI's async capabilities
- **Batch Processing**: Multiple files in parallel
- **Caching**: Translation cache for repeated text (planned; see Extension Points)
- **Streaming**: Large file chunking (future enhancement)
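A translation cache for repeated text can be as small as a dictionary keyed by text and target language. This is an in-memory sketch under assumptions: the class name `CachingTranslator` and the two-argument `translate(text, target)` callable are hypothetical, and a shared store such as Redis would be needed once several API instances run side by side.

```python
from typing import Callable


class CachingTranslator:
    """Wrap a translate callable and remember results per (text, target)."""

    def __init__(self, translate: Callable[[str, str], str]):
        self._translate = translate
        self._cache: dict[tuple[str, str], str] = {}
        self.misses = 0

    def __call__(self, text: str, target: str) -> str:
        key = (text, target)
        if key not in self._cache:
            self.misses += 1  # only cache misses reach the backend
            self._cache[key] = self._translate(text, target)
        return self._cache[key]
```

Spreadsheets tend to repeat the same header and label strings many times, so even this simple cache can cut the number of backend calls noticeably.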
### Resource Management
- **File Cleanup**: Automatic deletion after translation
- **Size Limits**: Configurable max file size
- **Rate Limiting**: Prevent API abuse (not yet implemented)
- **Queue System**: Redis-based job queue (future)
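File cleanup can also run as a periodic sweep that removes anything older than a cutoff, which catches files left behind by failed requests. The function name `remove_stale_files` and the age-based policy are assumptions for illustration, not the project's documented behavior.

```python
import time
from pathlib import Path


def remove_stale_files(directory: Path, max_age_seconds: float) -> list[Path]:
    """Delete regular files older than `max_age_seconds`; return what was removed."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for path in directory.iterdir():
        # Compare the last-modified time against the cutoff; skip subdirectories.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_stale_files(Path("uploads"), 3600)` from a scheduled task would drop uploads that are more than an hour old.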
## Future MCP Integration
### MCP Server Wrapper
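The API is designed to be wrapped as a Model Context Protocol (MCP) server:
```python
# Proposed MCP tools (signatures only; return types are illustrative)
def translate_document(file_path: str, target_lang: str) -> str:
    """Translate a document and return the translated file's path."""
    ...

def get_supported_languages() -> list[str]:
    """Return the supported language codes."""
    ...

def check_api_health() -> dict:
    """Return the API health status."""
    ...
```
**Benefits**:
- AI assistants can translate documents seamlessly
- Integration with Claude, GPT, and other LLMs
- Workflow automation in AI pipelines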
## Security Architecture
### Input Validation
- File type whitelist
- Size restrictions
- Extension verification
- Content-type checking
### File Isolation
- Unique filenames (UUID)
- Temporary storage
- Automatic cleanup
- No path traversal
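The "no path traversal" guarantee matters most for `/download/{filename}`, where the name comes straight from the URL. A minimal sketch of the check, assuming a hypothetical helper `resolve_download` and Python 3.9+ for `Path.is_relative_to`:

```python
from pathlib import Path


def resolve_download(base_dir: Path, filename: str) -> Path:
    """Resolve `filename` under `base_dir`, rejecting anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / filename).resolve()
    # `..` segments and absolute paths both resolve to a location outside base.
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError("Path traversal attempt rejected")
    return candidate
```

Resolving before comparing is the important step: a plain string prefix check would accept `outputs/../secret.txt`.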
### API Security (Production)
- Rate limiting (not yet implemented)
- Authentication/Authorization (future)
- HTTPS/TLS encryption (deployment config)
- Input sanitization
## Deployment Architecture
### Development
```
Local Machine
├─> Python 3.11+
├─> Virtual Environment
├─> SQLite (if needed for tracking)
└─> Local file storage
```
### Production (Recommended)
```
Cloud Platform (AWS/Azure/GCP)
├─> Container (Docker)
├─> Load Balancer
├─> Multiple API Instances
├─> Object Storage (S3/Blob)
├─> Redis (caching/queue)
├─> Monitoring (Prometheus/Grafana)
└─> Logging (ELK Stack)
```
## Technology Stack
| Layer | Technology | Purpose |
|-------|------------|---------|
| API Framework | FastAPI | High-performance async API |
| Excel Processing | openpyxl | Reading and writing .xlsx workbooks |
| Word Processing | python-docx | DOCX manipulation |
| PowerPoint Processing | python-pptx | PPTX handling |
| Translation | deep-translator | Multi-provider abstraction |
| Server | Uvicorn | ASGI server |
| Validation | Pydantic | Request/response validation |
## Extension Points
1. **Add Translation Provider**
- Implement `TranslationProvider` interface
- Register in `translation_service.py`
2. **Add Document Type**
- Create new translator class
- Register in routing logic
- Add to supported extensions
3. **Add MCP Server**
- Use provided `mcp_server_example.py`
- Configure in MCP settings
- Deploy alongside API
4. **Add Caching**
- Implement translation cache
- Use Redis or in-memory cache
- Reduce API calls for repeated text
5. **Add Queue System**
- Implement Celery/RQ workers
- Handle long-running translations
- Provide job status endpoints