# Document Translation API - Architecture Overview

## System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     FastAPI Application                     │
│                          (main.py)                          │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ├──> File Upload Endpoint (/translate)
                      │    ├─> File Validation
                      │    ├─> File Type Detection
                      │    └─> Route to Appropriate Translator
                      │
                      ├──> Batch Translation (/translate-batch)
                      │
                      └──> Utility Endpoints
                           ├─> /health
                           ├─> /languages
                           └─> /download/{filename}

┌─────────────────────────────────────────────────────────────┐
│                      Translation Layer                      │
└─────────────────────┬───────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
      Excel         Word       PowerPoint
   Translator    Translator    Translator
     (.xlsx)       (.docx)       (.pptx)
        │             │             │
        └─────────────┼─────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│               Translation Service Abstraction               │
│                     (Pluggable Backend)                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
     Google         DeepL    LibreTranslate
    Translate     (API Key)   (Self-hosted)
```

## Component Breakdown

### 1. API Layer (`main.py`)
- **FastAPI Application**: RESTful API endpoints
- **File Upload Handling**: Multipart form data processing
- **Request Validation**: Pydantic models for type safety
- **Error Handling**: Custom exception handlers
- **CORS Configuration**: Cross-origin resource sharing

### 2. Translation Coordinators

#### Excel Translator (`translators/excel_translator.py`)
```
Input: .xlsx file
Process:
  1. Load workbook with openpyxl (preserve VBA, formulas)
  2. Iterate through all worksheets
  3. For each cell:
     - Detect type (text, formula, number)
     - If text: translate
     - If formula: extract and translate strings
     - Preserve: formatting, colors, borders, merges
  4. Translate sheet names
  5. Maintain image positions
Output: Translated .xlsx with identical structure
```

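
The "If formula: extract and translate strings" step can be sketched with a regular expression over double-quoted string literals. This is a minimal sketch rather than the project's actual parser: `translate_formula_strings`, `translate_cell_value`, and the `translate` callback are hypothetical names, and the pattern assumes Excel's convention that a quote inside a string literal is written as two quotes.

```python
import re
from typing import Callable

# Matches a double-quoted Excel string literal; "" inside is an escaped quote.
STRING_LITERAL = re.compile(r'"((?:[^"]|"")*)"')


def translate_formula_strings(formula: str, translate: Callable[[str], str]) -> str:
    """Translate only the quoted string literals of a formula (hypothetical helper)."""

    def replace(match: re.Match) -> str:
        text = match.group(1).replace('""', '"')         # unescape for the translator
        if not text.strip():
            return match.group(0)                        # leave empty literals alone
        translated = translate(text).replace('"', '""')  # re-escape quotes
        return f'"{translated}"'

    return STRING_LITERAL.sub(replace, formula)


def translate_cell_value(value, translate: Callable[[str], str]):
    """Dispatch on cell content: formula, plain text, or anything else."""
    if isinstance(value, str) and value.startswith("="):
        return translate_formula_strings(value, translate)
    if isinstance(value, str) and value.strip():
        return translate(value)
    return value  # numbers, dates, None: returned unchanged
```

Function names, cell references, and single-quoted sheet names fall outside the pattern, so they pass through untouched. Note that if sheet names are also translated, formulas that reference those sheets would need the same renaming, which this sketch does not attempt.
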
#### Word Translator (`translators/word_translator.py`)
```
Input: .docx file
Process:
  1. Load document with python-docx
  2. Traverse document tree:
     - Paragraphs → Runs (preserve formatting per run)
     - Tables → Cells → Paragraphs
     - Headers/Footers (all section types)
  3. Translate text while preserving:
     - Font family, size, color
     - Bold, italic, underline
     - Lists (numbered/bulleted)
     - Styles (Heading 1, Normal, etc.)
  4. Images remain embedded via relationships
Output: Translated .docx with preserved layout
```

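
Run-level translation, as outlined above, might look like the following sketch. It relies only on the fact that python-docx paragraphs expose `runs` and that each run has a writable `text`; `translate_runs`, `translate_paragraphs`, and the `translate` callback are hypothetical names, and the code is duck-typed so it also runs on simple stand-in objects.

```python
from typing import Callable, Iterable


def translate_runs(runs: Iterable, translate: Callable[[str], str]) -> int:
    """Translate each run's text in place; return how many runs changed.

    Works on python-docx Run objects, or on anything with a writable .text.
    """
    changed = 0
    for run in runs:
        original = run.text
        if not original or not original.strip():
            continue                      # skip empty runs and pure whitespace
        # Keep leading/trailing whitespace so spacing between runs survives.
        core = original.strip()
        start = original.index(core)
        translated = translate(core)
        run.text = original[:start] + translated + original[start + len(core):]
        changed += 1
    return changed


def translate_paragraphs(paragraphs: Iterable, translate: Callable[[str], str]) -> int:
    """Apply translate_runs to every paragraph (body, table cell, header, ...)."""
    return sum(translate_runs(p.runs, translate) for p in paragraphs)
```

Assigning to `run.text` leaves the run's character formatting in place, which is why this approach preserves bold, italic, and font settings. The trade-off is that a sentence split across several runs is translated in fragments, so the backend sees less context.
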
#### PowerPoint Translator (`translators/pptx_translator.py`)
```
Input: .pptx file
Process:
  1. Load presentation with python-pptx
  2. For each slide:
     - Shapes → Text Frames → Paragraphs → Runs
     - Tables → Cells → Text Frames
     - Groups → Nested Shapes
     - Speaker Notes
  3. Preserve:
     - Slide layouts
     - Animations (timing, effects)
     - Transitions
     - Image positions and layering
     - Shape properties (size, position, rotation)
Output: Translated .pptx with identical design
```

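
The slide traversal above is naturally recursive, because a group shape contains further shapes. The sketch below assumes python-pptx attribute names (`shapes` on group shapes, `has_table`, `table.rows`, `row.cells`, `has_text_frame`, `text_frame.paragraphs`, `paragraph.runs`); the function names are hypothetical, and `getattr` defaults are used so that shapes lacking an attribute are simply skipped.

```python
from typing import Callable, Iterator


def iter_text_frames(shapes) -> Iterator:
    """Yield every text frame under `shapes`, descending into groups and tables."""
    for shape in shapes:
        nested = getattr(shape, "shapes", None)
        if nested is not None:                      # group shape: recurse
            yield from iter_text_frames(nested)
            continue
        if getattr(shape, "has_table", False):      # table: one frame per cell
            for row in shape.table.rows:
                for cell in row.cells:
                    yield cell.text_frame
        elif getattr(shape, "has_text_frame", False):
            yield shape.text_frame


def translate_shapes(shapes, translate: Callable[[str], str]) -> int:
    """Translate run by run so per-run formatting is untouched; return run count."""
    count = 0
    for frame in iter_text_frames(shapes):
        for paragraph in frame.paragraphs:
            for run in paragraph.runs:
                if run.text.strip():
                    run.text = translate(run.text)
                    count += 1
    return count
```

With python-pptx this would be called once per slide as `translate_shapes(slide.shapes, translate)`. Speaker notes are not shapes on the slide and would need a separate pass over the notes text frame.
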
### 3. Translation Service Layer

**Abstract Interface**: `TranslationProvider`
- Allows swapping translation backends without changing translators
- Configurable via environment variables

**Implementations**:
1. **Google Translator** (Default, Free)
   - Uses deep-translator library
   - No API key required
   - Rate limited

2. **DeepL** (Premium, API Key Required)
   - Higher quality translations
   - Better context understanding
   - Requires a DeepL API key (Free or Pro plan)

3. **LibreTranslate** (Self-hosted)
   - Open-source alternative
   - Full control and privacy
   - Requires local installation

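
A pluggable backend of this kind is commonly built from an abstract base class plus a registry keyed by name. The sketch below is one possible shape, not the project's code: only the name `TranslationProvider` comes from this document, while the method signature, `EchoProvider`, `PROVIDERS`, `get_provider`, and the `TRANSLATION_PROVIDER` variable are assumptions made for illustration.

```python
import os
from abc import ABC, abstractmethod
from typing import Callable, Dict


class TranslationProvider(ABC):
    """Backend-agnostic interface the document translators depend on."""

    @abstractmethod
    def translate(self, text: str, target_lang: str, source_lang: str = "auto") -> str:
        ...


class EchoProvider(TranslationProvider):
    """Stand-in backend for tests: tags text instead of calling a service."""

    def translate(self, text: str, target_lang: str, source_lang: str = "auto") -> str:
        return f"[{target_lang}] {text}"


# Registry of provider factories; real backends would be registered the same way.
PROVIDERS: Dict[str, Callable[[], TranslationProvider]] = {"echo": EchoProvider}


def get_provider(name=None) -> TranslationProvider:
    """Pick a backend by name, falling back to the TRANSLATION_PROVIDER variable."""
    key = (name or os.environ.get("TRANSLATION_PROVIDER", "echo")).lower()
    try:
        return PROVIDERS[key]()
    except KeyError:
        raise ValueError(f"Unknown translation provider: {key!r}") from None
```

Because translators only call `translate`, adding a backend means writing one subclass and one registry entry; no translator code changes.
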
### 4. Utility Layer

#### File Handler (`utils/file_handler.py`)
- File validation (size, type)
- Unique filename generation (UUID-based)
- Safe file operations
- Cleanup management

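
Validation and UUID-based naming can be sketched with the standard library alone. The helper names and the 10 MiB limit below are assumptions; the allowed extensions are the three formats this document lists.

```python
import uuid
from pathlib import Path

ALLOWED_EXTENSIONS = {".xlsx", ".docx", ".pptx"}   # the three supported formats
MAX_FILE_SIZE = 10 * 1024 * 1024                   # assumed limit: 10 MiB


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the normalized extension, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0 or size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"File size {size_bytes} outside 1..{MAX_FILE_SIZE} bytes")
    return ext


def unique_filename(filename: str) -> str:
    """Build '{unique_id}_{filename}' from the base name only."""
    safe_name = Path(filename.replace("\\", "/")).name   # drop any directory part
    return f"{uuid.uuid4().hex}_{safe_name}"
```

Using only the base name of the client-supplied filename keeps directory components out of the storage path, and the random prefix prevents two uploads with the same name from colliding.
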
#### Exception Handling (`utils/exceptions.py`)
- Custom exception types
- HTTP status code mapping
- User-friendly error messages

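
One way to tie exception types to HTTP status codes is to store the code on the exception class itself. The class names, status codes, and `to_response` helper below are illustrative assumptions, not the contents of `utils/exceptions.py`.

```python
class TranslationAPIError(Exception):
    """Base class: carries an HTTP status code and a user-facing message."""
    status_code = 500
    message = "Internal server error"

    def __init__(self, message=None):
        super().__init__(message or self.message)
        if message:
            self.message = message


class UnsupportedFileTypeError(TranslationAPIError):
    status_code = 415
    message = "Unsupported file type"


class FileTooLargeError(TranslationAPIError):
    status_code = 413
    message = "File exceeds the size limit"


class TranslationFailedError(TranslationAPIError):
    status_code = 502
    message = "The translation backend failed"


def to_response(exc: Exception):
    """Map any exception to (status_code, JSON-serializable body)."""
    if isinstance(exc, TranslationAPIError):
        return exc.status_code, {"error": type(exc).__name__, "detail": exc.message}
    # Unknown errors: do not leak internals to the client.
    return 500, {"error": "InternalError", "detail": "Internal server error"}
```

A single exception handler registered on the application could then call `to_response` and return the pair as a JSON response.
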
### 5. Configuration (`config.py`)
- Environment variable loading
- Directory management
- Service configuration
- Validation rules

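
The four responsibilities above fit in a small settings object. In this sketch the variable names (`UPLOAD_DIR`, `OUTPUT_DIR`, `MAX_FILE_SIZE_MB`, `TRANSLATION_PROVIDER`) and the defaults are assumptions; only the `uploads` and `outputs` directory names appear elsewhere in this document.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    upload_dir: Path
    output_dir: Path
    max_file_size_mb: int
    provider: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables, with assumed defaults."""
    max_mb = int(env.get("MAX_FILE_SIZE_MB", "10"))
    if max_mb <= 0:
        raise ValueError("MAX_FILE_SIZE_MB must be positive")
    return Settings(
        upload_dir=Path(env.get("UPLOAD_DIR", "uploads")),
        output_dir=Path(env.get("OUTPUT_DIR", "outputs")),
        max_file_size_mb=max_mb,
        provider=env.get("TRANSLATION_PROVIDER", "google").lower(),
    )


def ensure_directories(settings: Settings) -> None:
    """Create the upload and output directories if they are missing."""
    for directory in (settings.upload_dir, settings.output_dir):
        directory.mkdir(parents=True, exist_ok=True)
```

Passing the environment in as a mapping keeps `load_settings` easy to test without touching the real process environment.
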
## Data Flow

### Single Document Translation
```
1. Client uploads file via POST /translate
   └─> File + target_language + source_language

2. API validates request
   ├─> Check file extension
   ├─> Verify file size
   └─> Validate language codes

3. Save to temporary storage
   └─> uploads/{unique_id}_{filename}

4. Route to appropriate translator
   ├─> .xlsx → ExcelTranslator
   ├─> .docx → WordTranslator
   └─> .pptx → PowerPointTranslator

5. Translator processes document
   ├─> Parse structure
   ├─> Extract text elements
   ├─> Call translation service for each text
   ├─> Apply translations while preserving formatting
   └─> Save to outputs/{unique_id}_translated_{filename}

6. Return translated file
   └─> FileResponse with download headers

7. Cleanup (optional)
   └─> Delete uploaded file
```

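
Steps 4 and 5 of the flow above reduce to a lookup by file extension and a naming rule for the output file. In this sketch the translator names are plain strings standing in for the classes, and `select_translator` and `output_path` are hypothetical helpers.

```python
from pathlib import Path

# Extension → translator name, mirroring step 4 of the flow.
TRANSLATORS = {
    ".xlsx": "ExcelTranslator",
    ".docx": "WordTranslator",
    ".pptx": "PowerPointTranslator",
}


def select_translator(filename: str) -> str:
    """Return the translator responsible for a file, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    try:
        return TRANSLATORS[ext]
    except KeyError:
        raise ValueError(f"No translator registered for {ext or 'extensionless files'}") from None


def output_path(unique_id: str, filename: str) -> Path:
    """Build outputs/{unique_id}_translated_{filename}, as in step 5."""
    return Path("outputs") / f"{unique_id}_translated_{Path(filename).name}"
```

Keeping the mapping in one dictionary means a new document type needs a single new entry rather than another branch in the routing code.
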
## Formatting Preservation Strategies

### Excel
- **Cell Properties**: Copied before translation
- **Merged Cells**: Detected via `worksheet.merged_cells.ranges`
- **Formulas**: Regex parsing to extract strings
- **Images**: Anchored to cells, preserved via relationships
- **Charts**: Remain linked to data ranges

### Word
- **Run-level Translation**: Preserves inline formatting
- **Style Inheritance**: Paragraph styles maintained
- **Tables**: Structure preserved, cells translated individually
- **Images**: Embedded via relationships, not modified
- **Headers/Footers**: Treated as separate sections

### PowerPoint
- **Shape Hierarchy**: Recursive traversal
- **Text Frames**: Paragraph and run-level translation
- **Layouts**: Template references preserved
- **Animations**: Stored separately, not affected
- **Media**: File references remain intact

## Scalability Considerations

### Horizontal Scaling
- Stateless design (no session storage)
- Files stored on local disk (move to S3/Azure Blob so that all instances share them)
- Load balancer compatible

### Performance Optimization
- **Async I/O**: FastAPI's async capabilities
- **Batch Processing**: Multiple files in parallel
- **Caching**: Translation cache for repeated text
- **Streaming**: Large file chunking (future enhancement)

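
A translation cache for repeated text can be as small as a dictionary keyed by text and target language. The sketch below is an in-memory version under that assumption; `CachedTranslator` is a hypothetical name, and a shared store such as Redis would be needed for the cache to be useful across several instances.

```python
from typing import Callable, Dict, Tuple


class CachedTranslator:
    """Wrap a translate(text, target_lang) callable with an in-memory cache."""

    def __init__(self, translate: Callable[[str, str], str]):
        self._translate = translate
        self._cache: Dict[Tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, text: str, target_lang: str) -> str:
        key = (text, target_lang)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._cache[key] = self._translate(text, target_lang)
        return result
```

Spreadsheets and slide decks tend to repeat labels such as column headers, so even this simple cache can remove many backend calls within a single document.
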
### Resource Management
- **File Cleanup**: Automatic deletion after translation
- **Size Limits**: Configurable max file size
- **Rate Limiting**: Prevent API abuse (not yet implemented)
- **Queue System**: Redis-based job queue (future)

## Future MCP Integration

### MCP Server Wrapper
The API is designed to be wrapped as an MCP server:

```python
# Planned MCP tools (signatures only; parameter and return types are illustrative)

def translate_document(file_path: str, target_lang: str) -> str:
    """Translate a document and return the path of the translated file."""

def get_supported_languages() -> list[str]:
    """Return the list of supported languages."""

def check_api_health() -> dict:
    """Return the API health status."""
```

**Benefits**:
- AI assistants can translate documents through tool calls
- Integration with Claude, GPT, and other LLMs
- Workflow automation in AI pipelines

## Security Architecture

### Input Validation
- File type whitelist
- Size restrictions
- Extension verification
- Content-type checking

### File Isolation
- Unique filenames (UUID)
- Temporary storage
- Automatic cleanup
- No path traversal

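
The "no path traversal" guarantee matters most for `/download/{filename}`, where the filename comes from the client. A common check, sketched here with a hypothetical `resolve_download` helper, resolves the requested path and confirms that it still lies inside the storage directory.

```python
from pathlib import Path


def resolve_download(filename: str, base_dir: Path) -> Path:
    """Resolve a requested filename inside base_dir, rejecting traversal."""
    base = base_dir.resolve()
    candidate = (base / filename).resolve()
    # After resolving '..' and symlinks, the path must still be under base.
    if candidate == base or base not in candidate.parents:
        raise ValueError(f"Illegal download path: {filename!r}")
    return candidate
```

Comparing resolved paths catches `..` segments and absolute paths alike, which a plain string check on the filename can miss.
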
### API Security (Production)
- Rate limiting (not yet implemented)
- Authentication/Authorization (future)
- HTTPS/TLS encryption (deployment config)
- Input sanitization

## Deployment Architecture

### Development
```
Local Machine
├─> Python 3.11+
├─> Virtual Environment
├─> SQLite (if needed for tracking)
└─> Local file storage
```

### Production (Recommended)
```
Cloud Platform (AWS/Azure/GCP)
├─> Container (Docker)
├─> Load Balancer
├─> Multiple API Instances
├─> Object Storage (S3/Blob)
├─> Redis (caching/queue)
├─> Monitoring (Prometheus/Grafana)
└─> Logging (ELK Stack)
```

## Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| API Framework | FastAPI | High-performance async API |
| Excel Processing | openpyxl | Reading and writing .xlsx workbooks (cells, styles, formulas) |
| Word Processing | python-docx | DOCX manipulation |
| PowerPoint Processing | python-pptx | PPTX handling |
| Translation | deep-translator | Multi-provider abstraction |
| Server | Uvicorn | ASGI server |
| Validation | Pydantic | Request/response validation |

## Extension Points

1. **Add Translation Provider**
   - Implement `TranslationProvider` interface
   - Register in `translation_service.py`

2. **Add Document Type**
   - Create new translator class
   - Register in routing logic
   - Add to supported extensions

3. **Add MCP Server**
   - Use provided `mcp_server_example.py`
   - Configure in MCP settings
   - Deploy alongside API

4. **Add Caching**
   - Implement translation cache
   - Use Redis or in-memory cache
   - Reduce API calls for repeated text

5. **Add Queue System**
   - Implement Celery/RQ workers
   - Handle long-running translations
   - Provide job status endpoints