# 📄 Document Translation API

A powerful SaaS-ready Python API for translating complex structured documents (Excel, Word, PowerPoint) while **strictly preserving** the original formatting, layout, and embedded media.

## ✨ Features

### 🔄 Multiple Translation Providers

| Provider | Type | Description |
|----------|------|-------------|
| **Google Translate** | Cloud | Free, fast, reliable |
| **Ollama** | Local LLM | Privacy-focused, customizable with system prompts |
| **WebLLM** | Browser | Runs entirely in browser using WebGPU |
| **DeepL** | Cloud | High-quality translations (API key required) |
| **LibreTranslate** | Self-hosted | Open-source alternative |
| **OpenAI** | Cloud | GPT-4o/4o-mini with vision support |

### 📊 Excel Translation (.xlsx)

- ✅ Translates all cell content and sheet names
- ✅ Preserves cell merging, formulas, and styles
- ✅ Maintains font styles, colors, and borders
- ✅ Image text extraction with vision models
- ✅ Adds translated image text as comments

### 📝 Word Translation (.docx)

- ✅ Translates body text, headers, footers, and tables
- ✅ Preserves heading styles and paragraph formatting
- ✅ Maintains lists, images, charts, and SmartArt
- ✅ Image text extraction and translation

### 📽️ PowerPoint Translation (.pptx)

- ✅ Translates slide titles, body text, and speaker notes
- ✅ Preserves slide layouts, transitions, and animations
- ✅ Image text extraction with text boxes added below images
- ✅ Keeps layering order and positions

### 🧠 LLM Features (Ollama/WebLLM/OpenAI)

- ✅ **Custom System Prompts**: Provide context for better translations
- ✅ **Technical Glossary**: Define term mappings (e.g., `batterie=coil`)
- ✅ **Presets**: HVAC, IT, Legal, Medical terminology
- ✅ **Vision Models**: Translate text within images (gemma3, qwen3-vl, gpt-4o)

### 🏢 SaaS-Ready Features

- 🚦 **Rate Limiting**: Per-client IP with token bucket and sliding window algorithms
- 🔒 **Security Headers**: CSP, XSS protection, HSTS support
- 🧹 **Auto Cleanup**: Automatic file cleanup with TTL tracking
- 📊 **Monitoring**: Health checks, metrics, and system status
- 🔐 **Admin Dashboard**: Secure admin panel with authentication
- 📝 **Request Logging**: Structured logging with unique request IDs

## 🚀 Quick Start

### Installation

```powershell
# Clone the repository
git clone https://gitea.parsanet.org/sepehr/office_translator.git
cd office_translator

# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

# Run the API
python main.py
```

The API starts on `http://localhost:8000`

### Frontend Setup

```powershell
cd frontend
npm install
npm run dev
```

Frontend runs on `http://localhost:3000`

## 📚 API Documentation

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

## 🔧 API Endpoints

### Translation

#### POST /translate
Translate a document with full customization.

```bash
curl -X POST "http://localhost:8000/translate" \
  -F "file=@document.xlsx" \
  -F "target_language=en" \
  -F "provider=ollama" \
  -F "ollama_model=gemma3:12b" \
  -F "translate_images=true" \
  -F "system_prompt=You are translating HVAC documents."
```

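
The same request can be sent from Python. The sketch below uses only the standard library and reuses the form field names from the curl example; the helper names (`build_multipart`, `translate_document`) are hypothetical, and it assumes the endpoint returns the translated file as the raw response body, which the README does not state.

```python
import io
import os
import urllib.request
import uuid


def build_multipart(fields: dict, file_name: str, file_bytes: bytes) -> tuple:
    """Encode text fields plus one file part as multipart/form-data.

    Returns (body, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(value.encode("utf-8") + b"\r\n")
    # The file part uses the field name "file", as in the curl example.
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(file_bytes + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def translate_document(path: str, base_url: str = "http://localhost:8000") -> bytes:
    """POST a document to /translate and return the raw response body (assumed)."""
    fields = {
        "target_language": "en",
        "provider": "ollama",
        "ollama_model": "gemma3:12b",
        "translate_images": "true",
        "system_prompt": "You are translating HVAC documents.",
    }
    with open(path, "rb") as fh:
        body, content_type = build_multipart(fields, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/translate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `translate_document("document.xlsx")` requires the API to be running locally; `build_multipart` can be exercised on its own without a server.
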
### Monitoring

#### GET /health
Comprehensive health check with system status.

```json
{
  "status": "healthy",
  "translation_service": "google",
  "memory": {"system_percent": 34.1, "system_available_gb": 61.7},
  "disk": {"total_files": 0, "total_size_mb": 0},
  "cleanup_service": {"is_running": true}
}
```

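
A monitoring script could evaluate this payload as follows. This is a minimal sketch: `health_problems` is a hypothetical helper, the field names come from the sample response above, and the 80% threshold mirrors the `MAX_MEMORY_PERCENT=80` value from the example `.env` (whether the server applies that limit to `/health` itself is an assumption).

```python
import json


def health_problems(payload: dict, max_memory_percent: float = 80.0) -> list:
    """Return human-readable problems found in a /health payload (empty if none)."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")
    memory = payload.get("memory", {}).get("system_percent")
    if memory is not None and memory > max_memory_percent:
        problems.append(f"memory usage {memory}% exceeds {max_memory_percent}%")
    if not payload.get("cleanup_service", {}).get("is_running", False):
        problems.append("cleanup service is not running")
    return problems


# The sample response from this README, parsed as JSON.
sample = json.loads("""
{
  "status": "healthy",
  "translation_service": "google",
  "memory": {"system_percent": 34.1, "system_available_gb": 61.7},
  "disk": {"total_files": 0, "total_size_mb": 0},
  "cleanup_service": {"is_running": true}
}
""")
print(health_problems(sample))  # prints []
```
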
#### GET /metrics
System metrics and statistics.

#### GET /rate-limit/status
Current rate limit status for the requesting client.

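
The feature list names a token bucket as one of the per-client rate limiting algorithms. The sketch below illustrates the general technique only; it is not the code in `middleware/rate_limiting.py`. The `TokenBucket` class and `allow_request` function are hypothetical, and the default of 30 requests per minute is taken from `RATE_LIMIT_PER_MINUTE` in the example `.env`.

```python
import time


class TokenBucket:
    """Bucket that holds up to `capacity` tokens and refills at a constant rate."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity          # start full so short bursts are allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per client IP (in-memory; a multi-process deployment would need shared state).
buckets = {}


def allow_request(client_ip: str, per_minute: int = 30) -> bool:
    """Return True if this client may proceed, False if it should receive HTTP 429."""
    if client_ip not in buckets:
        buckets[client_ip] = TokenBucket(capacity=per_minute, refill_per_second=per_minute / 60)
    return buckets[client_ip].allow()
```

A full bucket lets a client burst up to `capacity` requests at once, after which requests are admitted at the refill rate; a sliding window limiter, the other algorithm mentioned, instead counts requests inside a moving time interval.
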
### Admin Endpoints (Authentication Required)

#### POST /admin/login
Login to admin dashboard.

```bash
curl -X POST "http://localhost:8000/admin/login" \
  -F "username=admin" \
  -F "password=your_password"
```

Response:
```json
{
  "status": "success",
  "token": "your_bearer_token",
  "expires_in": 86400
}
```

#### GET /admin/dashboard
Get comprehensive dashboard data (requires Bearer token).

```bash
curl "http://localhost:8000/admin/dashboard" \
  -H "Authorization: Bearer your_token"
```

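
In Python, the token from the login response can be carried into the dashboard request like this. The sketch is illustrative: `dashboard_request` is a hypothetical helper, and it only builds the request object so it can be inspected without a running server.

```python
import json
import urllib.request


def dashboard_request(login_body: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the GET /admin/dashboard request from a /admin/login JSON response."""
    login = json.loads(login_body)
    if login.get("status") != "success":
        raise ValueError("login did not succeed")
    return urllib.request.Request(
        f"{base_url}/admin/dashboard",
        headers={"Authorization": f"Bearer {login['token']}"},
    )


# The login response shown above, as a JSON string.
login_body = '{"status": "success", "token": "your_bearer_token", "expires_in": 86400}'
request = dashboard_request(login_body)
# urllib.request.urlopen(request) would send it to a running API instance.
```
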
#### POST /admin/cleanup/trigger
Manually trigger file cleanup.

#### GET /admin/files/tracked
List currently tracked files.

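
To make the TTL idea concrete, here is a rough sketch of age-based cleanup. It is an assumption-laden illustration, not the service's implementation: the real cleanup service tracks files itself, whereas this version simply compares each file's modification time against a TTL. The function names are hypothetical, and the 30-minute TTL matches `INPUT_FILE_TTL_MINUTES` in the example `.env`.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def expired_files(directory: Path, ttl_minutes: float, now: Optional[float] = None) -> list:
    """Return files in `directory` whose modification time is older than the TTL."""
    now = time.time() if now is None else now
    cutoff = now - ttl_minutes * 60
    return sorted(p for p in directory.iterdir() if p.is_file() and p.stat().st_mtime < cutoff)


def cleanup(directory: Path, ttl_minutes: float, now: Optional[float] = None) -> int:
    """Delete expired files and return how many were removed."""
    removed = 0
    for path in expired_files(directory, ttl_minutes, now):
        path.unlink(missing_ok=True)
        removed += 1
    return removed


# Demo in a throwaway directory: one file aged past the TTL, one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    uploads = Path(tmp)
    old_file = uploads / "old.xlsx"
    new_file = uploads / "new.xlsx"
    old_file.write_bytes(b"x")
    new_file.write_bytes(b"y")
    current = time.time()
    os.utime(old_file, (current - 3600, current - 3600))  # pretend it is an hour old
    expired_names = [p.name for p in expired_files(uploads, 30, current)]
    removed_count = cleanup(uploads, 30, current)
    remaining_names = sorted(p.name for p in uploads.iterdir())
```
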
## 🌐 Supported Languages

| Code | Language | Code | Language |
|------|----------|------|----------|
| en | English | fr | French |
| fa | Persian/Farsi | es | Spanish |
| de | German | it | Italian |
| pt | Portuguese | ru | Russian |
| zh | Chinese | ja | Japanese |
| ko | Korean | ar | Arabic |

## ⚙️ Configuration

### Environment Variables (.env)

```env
# ============== Translation Services ==============
TRANSLATION_SERVICE=google
DEEPL_API_KEY=your_deepl_api_key_here

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3
OLLAMA_VISION_MODEL=llava

# ============== File Limits ==============
MAX_FILE_SIZE_MB=50

# ============== Rate Limiting (SaaS) ==============
RATE_LIMIT_ENABLED=true
RATE_LIMIT_PER_MINUTE=30
RATE_LIMIT_PER_HOUR=200
TRANSLATIONS_PER_MINUTE=10
TRANSLATIONS_PER_HOUR=50
MAX_CONCURRENT_TRANSLATIONS=5

# ============== Cleanup Service ==============
CLEANUP_ENABLED=true
CLEANUP_INTERVAL_MINUTES=15
FILE_TTL_MINUTES=60
INPUT_FILE_TTL_MINUTES=30
OUTPUT_FILE_TTL_MINUTES=120

# ============== Security ==============
ENABLE_HSTS=false
CORS_ORIGINS=*

# ============== Admin Authentication ==============
ADMIN_USERNAME=admin
# Change in production!
ADMIN_PASSWORD=changeme123
# Or use a SHA256 hash:
# ADMIN_PASSWORD_HASH=your_sha256_hash

# ============== Monitoring ==============
LOG_LEVEL=INFO
ENABLE_REQUEST_LOGGING=true
MAX_MEMORY_PERCENT=80
```

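
A value for `ADMIN_PASSWORD_HASH` can be generated with Python's `hashlib`. This sketch assumes the server expects the plain, unsalted SHA-256 hex digest of the UTF-8 password; the README only says "SHA256 hash", so check `config.py` for the exact format. The helper names and the sample passphrase are placeholders.

```python
import hashlib
import hmac


def sha256_hex(password: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 encoded password."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def password_matches(candidate: str, stored_hash: str) -> bool:
    """Compare in constant time to avoid leaking how many characters matched."""
    return hmac.compare_digest(sha256_hex(candidate), stored_hash.lower())


# Placeholder passphrase; substitute your own before copying the output into .env.
stored = sha256_hex("a-long-random-passphrase")
print(f"ADMIN_PASSWORD_HASH={stored}")
```
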
### Ollama Setup

```bash
# Install Ollama (Windows)
winget install Ollama.Ollama

# Pull a model
ollama pull llama3.2

# For vision/image translation
ollama pull gemma3:12b
# or
ollama pull qwen3-vl:8b
```

## 🎯 Using System Prompts & Glossary

### Example: HVAC Translation

**System Prompt:**
```
You are translating HVAC technical documents.
Use precise technical terminology.
Keep unit measurements (kW, m³/h, Pa) unchanged.
```

**Glossary:**
```
batterie=coil
groupe froid=chiller
CTA=AHU (Air Handling Unit)
échangeur=heat exchanger
vanne 3 voies=3-way valve
```

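
The glossary uses one `source=target` pair per line. As a rough illustration of how such text could be parsed and folded into a system prompt, consider the sketch below. Both functions are hypothetical, and the way the API actually combines the glossary with the prompt is not documented here, so the prompt wording is an assumption.

```python
def parse_glossary(text: str) -> dict:
    """Parse `source=target` lines into a dict, skipping blank or malformed lines."""
    glossary = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first "=" only, so targets may contain "=" themselves.
        source, target = line.split("=", 1)
        glossary[source.strip()] = target.strip()
    return glossary


def build_system_prompt(base_prompt: str, glossary: dict) -> str:
    """Append the glossary to the base prompt as a list of term mappings."""
    if not glossary:
        return base_prompt
    terms = "\n".join(f"- {source} -> {target}" for source, target in glossary.items())
    return f"{base_prompt}\n\nUse these term translations:\n{terms}"


glossary_text = """\
batterie=coil
groupe froid=chiller
CTA=AHU (Air Handling Unit)
échangeur=heat exchanger
vanne 3 voies=3-way valve
"""
glossary = parse_glossary(glossary_text)
prompt = build_system_prompt("You are translating HVAC technical documents.", glossary)
print(prompt)
```
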
### Presets Available
- 🔧 **HVAC**: Heating, Ventilation, Air Conditioning
- 💻 **IT**: Software and technology
- ⚖️ **Legal**: Legal documents
- 🏥 **Medical**: Healthcare terminology

## 🔐 Admin Dashboard

Access the admin dashboard at `/admin` in the frontend. Features:

- **System Status**: Health, uptime, and issues
- **Memory & Disk Monitoring**: Real-time usage stats
- **Translation Statistics**: Total translations, success rate
- **Rate Limit Management**: View active clients and limits
- **Cleanup Service**: Monitor and trigger manual cleanup

### Default Credentials
- **Username**: admin
- **Password**: changeme123

⚠️ **Change the default password in production!**

## 🏗️ Project Structure

```
Translate/
├── main.py                     # FastAPI application with SaaS features
├── config.py                   # Configuration with SaaS settings
├── requirements.txt            # Dependencies
├── mcp_server.py               # MCP server implementation
├── middleware/                 # SaaS middleware
│   ├── __init__.py
│   ├── rate_limiting.py        # Rate limiting with token bucket
│   ├── validation.py           # Input validation
│   ├── security.py             # Security headers & logging
│   └── cleanup.py              # Auto cleanup service
├── services/
│   └── translation_service.py  # Translation providers
├── translators/
│   ├── excel_translator.py     # Excel with image support
│   ├── word_translator.py      # Word with image support
│   └── pptx_translator.py      # PowerPoint with image support
├── frontend/                   # Next.js frontend
│   ├── src/
│   │   ├── app/
│   │   │   ├── page.tsx        # Main translation page
│   │   │   ├── admin/          # Admin dashboard
│   │   │   └── settings/       # Settings pages
│   │   └── components/
│   └── package.json
├── static/
│   └── webllm.html             # WebLLM standalone interface
├── uploads/                    # Temporary uploads (auto-cleaned)
└── outputs/                    # Translated files (auto-cleaned)
```

## 🛠️ Tech Stack

### Backend
- **FastAPI**: Modern async web framework
- **openpyxl**: Excel manipulation
- **python-docx**: Word documents
- **python-pptx**: PowerPoint presentations
- **deep-translator**: Google/DeepL/Libre translation
- **psutil**: System monitoring
- **python-magic**: File type validation

### Frontend
- **Next.js 15**: React framework
- **Tailwind CSS**: Styling
- **Lucide Icons**: Icon library
- **WebLLM**: Browser-based LLM

## 🔌 MCP Integration

This API can be used as an MCP (Model Context Protocol) server for AI assistants.

### VS Code Configuration

Add to your VS Code `settings.json` or `.vscode/mcp.json`, replacing `D:/Translate` with the path to your local checkout:

```json
{
  "servers": {
    "document-translator": {
      "type": "stdio",
      "command": "python",
      "args": ["mcp_server.py"],
      "cwd": "D:/Translate",
      "env": {
        "PYTHONPATH": "D:/Translate"
      }
    }
  }
}
```

## 🚀 Production Deployment

### Security Checklist
- [ ] Change `ADMIN_PASSWORD` or set `ADMIN_PASSWORD_HASH`
- [ ] Set `CORS_ORIGINS` to your frontend domain
- [ ] Set `ENABLE_HSTS=true` if using HTTPS
- [ ] Configure rate limits appropriately
- [ ] Set up log rotation for the `logs/` directory
- [ ] Use a reverse proxy (nginx/traefik) for HTTPS

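### Docker Deployment (Coming Soon)

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# python-magic needs the libmagic shared library, which the slim image does not include
RUN apt-get update && apt-get install -y --no-install-recommends libmagic1 \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
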
## 📝 License

MIT License

## 🤝 Contributing

Contributions welcome! Please submit a Pull Request.

---

**Built with ❤️ using Python, FastAPI, Next.js, and Ollama**