Files
office_translator/README.md
Sepehr Ramezani 26bd096a06 feat: production deployment - full update with providers, admin, glossaries, pricing, tests
Major changes across backend, frontend, infrastructure:
- Provider system with model selection (Google, DeepL, OpenAI, Ollama, Google Cloud)
- Admin panel: user management, pricing, settings
- Glossary system with CSV import/export
- Subscription and tier quota management
- Security hardening (rate limiting, API key auth, path traversal fixes)
- Docker compose for dev, prod, and IONOS deployment
- Alembic migrations for new tables
- Frontend: dashboard, pricing page, landing page, i18n (en/fr)
- Test suite and verification scripts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-25 15:01:47 +02:00

390 lines
11 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 📄 Document Translation API
A powerful SaaS-ready Python API for translating complex structured documents (Excel, Word, PowerPoint) while **strictly preserving** the original formatting, layout, and embedded media.
## ✨ Features
### 🔄 Multiple Translation Providers
| Provider | Type | Description |
|----------|------|-------------|
| **Google Translate** | Cloud | Free, fast, reliable |
| **Ollama** | Local LLM | Privacy-focused, customizable with system prompts |
| **WebLLM** | Browser | Runs entirely in browser using WebGPU |
| **DeepL** | Cloud | High-quality translations (API key required) |
| **LibreTranslate** | Self-hosted | Open-source alternative |
| **OpenAI** | Cloud | GPT-4o/4o-mini with vision support |
### 📊 Excel Translation (.xlsx)
- ✅ Translates all cell content and sheet names
- ✅ Preserves cell merging, formulas, and styles
- ✅ Maintains font styles, colors, and borders
- ✅ Image text extraction with vision models
- ✅ Adds translated image text as comments
### 📝 Word Translation (.docx)
- ✅ Translates body text, headers, footers, and tables
- ✅ Preserves heading styles and paragraph formatting
- ✅ Maintains lists, images, charts, and SmartArt
- ✅ Image text extraction and translation
### 📽️ PowerPoint Translation (.pptx)
- ✅ Translates slide titles, body text, and speaker notes
- ✅ Preserves slide layouts, transitions, and animations
- ✅ Image text extraction with text boxes added below images
- ✅ Keeps layering order and positions
### 🧠 LLM Features (Ollama/WebLLM/OpenAI)
-**Custom System Prompts**: Provide context for better translations
-**Technical Glossary**: Define term mappings (e.g., `batterie=coil`)
-**Presets**: HVAC, IT, Legal, Medical terminology
-**Vision Models**: Translate text within images (gemma3, qwen3-vl, gpt-4o)
### 🏢 SaaS-Ready Features
- 🚦 **Rate Limiting**: Per-client IP with token bucket and sliding window algorithms
- 🔒 **Security Headers**: CSP, XSS protection, HSTS support
- 🧹 **Auto Cleanup**: Automatic file cleanup with TTL tracking
- 📊 **Monitoring**: Health checks, metrics, and system status
- 🔐 **Admin Dashboard**: Secure admin panel with authentication
- 📝 **Request Logging**: Structured logging with unique request IDs
## 🚀 Quick Start
### Installation
```powershell
# Clone the repository
git clone https://gitea.parsanet.org/sepehr/office_translator.git
cd office_translator
# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Run the API
python main.py
```
The API starts on `http://localhost:8000`
### Frontend Setup
```powershell
cd frontend
npm install
npm run dev
```
Frontend runs on `http://localhost:3000`
## 📚 API Documentation
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## 🔧 API Endpoints
### Translation
#### POST /translate
Translate a document with full customization.
```bash
curl -X POST "http://localhost:8000/translate" \
-F "file=@document.xlsx" \
-F "target_language=en" \
-F "provider=ollama" \
-F "ollama_model=gemma3:12b" \
-F "translate_images=true" \
-F "system_prompt=You are translating HVAC documents."
```
### Monitoring
#### GET /health
Comprehensive health check with system status.
```json
{
"status": "healthy",
"translation_service": "google",
"memory": {"system_percent": 34.1, "system_available_gb": 61.7},
"disk": {"total_files": 0, "total_size_mb": 0},
"cleanup_service": {"is_running": true}
}
```
#### GET /metrics
System metrics and statistics.
#### GET /rate-limit/status
Current rate limit status for the requesting client.
### Admin Endpoints (Authentication Required)
#### POST /admin/login
Login to admin dashboard.
```bash
curl -X POST "http://localhost:8000/admin/login" \
-F "username=admin" \
-F "password=your_password"
```
Response:
```json
{
"status": "success",
"token": "your_bearer_token",
"expires_in": 86400
}
```
#### GET /admin/dashboard
Get comprehensive dashboard data (requires Bearer token).
```bash
curl "http://localhost:8000/admin/dashboard" \
-H "Authorization: Bearer your_token"
```
#### POST /admin/cleanup/trigger
Manually trigger file cleanup.
#### GET /admin/files/tracked
List currently tracked files.
## 🌐 Supported Languages
| Code | Language | Code | Language |
|------|----------|------|----------|
| en | English | fr | French |
| fa | Persian/Farsi | es | Spanish |
| de | German | it | Italian |
| pt | Portuguese | ru | Russian |
| zh | Chinese | ja | Japanese |
| ko | Korean | ar | Arabic |
## ⚙️ Configuration
### Environment Variables (.env)
1. **Copy** `.env.example` to `.env`: `cp .env.example .env`
2. **Fill** required variables (see comments in `.env.example`: Required vs Optional).
3. In **production** (`ENV=production`), missing required vars (e.g. `JWT_SECRET_KEY`, `ADMIN_USERNAME`, `ADMIN_PASSWORD` or `ADMIN_PASSWORD_HASH`, `ADMIN_TOKEN_SECRET`, `DATABASE_URL`, and `REDIS_URL` if rate limiting is on) cause the app to **fail at startup** with a clear message listing them (Story 6.6, NFR10).
```env
# ============== Translation Services ==============
TRANSLATION_SERVICE=google
DEEPL_API_KEY=your_deepl_api_key_here
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3
OLLAMA_VISION_MODEL=llava
# ============== File Limits ==============
MAX_FILE_SIZE_MB=50
# ============== Rate Limiting (SaaS) ==============
RATE_LIMIT_ENABLED=true
RATE_LIMIT_PER_MINUTE=30
RATE_LIMIT_PER_HOUR=200
TRANSLATIONS_PER_MINUTE=10
TRANSLATIONS_PER_HOUR=50
MAX_CONCURRENT_TRANSLATIONS=5
# ============== Cleanup Service ==============
CLEANUP_ENABLED=true
CLEANUP_INTERVAL_MINUTES=15
FILE_TTL_MINUTES=60
INPUT_FILE_TTL_MINUTES=30
OUTPUT_FILE_TTL_MINUTES=120
# ============== Security ==============
# When behind Nginx (production), HSTS is set by the proxy; ENABLE_HSTS applies when running the app without a reverse proxy (e.g. dev).
ENABLE_HSTS=false
# Use "*" only for local development; set explicit origins in production (see .env.example).
CORS_ORIGINS=*
# ============== Admin Authentication ==============
ADMIN_USERNAME=admin
ADMIN_PASSWORD=changeme123 # Change in production!
# Or use a SHA256 hash:
# ADMIN_PASSWORD_HASH=your_sha256_hash
# ============== Monitoring ==============
LOG_LEVEL=INFO
ENABLE_REQUEST_LOGGING=true
MAX_MEMORY_PERCENT=80
```
### Ollama Setup
```bash
# Install Ollama (Windows)
winget install Ollama.Ollama
# Pull a model
ollama pull llama3.2
# For vision/image translation
ollama pull gemma3:12b
# or
ollama pull qwen3-vl:8b
```
## 🎯 Using System Prompts & Glossary
### Example: HVAC Translation
**System Prompt:**
```
You are translating HVAC technical documents.
Use precise technical terminology.
Keep unit measurements (kW, m³/h, Pa) unchanged.
```
**Glossary:**
```
batterie=coil
groupe froid=chiller
CTA=AHU (Air Handling Unit)
échangeur=heat exchanger
vanne 3 voies=3-way valve
```
### Presets Available
- 🔧 **HVAC**: Heating, Ventilation, Air Conditioning
- 💻 **IT**: Software and technology
- ⚖️ **Legal**: Legal documents
- 🏥 **Medical**: Healthcare terminology
## <20> Admin Dashboard
Access the admin dashboard at `/admin` in the frontend. Features:
- **System Status**: Health, uptime, and issues
- **Memory & Disk Monitoring**: Real-time usage stats
- **Translation Statistics**: Total translations, success rate
- **Rate Limit Management**: View active clients and limits
- **Cleanup Service**: Monitor and trigger manual cleanup
### Default Credentials
- **Username**: admin
- **Password**: changeme123
⚠️ **Change the default password in production!**
## 🏗️ Project Structure
```
Translate/
├── main.py # FastAPI application with SaaS features
├── config.py # Configuration with SaaS settings
├── requirements.txt # Dependencies
├── mcp_server.py # MCP server implementation
├── middleware/ # SaaS middleware
│ ├── __init__.py
│ ├── rate_limiting.py # Rate limiting with token bucket
│ ├── validation.py # Input validation
│ ├── security.py # Security headers & logging
│ └── cleanup.py # Auto cleanup service
├── services/
│ └── translation_service.py # Translation providers
├── translators/
│ ├── excel_translator.py # Excel with image support
│ ├── word_translator.py # Word with image support
│ └── pptx_translator.py # PowerPoint with image support
├── frontend/ # Next.js frontend
│ ├── src/
│ │ ├── app/
│ │ │ ├── page.tsx # Main translation page
│ │ │ ├── admin/ # Admin dashboard
│ │ │ └── settings/ # Settings pages
│ │ └── components/
│ └── package.json
├── static/
│ └── webllm.html # WebLLM standalone interface
├── uploads/ # Temporary uploads (auto-cleaned)
└── outputs/ # Translated files (auto-cleaned)
```
## 🛠️ Tech Stack
### Backend
- **FastAPI**: Modern async web framework
- **openpyxl**: Excel manipulation
- **python-docx**: Word documents
- **python-pptx**: PowerPoint presentations
- **deep-translator**: Google/DeepL/Libre translation
- **psutil**: System monitoring
- **python-magic**: File type validation
### Frontend
- **Next.js 15**: React framework
- **Tailwind CSS**: Styling
- **Lucide Icons**: Icon library
- **WebLLM**: Browser-based LLM
## 🔌 MCP Integration
This API can be used as an MCP (Model Context Protocol) server for AI assistants.
### VS Code Configuration
Add to your VS Code `settings.json` or `.vscode/mcp.json`:
```json
{
"servers": {
"document-translator": {
"type": "stdio",
"command": "python",
"args": ["mcp_server.py"],
"cwd": "D:/Translate",
"env": {
"PYTHONPATH": "D:/Translate"
}
}
}
}
```
## 🚀 Production Deployment
### Security Checklist
- [ ] Change `ADMIN_PASSWORD` or set `ADMIN_PASSWORD_HASH`
- [ ] Set `CORS_ORIGINS` to your frontend domain
- [ ] Enable `ENABLE_HSTS=true` if using HTTPS (when not behind Nginx; behind Nginx, HSTS is set by the proxy)
- [ ] Configure rate limits appropriately
- [ ] Set up log rotation for `logs/` directory
- [ ] Use a reverse proxy (nginx/traefik) for HTTPS
### Docker Deployment
En production, utilisez le stack Docker Compose avec Nginx en reverse proxy (ports 80/443):
```bash
# Avec certificats SSL dans docker/nginx/ssl/ (voir DEPLOYMENT_GUIDE.md)
docker compose up -d
```
- **Nginx** : terminaison TLS, HTTP→HTTPS, HSTS, routage `/api/*` → backend, `/*` → frontend (Story 6.5).
- Backend et frontend ne sont pas exposés sur lhôte ; tout passe par le proxy.
- Détails : [DEPLOYMENT_GUIDE.md](./DEPLOYMENT_GUIDE.md) (SSL/TLS, santé `/health`, variables denvironnement).
## 📝 License
MIT License
## 🤝 Contributing
Contributions welcome! Please submit a Pull Request.
---
**Built with ❤️ using Python, FastAPI, Next.js, and Ollama**