# Document Translation API
A Python API that translates structured Office documents (Excel, Word, PowerPoint) while **strictly preserving** the original formatting, layout, and embedded media.
## 🎯 Features
### Excel Translation (.xlsx)
- ✅ Translates all cell content and sheet names
- ✅ Preserves cell merging
- ✅ Maintains font styles (size, bold, italic, color)
- ✅ Keeps background colors and borders
- ✅ Translates text within formulas while preserving formula structure
- ✅ Retains embedded images in original positions
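
One way to translate text inside a formula without touching its structure is to rewrite only the double-quoted string literals. The sketch below is an illustration, not the code in `excel_translator.py`: `translate_formula` and the `translate` callback are hypothetical names, and it relies on the Excel convention that a literal quote inside a string is written as `""`.

```python
import re
from typing import Callable

# Matches one Excel string literal; a doubled quote ("") is an escaped quote.
_STRING_LITERAL = re.compile(r'"((?:[^"]|"")*)"')


def translate_formula(value: str, translate: Callable[[str], str]) -> str:
    """Translate plain text, or only the string literals of a formula.

    `translate` is a hypothetical callback that maps source text to
    translated text (for example, a call into a translation provider).
    """
    if not value.startswith("="):
        # Not a formula: translate the whole cell value.
        return translate(value)

    def _swap(match: "re.Match[str]") -> str:
        text = match.group(1).replace('""', '"')  # unescape quotes
        if not text.strip():
            return match.group(0)  # leave empty or whitespace literals alone
        return '"' + translate(text).replace('"', '""') + '"'  # re-escape

    # Function names, cell references, and operators sit outside the
    # quoted literals, so they are never passed to `translate`.
    return _STRING_LITERAL.sub(_swap, value)


if __name__ == "__main__":
    # str.upper stands in for a real translation call.
    print(translate_formula('=IF(A1>0,"Yes","No")', str.upper))
```

A literal that is used as a lookup key (for example, the first argument of `VLOOKUP`) would also be rewritten by this approach, so a real implementation may need to skip such cases.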
### Word Translation (.docx)
- ✅ Translates body text, headers, footers, and tables
- ✅ Preserves heading styles and paragraph formatting
- ✅ Maintains lists (numbered/bulleted)
- ✅ Keeps embedded images, charts, and SmartArt in place
- ✅ Preserves table structures and cell formatting
### PowerPoint Translation (.pptx)
- ✅ Translates slide titles, body text, and speaker notes
- ✅ Preserves slide layouts and transitions
- ✅ Maintains animations
- ✅ Keeps images, videos, and shapes in exact positions
- ✅ Preserves layering order
## 🚀 Quick Start
### Installation
1. **Clone the repository:**
```powershell
git clone <repository-url>
cd Translate
```
2. **Create a virtual environment:**
```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```
3. **Install dependencies:**
```powershell
pip install -r requirements.txt
```
4. **Configure environment:**
```powershell
cp .env.example .env
# Edit .env with your preferred settings
```
5. **Run the API:**
```powershell
python main.py
```
The API starts at `http://localhost:8000`.
## 📚 API Documentation
Once the server is running, visit:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## 🔧 API Endpoints
### POST /translate
Translate a single document
**Request:**
```bash
curl -X POST "http://localhost:8000/translate" \
  -F "file=@document.xlsx" \
  -F "target_language=es" \
  -F "source_language=auto"
```
**Response:**
Returns the translated document file
### POST /translate-batch
Translate multiple documents at once
**Request:**
```bash
curl -X POST "http://localhost:8000/translate-batch" \
  -F "files=@document1.docx" \
  -F "files=@document2.pptx" \
  -F "target_language=fr"
```
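
The same batch request can be sent from Python by repeating the `files` field once per document. This is a minimal sketch that assumes the `requests` package is installed; `build_batch_fields` and `translate_batch` are hypothetical helper names. Because this README does not describe the batch response format, the sketch returns the raw response rather than guessing at its contents.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def build_batch_fields(paths: Iterable[str]) -> List[Tuple[str, Tuple[str, bytes]]]:
    """Build one ("files", (filename, content)) entry per document.

    Repeating the same field name mirrors the repeated -F "files=@..."
    arguments in the curl example.
    """
    return [("files", (Path(p).name, Path(p).read_bytes())) for p in paths]


def translate_batch(paths, target_language, base_url="http://localhost:8000"):
    """POST several documents to /translate-batch and return the raw response."""
    import requests  # third-party; imported here so the helper above stays stdlib-only

    response = requests.post(
        f"{base_url}/translate-batch",
        files=build_batch_fields(paths),
        data={"target_language": target_language},
        timeout=300,  # assumption: large documents may take a while
    )
    response.raise_for_status()
    return response
```

For example, `translate_batch(["document1.docx", "document2.pptx"], "fr")` sends the same request as the curl command.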
### GET /languages
Get list of supported language codes
### GET /health
Health check endpoint
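
A script that starts the server and then immediately sends documents can poll `/health` first. The sketch below uses only the standard library and treats any HTTP 200 response as healthy, since the response body is not specified here; `is_healthy` and `wait_until_healthy` are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(probe=is_healthy, attempts: int = 10, delay: float = 0.5) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```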
## 💻 Usage Examples
### Python Example
```python
import requests

# Translate a document
with open('document.xlsx', 'rb') as f:
    files = {'file': f}
    data = {
        'target_language': 'es',
        'source_language': 'auto'
    }
    response = requests.post('http://localhost:8000/translate', files=files, data=data)

# Stop on an HTTP error instead of saving an error body as a spreadsheet
response.raise_for_status()

# Save translated file
with open('translated_document.xlsx', 'wb') as output:
    output.write(response.content)
```
### JavaScript/TypeScript Example
```javascript
// `fileInput` is assumed to be an <input type="file"> element on the page
const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('target_language', 'fr');
formData.append('source_language', 'auto');

const response = await fetch('http://localhost:8000/translate', {
  method: 'POST',
  body: formData
});
if (!response.ok) {
  throw new Error(`Translation failed with HTTP ${response.status}`);
}

// Trigger a browser download of the translated file
const blob = await response.blob();
const url = window.URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'translated_document.docx';
a.click();
```
### PowerShell Example
```powershell
# The -Form parameter requires PowerShell 6.1 or later
$file = Get-Item "document.pptx"
$uri = "http://localhost:8000/translate"
$form = @{
    file = $file
    target_language = "de"
    source_language = "auto"
}
Invoke-RestMethod -Uri $uri -Method Post -Form $form -OutFile "translated_document.pptx"
```
## 🌐 Supported Languages
The API supports 25+ languages including:
- Spanish (es), French (fr), German (de)
- Italian (it), Portuguese (pt), Russian (ru)
- Chinese (zh), Japanese (ja), Korean (ko)
- Arabic (ar), Hindi (hi), Dutch (nl)
- And many more...
Full list available at: `GET /languages`
## ⚙️ Configuration
Edit the `.env` file to configure the service:
```env
# Translation Service (google, deepl, libre)
TRANSLATION_SERVICE=google
# DeepL API Key (if using DeepL)
DEEPL_API_KEY=your_api_key_here
# File Upload Limits
MAX_FILE_SIZE_MB=50
# Directory Configuration
UPLOAD_DIR=./uploads
OUTPUT_DIR=./outputs
```
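
As a rough illustration of how these variables could be read and validated at startup, the sketch below uses only the standard library. It is not the contents of `config.py`: `Settings` and `load_settings` are hypothetical names, and the fallback defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

SUPPORTED_SERVICES = {"google", "deepl", "libre"}


@dataclass(frozen=True)
class Settings:
    translation_service: str
    deepl_api_key: Optional[str]
    max_file_size_mb: int
    upload_dir: Path
    output_dir: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env

    service = env.get("TRANSLATION_SERVICE", "google").strip().lower()
    if service not in SUPPORTED_SERVICES:
        raise ValueError(f"Unsupported TRANSLATION_SERVICE: {service!r}")

    api_key = env.get("DEEPL_API_KEY") or None
    if service == "deepl" and api_key is None:
        raise ValueError("DEEPL_API_KEY is required when TRANSLATION_SERVICE=deepl")

    max_mb = int(env.get("MAX_FILE_SIZE_MB", "50"))
    if max_mb <= 0:
        raise ValueError("MAX_FILE_SIZE_MB must be a positive integer")

    return Settings(
        translation_service=service,
        deepl_api_key=api_key,
        max_file_size_mb=max_mb,
        upload_dir=Path(env.get("UPLOAD_DIR", "./uploads")),
        output_dir=Path(env.get("OUTPUT_DIR", "./outputs")),
    )
```

Validating at startup surfaces a missing DeepL key or a mistyped service name before the first upload rather than in the middle of a translation.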
## 🔌 Model Context Protocol (MCP) Integration
This API is designed to be easily wrapped as an MCP server for future integration with AI assistants and tools.
### MCP Server Structure (Future Implementation)
```json
{
  "mcpServers": {
    "document-translator": {
      "command": "python",
      "args": ["-m", "mcp_server"],
      "env": {
        "API_URL": "http://localhost:8000"
      }
    }
  }
}
```
### Example MCP Tools
The MCP wrapper will expose these tools:
1. **translate_document** - Translate a single document
2. **translate_batch** - Translate multiple documents
3. **get_supported_languages** - List supported languages
4. **check_translation_status** - Check the status of a translation
## 🏗️ Project Structure
```
Translate/
├── main.py # FastAPI application
├── config.py # Configuration management
├── requirements.txt # Dependencies
├── .env.example # Environment template
├── services/
│ ├── __init__.py
│ └── translation_service.py # Translation abstraction layer
├── translators/
│ ├── __init__.py
│ ├── excel_translator.py # Excel translation logic
│ ├── word_translator.py # Word translation logic
│ └── pptx_translator.py # PowerPoint translation logic
├── utils/
│ ├── __init__.py
│ ├── file_handler.py # File operations
│ └── exceptions.py # Custom exceptions
├── uploads/ # Temporary upload storage
└── outputs/ # Translated files
```
## 🧪 Testing
### Manual Testing
1. Start the API server
2. Navigate to http://localhost:8000/docs
3. Use the interactive Swagger UI to test endpoints
### Test Files
Prepare test files with:
- Complex formatting (multiple fonts, colors, styles)
- Embedded images and media
- Tables and merged cells
- Formulas (for Excel)
- Multiple sections/slides
## 🛠️ Technical Details
### Libraries Used
- **FastAPI**: Modern web framework for building APIs
- **openpyxl**: Excel file manipulation with formatting preservation
- **python-docx**: Word document handling
- **python-pptx**: PowerPoint presentation processing
- **deep-translator**: Multi-provider translation service
- **Uvicorn**: ASGI server for running FastAPI
### Design Principles
1. **Modular Architecture**: Each file type has its own translator module
2. **Provider Abstraction**: Easy to swap translation services (Google, DeepL, LibreTranslate)
3. **Format Preservation**: All translators maintain original document structure
4. **Error Handling**: Comprehensive error handling and logging
5. **Scalability**: Ready for MCP integration and microservices architecture
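
The provider abstraction in principle 2 can be pictured as a small interface that every translation backend implements, plus a registry keyed by service name. The sketch below is illustrative only and does not reproduce `services/translation_service.py`: `TranslationProvider`, `TaggingProvider`, `get_provider`, and `translate_texts` are hypothetical names, and the stand-in provider makes no network call.

```python
from typing import Callable, Dict, List, Protocol


class TranslationProvider(Protocol):
    """Interface each backend (Google, DeepL, LibreTranslate) would implement."""

    def translate(self, text: str, source: str, target: str) -> str: ...


class TaggingProvider:
    """Stand-in backend for demos and tests: prefixes text with the target code."""

    def translate(self, text: str, source: str, target: str) -> str:
        return f"[{target}] {text}"


# Registry of provider factories; real backends would be registered here.
PROVIDERS: Dict[str, Callable[[], TranslationProvider]] = {
    "fake": TaggingProvider,
}


def get_provider(name: str) -> TranslationProvider:
    """Look up a provider by its configured name."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown translation service: {name!r}") from None


def translate_texts(texts: List[str], provider: TranslationProvider,
                    target: str, source: str = "auto") -> List[str]:
    """Translate each non-blank string; blank strings pass through unchanged."""
    return [provider.translate(t, source, target) if t.strip() else t for t in texts]
```

Because the document translators depend only on the interface, swapping services becomes a configuration change rather than a code change in each translator module.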
## 🔐 Security Considerations
For production deployment:
1. **Configure CORS** properly in `main.py`
2. **Add authentication** for API endpoints
3. **Implement rate limiting** to prevent abuse
4. **Use HTTPS** for secure file transmission
5. **Sanitize file uploads** to prevent malicious files
6. **Set appropriate file size limits**
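
Items 5 and 6 can start with a few inexpensive checks before a file reaches a translator. The sketch below is a starting point under stated assumptions, not a complete defense: `validate_upload` is a hypothetical helper, and the container check relies on the fact that `.xlsx`, `.docx`, and `.pptx` files are ZIP archives.

```python
import io
import zipfile
from pathlib import PureWindowsPath

ALLOWED_EXTENSIONS = {".xlsx", ".docx", ".pptx"}


def validate_upload(filename: str, data: bytes, max_size_mb: int = 50) -> str:
    """Return a safe base filename, or raise ValueError if the upload is rejected."""
    # PureWindowsPath splits on both "/" and "\", so directory components
    # such as "../../" are dropped regardless of the client's platform.
    name = PureWindowsPath(filename).name
    extension = PureWindowsPath(name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")

    if len(data) > max_size_mb * 1024 * 1024:
        raise ValueError(f"File exceeds the {max_size_mb} MB limit")

    # Reject content that is not a ZIP container despite its extension.
    if not zipfile.is_zipfile(io.BytesIO(data)):
        raise ValueError("File content is not a valid Office Open XML container")

    return name
```

This does not inspect the archive contents, so protections such as limits on decompressed size would still need to be added separately.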
## 📝 License
MIT License - Feel free to use this project for your needs.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📧 Support
For issues and questions, please open an issue on the repository.
---
**Built with ❤️ using Python and FastAPI**