# RAG Modeling Project

## Overview

RAG Modeling is a Retrieval-Augmented Generation (RAG) system built around document processing. It extracts text, images, and tables from PDF documents to provide high-quality data for RAG applications.

## Features
- **Advanced PDF Processing**:
  - Multiple extraction methods for maximum text coverage
  - Image extraction with OCR capabilities
  - Table detection and extraction
  - Structured document parsing

- **Text Processing Pipeline**:
  - Intelligent text chunking for optimal context management
  - Support for multiple languages
  - Metadata preservation

- **Modular Architecture**:
  - Component-based design for easy extension
  - Configurable processing parameters
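To illustrate the text chunking and metadata preservation features listed above, here is a minimal sketch. It is not the project's actual implementation: the names `Chunk` and `chunk_text`, the parameters `max_chars` and `overlap`, and the fixed-size character window are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata of the document it came from."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text, metadata=None, max_chars=500, overlap=50):
    """Split text into overlapping fixed-size chunks (hypothetical helper).

    Each chunk gets a copy of the document-level metadata, extended with
    its own index and starting character offset.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")

    base = dict(metadata or {})
    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + max_chars]
        chunks.append(
            Chunk(piece, {**base, "chunk_index": index, "start_char": start})
        )
        # Stop once the window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


# Usage: a tiny window makes the overlap easy to see.
chunks = chunk_text("abcdefghij", {"source": "example.pdf"}, max_chars=4, overlap=1)
print([c.text for c in chunks])  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a little shared context between neighbouring chunks, and copying the source metadata onto every chunk lets a retrieved passage be traced back to its document. A real pipeline would likely split on sentence or paragraph boundaries rather than raw character offsets.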
## Installation

```bash
# Clone the repository
git clone https://gitea.parsanet.org/sepehr/rag.git
cd rag

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install additional dependencies
pip install unstructured pytesseract camelot-py opencv-python pandas
```