# RAG Modeling Project

## Overview

RAG Modeling is a Retrieval-Augmented Generation (RAG) system built around document processing. It extracts text, images, and tables from PDF documents so that RAG applications can be built on clean, well-structured input.

## Features

- **Advanced PDF Processing**:
  - Multiple extraction methods for maximum text coverage
  - Image extraction with OCR capabilities
  - Table detection and extraction
  - Structured document parsing

- **Text Processing Pipeline**:
  - Intelligent text chunking for optimal context management
  - Support for multiple languages
  - Metadata preservation

- **Modular Architecture**:
  - Component-based design for easy extension
  - Configurable processing parameters
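
To make the chunking, metadata preservation, and configurable parameters listed above more concrete, here is a minimal sketch of overlapping fixed-size chunking. It is not the project's actual implementation: the names `ChunkConfig`, `Chunk`, and `chunk_text`, the default sizes, and the metadata keys are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkConfig:
    """Hypothetical chunking parameters (not the project's actual settings)."""
    chunk_size: int = 200  # maximum characters per chunk
    overlap: int = 40      # characters shared by consecutive chunks


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text, metadata=None, config=ChunkConfig()):
    """Split text into overlapping chunks, copying document metadata onto each."""
    if config.chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= config.overlap < config.chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = config.chunk_size - config.overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + config.chunk_size]
        # Keep the source metadata and record where the chunk came from.
        chunk_meta = {**(metadata or {}), "chunk_index": index, "start": start}
        chunks.append(Chunk(piece, chunk_meta))
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + config.chunk_size >= len(text):
            break
    return chunks
```

Calling `chunk_text(page_text, {"source": "report.pdf", "page": 3})` would return chunks that each carry the source file and page alongside their own index and offset. Character-based splitting is the simplest option; a sentence- or token-aware splitter may suit multilingual documents better.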

## Installation

```bash
# Clone the repository
git clone https://gitea.parsanet.org/sepehr/rag.git
cd rag

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

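# Install additional dependencies for document parsing, OCR, and table extraction
# (pytesseract is a wrapper and also needs the Tesseract OCR engine installed on the system)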
pip install unstructured pytesseract camelot-py opencv-python pandas