RAG Modeling Project

Overview

RAG Modeling is a Retrieval-Augmented Generation (RAG) system built around a document processing pipeline. It extracts text, images, and tables from PDF documents so that RAG applications can be built on clean, well-structured source data.

Features

  • Advanced PDF Processing:

    • Multiple extraction methods for maximum text coverage
    • Image extraction with OCR capabilities
    • Table detection and extraction
    • Structured document parsing
  • Text Processing Pipeline:

    • Intelligent text chunking for optimal context management
    • Support for multiple languages
    • Metadata preservation
  • Modular Architecture:

    • Component-based design for easy extension
    • Configurable processing parameters
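
As an illustration of the chunking and metadata-preservation features above, here is a minimal sketch of fixed-size chunking with overlap. It is not taken from this repository: the names `Chunk` and `chunk_text`, the parameters `chunk_size` and `overlap`, and the `start`/`end` metadata keys are all assumptions chosen for the example, and the project's actual chunking strategy may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata inherited from its source document."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text, chunk_size=500, overlap=50, metadata=None):
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: each chunk copies the document-level metadata and
    records its own character offsets so it can be traced back to the source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    base = dict(metadata or {})
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append(Chunk(piece, {**base, "start": start, "end": start + len(piece)}))
        # Stop once a window has reached the end of the text, so the tail
        # is not emitted a second time as a shorter duplicate chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1,
                            metadata={"source": "example.pdf"}):
        print(chunk.text, chunk.metadata)
```

Character windows are the simplest option; splitting on sentence or paragraph boundaries usually gives more coherent chunks, at the cost of variable chunk lengths.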

Installation

# Clone the repository
git clone https://gitea.parsanet.org/sepehr/rag.git
cd rag

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

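# Install additional dependencies for document parsing, OCR, and table extraction
# Note: pytesseract is a wrapper and needs the Tesseract OCR engine installed separately
pip install unstructured pytesseract camelot-py opencv-python pandas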
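
After installation, a quick check like the one below can confirm that the additional packages are importable. This is a sketch, not part of the repository: the helper name `check_modules` is hypothetical, and the import names are those normally exposed by the packages listed above (`camelot-py` imports as `camelot`, `opencv-python` as `cv2`).

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names assumed for the packages installed above.
REQUIRED = ["unstructured", "pytesseract", "camelot", "cv2", "pandas"]

if __name__ == "__main__":
    for name, available in check_modules(REQUIRED).items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```

This only checks that the Python packages are present; it does not verify that the Tesseract binary itself is installed and on the PATH.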