This repository contains a Retrieval-Augmented Generation (RAG) chatbot that processes source documents and answers questions based on the context retrieved from them.

Requirements

Python Version

⚠️ Important: This project requires a Python version lower than 3.12. Python 3.11 is known to work.
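
A script can check the interpreter version before importing the project's dependencies. The following is a minimal sketch and not part of the repository: the names `is_supported_python` and `require_supported_python` are hypothetical, and the upper bound is taken from the requirement stated above.

```python
import sys

# Upper bound taken from the requirement above: versions below 3.12 are supported.
MAX_EXCLUSIVE = (3, 12)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is Python 3 and below 3.12."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 0) <= major_minor < MAX_EXCLUSIVE


def require_supported_python(version_info=None):
    """Raise RuntimeError when the interpreter version is outside the supported range."""
    if version_info is None:
        version_info = sys.version_info
    if not is_supported_python(version_info):
        raise RuntimeError(
            "Unsupported Python %d.%d: this project requires a version lower than 3.12."
            % tuple(version_info[:2])
        )
```

Calling `require_supported_python()` at the top of an entry-point script would stop early with a readable message instead of failing later during dependency import.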

Installation

Clone this repository:

git clone <repository-url>
cd <repository-name>

Install the required dependencies:

pip install -r requirement.txt

Usage

Command Line Interface

Run the chatbot in terminal mode:

python cli.py

Web Interface

Launch the Gradio web interface:

python gradio_chatbot.py

RAG Implementation

To use the RAG functionality from your own Python script, import the RagChatbot class:

from rag_chatbot import RagChatbot

chatbot = RagChatbot()
response = chatbot.query("your question here")
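
When several questions need answering in one run, the `query` call shown above can be wrapped in a small helper. This is a sketch under assumptions: `ask_all` is a hypothetical helper, not part of the repository, and it relies only on the chatbot object exposing `query(question)` as in the snippet above. The type of the returned value is not documented here, so it is stored as-is. A stand-in class is used so the sketch runs without the project's dependencies.

```python
def ask_all(chatbot, questions):
    """Send each non-empty question to chatbot.query and collect the answers in order."""
    answers = []
    for question in questions:
        question = question.strip()
        if not question:
            continue  # skip blank input rather than querying with an empty string
        answers.append((question, chatbot.query(question)))
    return answers


class EchoChatbot:
    """Stand-in for RagChatbot (assumption: only the query method is needed)."""

    def query(self, question):
        return "echo: " + question


results = ask_all(EchoChatbot(), ["What is RAG?", "  ", "Which database is used?"])
for question, answer in results:
    print(question, "->", answer)
```

With the real class, `ask_all(RagChatbot(), questions)` would be the equivalent call, assuming the constructor works without arguments as in the snippet above.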

PDF Processing

The repository includes two methods for processing PDF documents as knowledge sources:

PDF Processing Class

The configurable PdfProcessor class extracts text, images, and tables from PDF documents and stores them in a Qdrant vector database.

Key features:

Support for both Ollama and OpenAI models
Configurable embedding, text summarization, and image analysis models
Automatic text chunking based on document structure
Image and table extraction with descriptions
Customizable Qdrant collection configuration

Example usage:

from pdf_processor import PdfProcessor

# Basic usage with default settings
processor = PdfProcessor()
result = processor.process_pdf("path/to/document.pdf")

# Custom configuration
config = {
    "embedding_provider": "openai",
    "image_provider": "openai",
    "collection_name": "my_documents",
    "openai_api_key": "your-api-key",
    "summary_language": "French"
}
processor = PdfProcessor(config)
processor.process_pdf("path/to/document.pdf")
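
Hard-coding an API key in a script makes it easy to leak. One possible alternative is to build the configuration from the environment and then process every PDF in a folder. The sketch below is illustrative only: the helper names `build_config` and `process_folder` are hypothetical, and the environment variable name `OPENAI_API_KEY` is an assumption. The configuration keys are the ones used in the example above. The processor is passed in as an argument, so the sketch assumes only a `process_pdf(path)` method.

```python
import os
from pathlib import Path


def build_config(collection_name, env=None):
    """Build a PdfProcessor-style config dict, reading the API key from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")  # assumed variable name
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return {
        "embedding_provider": "openai",
        "image_provider": "openai",
        "collection_name": collection_name,
        "openai_api_key": api_key,
    }


def process_folder(processor, folder):
    """Call processor.process_pdf on each .pdf file in folder, in sorted order."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        results[pdf_path.name] = processor.process_pdf(str(pdf_path))
    return results
```

With the real class this might look like `process_folder(PdfProcessor(build_config("my_documents")), "path/to/pdfs")`, where the folder path is a placeholder.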

Jupyter Notebook

For interactive PDF processing, you can also use the Jupyter notebook final_pdf.ipynb.

Project Structure

cli.py: Command-line interface implementation
gradio_chatbot.py: Gradio web interface
rag_chatbot.py: Core RAG implementation
pdf_processor.py: PDF processing and vectorization
final_pdf.ipynb: Jupyter notebook for PDF processing