# Technical Specification: Document Parsing & Q&A (PDF Analysis)

## A. Prisma Schema Updates

### A1. `NoteAttachment` model

Stores the files attached to a note (PDFs, images, documents).

```prisma
model NoteAttachment {
  id        String   @id @default(cuid())
  noteId    String
  fileName  String
  fileType  String   // "application/pdf", "image/png", etc.
  fileSize  Int      // in bytes
  filePath  String   // local path: data/uploads/attachments/{noteId}/{uuid}.pdf
  mimeType  String   // redundant with fileType, kept for fast queries
  status    String   @default("pending") // pending → processing → ready | failed
  pageCount Int?     // number of pages (PDF only)
  error     String?  // error message when status is "failed"
  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt

  note   Note            @relation(fields: [noteId], references: [id], onDelete: Cascade)
  chunks DocumentChunk[]

  @@index([noteId])
  @@index([status])
}
```

### A2. `DocumentChunk` model

Vectorized fragments of a document. Each chunk is linked to an attachment and, transitively, to a note.

```prisma
model DocumentChunk {
  id           String   @id @default(cuid())
  attachmentId String
  content      String   // fragment text (target 800 characters, about 200 tokens; see B3)
  chunkIndex   Int      // ordinal position in the document (0, 1, 2…)
  pageNumber   Int?     // source page (for citations)
  startChar    Int?     // start character offset in the extracted text
  endChar      Int?     // end character offset
  metadata     String?  // JSON: { heading, section, tableCaption… }
  embedding    Unsupported("vector(1536)")?
  createdAt    DateTime @default(now())

  attachment NoteAttachment @relation(fields: [attachmentId], references: [id], onDelete: Cascade)

  @@index([attachmentId])
  @@index([attachmentId, chunkIndex])
}
```

### A3. Addition to `Note`

```prisma
model Note {
  // … existing fields …
  attachments NoteAttachment[]
}
```

### A4. Raw SQL migration: HNSW index for `DocumentChunk`

```sql
-- To be added to the Prisma migration (migration.sql)
CREATE INDEX IF NOT EXISTS "DocumentChunk_embedding_hnsw_idx"
  ON "DocumentChunk"
  USING hnsw ("embedding" vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

---

## B. Ingestion Pipeline (Chunking & Embeddings)

### B1. Pipeline architecture

```
PDF upload → NoteAttachment (status: pending)
    ↓
pdf-parse extraction (raw text + page metadata)
    ↓
Structural chunking (800 chars, overlap 200, page boundaries respected)
    ↓
DocumentChunk.create (content, chunkIndex, pageNumber, metadata)
    ↓
Batch embeddings (Promise.all per batch of 20)
    ↓
SQL UPDATE of the embedding on each chunk
    ↓
NoteAttachment.update (status: ready)
```

### B2. Extraction service: `document-extraction.service.ts`

```typescript
// lib/ai/services/document-extraction.service.ts
import { readFile } from 'fs/promises'
import pdf from 'pdf-parse'

interface ExtractedPage {
  pageNumber: number
  text: string
}

interface ExtractedDocument {
  pages: ExtractedPage[]
  totalPages: number
  metadata: { title?: string; author?: string }
}

export class DocumentExtractionService {
  async extractPdf(filePath: string): Promise<ExtractedDocument> {
    const dataBuffer = await readFile(filePath)
    const pages: ExtractedPage[] = []

    // pdf-parse does not expose per-page text directly, so a custom page
    // renderer collects it. Pages are rendered one after the other, in
    // order, which is why the array length gives the page number.
    const renderPage = async (pageData: any): Promise<string> => {
      const textContent = await pageData.getTextContent()
      let lastY: number | undefined
      let text = ''
      for (const item of textContent.items) {
        // Items sharing the same baseline (Y coordinate) are on the same line
        text += lastY === undefined || lastY === item.transform[5] ? item.str : '\n' + item.str
        lastY = item.transform[5]
      }
      pages.push({ pageNumber: pages.length + 1, text })
      return text
    }

    // Single parse: max 0 means all pages. The `as any` cast is there because
    // some pdf-parse typings may declare a synchronous renderer, although the
    // library awaits the value returned by the renderer.
    const data = await pdf(dataBuffer, { max: 0, pagerender: renderPage as any })

    return {
      pages,
      totalPages: data.numpages,
      metadata: {
        title: data.info?.Title,
        author: data.info?.Author,
      },
    }
  }
}

export const documentExtractionService = new DocumentExtractionService()
```

### B3. Chunking strategy: `document-chunking.service.ts`

**Principles:**

1. **Target size**: 800 characters (~200 tokens), with a 200-character overlap.
2. **Page boundaries**: a chunk NEVER spans two pages. A chunk is closed at the end of its page even if it is shorter than the target size.
3. **Section boundaries**: headings (lines in UPPERCASE or prefixed with `#`, `##`) start a new chunk.
4. **Contextual overlap**: within a page, the last 200 characters of chunk N are repeated at the start of chunk N+1.
5. **Tables**: kept whole in a single chunk if shorter than 1500 characters, otherwise split by row with the header repeated.

```typescript
// lib/ai/services/document-chunking.service.ts
interface ChunkInput {
  text: string
  pageNumber: number
}

interface DocumentChunkData {
  content: string
  chunkIndex: number
  pageNumber: number
  startChar: number // offsets into the page text, overlap prefix excluded
  endChar: number
  metadata?: string
}

interface Section {
  text: string
  start: number
  end: number
  isHeading: boolean
}

export class DocumentChunkingService {
  private readonly CHUNK_SIZE = 800
  private readonly OVERLAP = 200
  private readonly MAX_CHUNK_SIZE = 1500

  // Note: table handling (principle 5) is not implemented here.
  chunk(pages: ChunkInput[]): DocumentChunkData[] {
    const chunks: DocumentChunkData[] = []

    for (const page of pages) {
      // The buffer is reset on every page so that no chunk, overlap
      // included, mixes text from two pages (principle 2).
      let buffer = ''
      let startChar = 0
      let endChar = 0

      const flush = () => {
        if (!buffer.trim()) return
        chunks.push({
          content: buffer.trim(),
          chunkIndex: chunks.length, // global ordinal across all pages
          pageNumber: page.pageNumber,
          startChar,
          endChar,
        })
      }

      for (const section of this.splitSections(page.text)) {
        const tooLong = buffer.length + section.text.length > this.CHUNK_SIZE
        if (buffer && (section.isHeading || tooLong)) {
          flush()
          // Overlap: carry the last OVERLAP characters into the next chunk
          buffer = buffer.slice(-this.OVERLAP) + '\n' + section.text
          startChar = section.start
        } else {
          if (!buffer) startChar = section.start
          buffer += (buffer ? '\n\n' : '') + section.text
        }
        endChar = section.end
      }
      flush() // remainder of the page
    }
    return chunks
  }

  // Splits a page into sections at headings and blank lines.
  private splitSections(text: string): Section[] {
    const sections: Section[] = []
    let start = -1 // offset of the open section, -1 when none is open
    let end = 0
    let isHeading = false
    let offset = 0 // offset of the current line within `text`

    const close = () => {
      if (start < 0) return
      // A section longer than MAX_CHUNK_SIZE is cut into CHUNK_SIZE slices
      const step = end - start > this.MAX_CHUNK_SIZE ? this.CHUNK_SIZE : end - start
      for (let s = start; s < end; s += step) {
        const e = Math.min(s + step, end)
        sections.push({ text: text.slice(s, e), start: s, end: e, isHeading: isHeading && s === start })
      }
      start = -1
    }

    for (const line of text.split('\n')) {
      const trimmed = line.trim()
      const headingLine = /^(#{1,6}\s|[A-Z][A-Z\s]{5,}$)/.test(trimmed)
      if (!trimmed || headingLine) close() // blank lines and headings end a section
      if (trimmed) {
        if (start < 0) {
          start = offset
          isHeading = headingLine
        }
        end = offset + line.length
      }
      offset += line.length + 1 // +1 for the '\n' removed by split
    }
    close()
    return sections
  }
}

export const documentChunkingService = new DocumentChunkingService()
```

### B4. Orchestrating ingestion service: `document-ingestion.service.ts`

```typescript
// lib/ai/services/document-ingestion.service.ts
import { documentExtractionService } from './document-extraction.service'
import { documentChunkingService } from './document-chunking.service'
// prisma and embeddingService come from the existing codebase
// (imports omitted: their paths depend on the project layout).

export class DocumentIngestionService {
  async ingest(attachmentId: string): Promise<void> {
    const attachment = await prisma.noteAttachment.findUnique({
      where: { id: attachmentId },
    })
    if (!attachment) throw new Error('Attachment not found')

    await prisma.noteAttachment.update({
      where: { id: attachmentId },
      data: { status: 'processing', error: null },
    })

    try {
      // 1. Extraction
      const extracted = await documentExtractionService.extractPdf(attachment.filePath)
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { pageCount: extracted.totalPages },
      })

      // 2. Chunking
      const chunkInputs = extracted.pages.map(p => ({
        text: p.text,
        pageNumber: p.pageNumber,
      }))
      const chunks = documentChunkingService.chunk(chunkInputs)

      // 3. Create the chunks in the database (without embeddings).
      // Chunks left over from a previous failed run are removed first, and
      // the inserts share one transaction instead of N parallel queries.
      await prisma.documentChunk.deleteMany({ where: { attachmentId } })
      const created = await prisma.$transaction(
        chunks.map(c =>
          prisma.documentChunk.create({
            data: {
              attachmentId,
              content: c.content,
              chunkIndex: c.chunkIndex,
              pageNumber: c.pageNumber,
              startChar: c.startChar,
              endChar: c.endChar,
              metadata: c.metadata,
            },
          })
        )
      )

      // 4. Batch embeddings (batches of 20)
      const BATCH_SIZE = 20
      for (let i = 0; i < created.length; i += BATCH_SIZE) {
        const batch = created.slice(i, i + BATCH_SIZE)
        const texts = batch.map(c => c.content)
        const embeddings = await embeddingService.generateBatchEmbeddings(texts)

        await Promise.all(
          batch.map((chunk, idx) =>
            prisma.$executeRawUnsafe(
              `UPDATE "DocumentChunk" SET embedding = $1::vector WHERE id = $2`,
              embeddingService.toVectorString(embeddings[idx].embedding),
              chunk.id
            )
          )
        )
      }

      // 5. Mark as ready
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'ready' },
      })
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error)
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'failed', error: message },
      })
      throw error
    }
  }
}

export const documentIngestionService = new DocumentIngestionService()
```

### B5. Upload API route

```typescript
// app/api/notes/[noteId]/attachments/route.ts
import { randomUUID } from 'crypto'
import { mkdir, writeFile } from 'fs/promises'
import path from 'path'
import { NextRequest, NextResponse } from 'next/server'
// auth, prisma and documentIngestionService come from the existing codebase
// (imports omitted: their paths depend on the project layout).

const MAX_FILE_SIZE = 20 * 1024 * 1024 // 20 MB

// Local helper for error responses
const fail = (message: string, status: number) =>
  NextResponse.json({ success: false, error: message }, { status })

export async function POST(
  req: NextRequest,
  { params }: { params: Promise<{ noteId: string }> }
) {
  const session = await auth()
  if (!session?.user?.id) return fail('Unauthorized', 401)

  const { noteId } = await params

  // The note must exist and belong to the caller
  const note = await prisma.note.findFirst({
    where: { id: noteId, userId: session.user.id },
    select: { id: true },
  })
  if (!note) return fail('Note not found', 404)

  const formData = await req.formData()
  const file = formData.get('file')

  // Validation
  if (!(file instanceof File)) return fail('Missing file', 400)
  if (file.size > MAX_FILE_SIZE) return fail('File too large (max 20MB)', 413)
  if (file.type !== 'application/pdf') return fail('Only PDF supported', 415)

  // Save the file
  const dir = `data/uploads/attachments/${noteId}`
  await mkdir(dir, { recursive: true })
  const filePath = path.join(dir, `${randomUUID()}.pdf`)
  await writeFile(filePath, Buffer.from(await file.arrayBuffer()))

  // Create the attachment
  const attachment = await prisma.noteAttachment.create({
    data: {
      noteId,
      fileName: file.name,
      fileType: file.type,
      fileSize: file.size,
      filePath,
      mimeType: file.type,
      status: 'pending',
    },
  })

  // Start ingestion in the background. ingest() already records failures on
  // the attachment; the catch only prevents an unhandled rejection.
  setImmediate(() => {
    documentIngestionService.ingest(attachment.id).catch(err => {
      console.error('Document ingestion failed', err)
    })
  })

  return NextResponse.json({ success: true, data: attachment })
}
```

---

## C. New Agent Tool Interface: `document_search`

### C1. Registration in the registry

```typescript
// lib/ai/tools/document-search.tool.ts
import { tool } from 'ai'
import { z } from 'zod'
// toolRegistry, prisma, embeddingService and assertSafeId come from the
// existing codebase (imports omitted: their paths depend on the project layout).

toolRegistry.register({
  name: 'document_search',
  description: 'Search within PDF documents attached to notes. Returns relevant passages with page numbers and source document info.',
  isInternal: true,
  buildTool: (ctx) =>
    tool({
      description: `Search within PDF documents attached to the user's notes. Returns matching passages with page numbers, chunk content, and the source note/document info. Use this when the user asks about specific documents, PDFs, or attached files. Can search across all documents or within a specific note's attachments.`,
      inputSchema: z.object({
        query: z.string().describe('The search query to find relevant passages in documents'),
        noteId: z.string().optional().describe('Optional: restrict search to attachments of a specific note'),
        limit: z.number().optional().default(5).describe('Max results to return (default 5)'),
      }),
      execute: async ({ query, noteId, limit }) => {
        try {
          const queryEmbedding = await embeddingService.generateEmbedding(query)
          const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)

          assertSafeId(ctx.userId, 'userId')
          const params: any[] = [vectorStr, limit, ctx.userId]

          // Optional scope filter. The value is pushed first so that
          // params.length is the 1-based index of its placeholder.
          let scopeFilter = ''
          if (noteId) {
            assertSafeId(noteId, 'noteId')
            params.push(noteId)
            scopeFilter = `AND na."noteId" = $${params.length}`
          } else if (ctx.notebookId) {
            assertSafeId(ctx.notebookId, 'notebookId')
            params.push(ctx.notebookId)
            scopeFilter = `AND n."notebookId" = $${params.length}`
          }

          // The operator is applied to the vector column itself (no ::text
          // cast), which is also what lets the HNSW index be used.
          const rows = (await prisma.$queryRawUnsafe(
            `SELECT dc.id AS "chunkId", dc.content, dc."pageNumber", dc."chunkIndex", dc.metadata,
                    na.id AS "attachmentId", na."fileName", na."pageCount", na."noteId",
                    n.title AS "noteTitle",
                    dc.embedding <=> $1::vector AS distance
             FROM "DocumentChunk" dc
             JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
             JOIN "Note" n ON n.id = na."noteId"
             WHERE dc.embedding IS NOT NULL
               AND na.status = 'ready'
               AND n."trashedAt" IS NULL
               AND n."userId" = $3
               ${scopeFilter}
             ORDER BY dc.embedding <=> $1::vector
             LIMIT $2`,
            ...params
          )) as any[]

          const threshold = 0.5 // maximum cosine distance kept
          const results = rows
            .filter(r => r.distance < threshold)
            .map(r => ({
              content: r.content.substring(0, 600),
              pageNumber: r.pageNumber,
              chunkIndex: r.chunkIndex,
              fileName: r.fileName,
              noteId: r.noteId,
              noteTitle: r.noteTitle || 'Untitled',
              score: Math.max(0, 1 - r.distance),
            }))

          // Same return shape whether or not something matched
          return results.length
            ? { results }
            : { results: [], message: 'No matching documents found' }
        } catch (e) {
          const message = e instanceof Error ? e.message : String(e)
          return { error: `Document search failed: ${message}` }
        }
      },
    }),
})
```

### C2. Self-registration

Addition to `lib/ai/tools/index.ts` (the import path matches the file name `document-search.tool.ts`):

```typescript
import './document-search.tool'
```

### C3. Activation in the Chat

Update to `registry.ts`, method `buildToolsForChat`:

```typescript
buildToolsForChat(ctx: ToolContext): Tool[] {
  const tools: Tool[] = []
  tools.push(this.build('note_search', ctx))
  tools.push(this.build('note_read', ctx))
  tools.push(this.build('document_search', ctx)) // <-- NEW
  if (ctx.webSearch) {
    tools.push(this.build('web_search', ctx))
    tools.push(this.build('web_scrape', ctx))
  }
  return tools
}
```

---

## D. RAG Query Logic

### D1. Extended hybrid search: `semantic-search.service.ts`

Adds a `searchWithDocuments` method that combines notes AND document chunks:

```typescript
async searchWithDocuments(
  userId: string,
  query: string,
  options?: SearchOptions & { noteId?: string; includeDocuments?: boolean }
): Promise<(SearchResult & { source?: 'note' | 'document'; pageNumber?: number; fileName?: string })[]> {
  const includeDocuments = options?.includeDocuments !== false

  // Phase 1: existing note search (FTS + pgvector + RRF)
  const noteResults = await this.searchAsUser(userId, query, options)

  // Phase 2: document search (pgvector only)
  let documentResults: any[] = []
  if (includeDocuments) {
    const queryEmbedding = await embeddingService.generateEmbedding(query)
    const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)

    const params: any[] = [vectorStr, 50, userId]
    let noteFilter = ''
    if (options?.noteId) {
      assertSafeId(options.noteId, 'noteId')
      noteFilter = `AND na."noteId" = $${params.length + 1}`
      params.push(options.noteId)
    }
    if (options?.notebookId) {
      assertSafeId(options.notebookId, 'notebookId')
      noteFilter += ` AND n."notebookId" = $${params.length + 1}`
      params.push(options.notebookId)
    }

    // No ::text cast on the column: <=> needs two vector operands
    documentResults = (await prisma.$queryRawUnsafe(
      `SELECT dc.content, dc."pageNumber", na."fileName", na."noteId",
              n.title AS "noteTitle",
              1 - (dc.embedding <=> $1::vector) AS score
       FROM "DocumentChunk" dc
       JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
       JOIN "Note" n ON n.id = na."noteId"
       WHERE dc.embedding IS NOT NULL
         AND na.status = 'ready'
         AND n."trashedAt" IS NULL
         AND n."userId" = $3
         ${noteFilter}
       ORDER BY dc.embedding <=> $1::vector
       LIMIT $2`,
      ...params
    )) as any[]
  }

  // Phase 3: RRF fusion between notes and documents
  const K = 60
  const fused = new Map<string, any>()

  for (let i = 0; i < noteResults.length; i++) {
    const r = noteResults[i]
    fused.set(r.noteId, {
      ...r,
      source: 'note',
      rrfScore: 1 / (K + i + 1),
    })
  }

  for (let i = 0; i < documentResults.length; i++) {
    const r = documentResults[i]
    const key = `doc_${r.noteId}_${r.pageNumber}_${i}`
    fused.set(key, {
      noteId: r.noteId,
      title: `${r.noteTitle || 'Untitled'} → ${r.fileName} (p.${r.pageNumber})`,
      content: r.content.substring(0, 500),
      score: r.score,
      matchType: 'related',
      source: 'document',
      pageNumber: r.pageNumber,
      fileName: r.fileName,
      rrfScore: 1 / (K + i + 1),
    })
  }

  return Array.from(fused.values())
    .sort((a, b) => b.rrfScore - a.rrfScore)
    .slice(0, options?.limit || 20)
}
```

### D2. Prioritization logic in the Chat RAG

Update to `app/api/chat/route.ts`:

```typescript
// In the chat handler, before injecting the context:

// Shared formatter for both modes
const formatContext = (results: any[]) =>
  results
    .map(r =>
      r.source === 'document'
        ? `[DOCUMENT: ${r.fileName} p.${r.pageNumber}]\n${r.content}`
        : `[NOTE: ${r.title}]\n${r.content}`
    )
    .join('\n\n---\n\n')

let contextNotes = ''

// Does the user mention a document/PDF?
const documentMention = userMessage.match(
  /\b(pdf|document|fichier|pi[eè]ce jointe|attachment|file)\b/i
)
// Does the user refer to the note currently open?
const specificNote = userMessage.match(
  /(?:dans|sur|de|du|la|le) (?:cette note|ce document|cette page)/i
)

if (specificNote && currentNoteId) {
  // TARGETED MODE: search ONLY the documents of this note
  const docResults = await semanticSearchService.searchWithDocuments(
    userId,
    userMessage,
    { noteId: currentNoteId, includeDocuments: true, limit: 5 }
  )
  // searchWithDocuments also returns note results; only passages are kept
  contextNotes = formatContext(docResults.filter(r => r.source === 'document'))
} else {
  // GLOBAL MODE: extended search over notes + documents
  const results = await semanticSearchService.searchWithDocuments(
    userId,
    userMessage,
    { notebookId, includeDocuments: !!documentMention, limit: 10 }
  )
  contextNotes = formatContext(results)
}
```

### D3. Updated system prompt

```typescript
// The prompt is kept in French: it is runtime data, and the D2 regexes
// also target French user messages.
const systemPrompt = `Tu es l'IA Note de Memento, l'assistant intelligent de prise de notes.

CONTEXTES DISPONIBLES :
- [NOTE: titre] → contenu d'une note de l'utilisateur
- [DOCUMENT: fichier.pdf p.X] → passage extrait d'un PDF attaché à une note

RÈGLES POUR LES DOCUMENTS :
- Cite toujours le nom du fichier et le numéro de page quand tu te réfères à un document
- Si l'utilisateur pose une question sur "ce document" ou "le PDF", base ta réponse uniquement sur les passages [DOCUMENT]
- Si les passages sont insuffisants, dis-le clairement plutôt que de deviner
- Pour les tableaux et données chiffrées, reproduis-les fidèlement
...`
```

### D4. SQL: debugging / test query

```sql
-- Test: search within the chunks of ingested documents
SELECT dc.content, dc."pageNumber", dc."chunkIndex",
       na."fileName", n.title AS note_title,
       dc.embedding <=> '[0.01, 0.02, ...]'::vector AS distance
FROM "DocumentChunk" dc
JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
JOIN "Note" n ON n.id = na."noteId"
WHERE na.status = 'ready'
  AND n."trashedAt" IS NULL
ORDER BY dc.embedding <=> '[0.01, 0.02, ...]'::vector
LIMIT 10;
```

---

## Summary of files to create/modify

| Action | File |
|---|---|
| **CREATE** | `prisma/migrations/XXX_add_note_attachment_document_chunk/migration.sql` |
| **MODIFY** | `prisma/schema.prisma`: add NoteAttachment, DocumentChunk, relation on Note |
| **CREATE** | `lib/ai/services/document-extraction.service.ts` |
| **CREATE** | `lib/ai/services/document-chunking.service.ts` |
| **CREATE** | `lib/ai/services/document-ingestion.service.ts` |
| **CREATE** | `lib/ai/tools/document-search.tool.ts` |
| **MODIFY** | `lib/ai/tools/index.ts`: add the document-search import |
| **MODIFY** | `lib/ai/tools/registry.ts`: add document_search to buildToolsForChat |
| **CREATE** | `app/api/notes/[noteId]/attachments/route.ts`: upload |
| **CREATE** | `app/api/notes/[noteId]/attachments/[attachmentId]/route.ts`: GET status, DELETE |
| **MODIFY** | `lib/ai/services/semantic-search.service.ts`: add searchWithDocuments |
| **MODIFY** | `app/api/chat/route.ts`: document context in the RAG |
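
The rank fusion described in D1 can be checked in isolation before touching `semantic-search.service.ts`. The sketch below is a minimal illustration, not the service code: `rrfFuse` and the sample keys are hypothetical names, and it assumes each result list is already sorted best-first. It uses the same formula as D1, `1 / (K + rank + 1)` with `K = 60`, and additionally sums the contributions when a key appears in several lists (in D1 the note and document keys never collide, so the outcome there is the same).

```typescript
// Hypothetical helper, not part of the spec: Reciprocal Rank Fusion over
// several ranked lists of keys (each list sorted best-first).
function rrfFuse(lists: string[][], k = 60): { key: string; rrfScore: number }[] {
  const scores = new Map<string, number>()
  for (const list of lists) {
    list.forEach((key, rank) => {
      // Same formula as D1: 1 / (K + rank + 1), with rank starting at 0
      scores.set(key, (scores.get(key) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .map(([key, rrfScore]) => ({ key, rrfScore }))
    .sort((a, b) => b.rrfScore - a.rrfScore) // stable: ties keep insertion order
}

// Made-up keys: two note results and two document passages
const fused = rrfFuse([
  ['note_a', 'note_b'],
  ['doc_a_p3', 'doc_b_p1'],
])

console.log(fused.map(f => f.key).join(', '))
// note_a, doc_a_p3, note_b, doc_b_p1
```

Because only ranks are used, the best note and the best document passage tie at `1 / 61`, and the two lists interleave regardless of how their raw scores compare. That is worth keeping in mind when choosing `limit` in D2: roughly half of the returned entries come from each source.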