Momento/docs/spec-document-qa.md

# Technical Specification: Document Parsing & Q&A (PDF Analysis)

## A. Prisma Schema Updates

### A1. `NoteAttachment` Model

Stores the files attached to a note (PDFs, images, documents).

```prisma
model NoteAttachment {
  id          String   @id @default(cuid())
  noteId      String
  fileName    String
  fileType    String              // "application/pdf", "image/png", etc.
  fileSize    Int                 // in bytes
  filePath    String              // local path: data/uploads/attachments/{noteId}/{uuid}.pdf
  mimeType    String              // duplicates fileType, kept for quick queries
  status      String   @default("pending")  // pending → processing → ready | failed
  pageCount   Int?                // page count (PDF only)
  error       String?             // error message when status is "failed"
  createdAt   DateTime @default(now())
  updatedAt   DateTime @updatedAt

  note        Note                @relation(fields: [noteId], references: [id], onDelete: Cascade)
  chunks      DocumentChunk[]

  @@index([noteId])
  @@index([status])
}
```
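
The `status` column implies a small state machine. A minimal sketch of a transition guard, assuming the allowed transitions are the ones listed in the schema comment. `AttachmentStatus`, `TRANSITIONS` and `canTransition` are hypothetical names, and allowing a retry from `failed` is an assumption rather than something this spec states:

```typescript
// Hypothetical helper: the spec only defines the status strings themselves.
type AttachmentStatus = 'pending' | 'processing' | 'ready' | 'failed'

// Allowed next states for each status.
// Assumption: a failed ingestion may be retried (failed -> processing).
const TRANSITIONS: Record<AttachmentStatus, AttachmentStatus[]> = {
  pending: ['processing'],
  processing: ['ready', 'failed'],
  ready: [],
  failed: ['processing'],
}

function canTransition(from: AttachmentStatus, to: AttachmentStatus): boolean {
  return TRANSITIONS[from].includes(to)
}
```

Checking `canTransition(attachment.status, next)` before each status update would keep a late background job from overwriting a `ready` attachment.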

### A2. `DocumentChunk` Model

Vectorized fragments of a document. Each chunk belongs to one attachment and, through it, to a note.

```prisma
model DocumentChunk {
  id           String    @id @default(cuid())
  attachmentId String
  content      String                       // fragment text (~800 characters, about 200 tokens)
  chunkIndex   Int                          // ordinal position in the document (0, 1, 2…)
  pageNumber   Int?                         // source page (for citations)
  startChar    Int?                         // start character offset in the extracted text
  endChar      Int?                         // end character offset
  metadata     String?                      // JSON: { heading, section, tableCaption… }
  embedding    Unsupported("vector(1536)")?
  createdAt    DateTime  @default(now())

  attachment   NoteAttachment @relation(fields: [attachmentId], references: [id], onDelete: Cascade)

  @@index([attachmentId])
  @@index([attachmentId, chunkIndex])
}
```
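
Because `embedding` is declared `Unsupported`, the typed Prisma client cannot write it; values go through raw SQL as a pgvector text literal such as `[0.1,0.2,0.3]`. This spec calls `embeddingService.toVectorString` for that conversion without showing it. A minimal sketch of what such a helper could look like, assuming 1536 dimensions as in the schema (`toVectorLiteral` is a hypothetical name, not the existing implementation):

```typescript
const EMBEDDING_DIMENSIONS = 1536 // matches vector(1536) in the schema

// Hypothetical stand-in for embeddingService.toVectorString.
function toVectorLiteral(values: number[], dimensions: number = EMBEDDING_DIMENSIONS): string {
  if (values.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${values.length}`)
  }
  if (!values.every(v => Number.isFinite(v))) {
    throw new Error('Embedding contains a non-finite value')
  }
  // pgvector accepts a bracketed, comma-separated list cast with ::vector
  return `[${values.join(',')}]`
}
```

Validating the length before the `UPDATE` turns a dimension mismatch into a clear application error instead of a database error in the middle of a batch.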

### A3. Addition to `Note`

```prisma
model Note {
  // … existing fields …

  attachments    NoteAttachment[]
}
```

### A4. Raw SQL Migration: HNSW Index for DocumentChunk

```sql
-- Add to the Prisma migration (migration.sql)
CREATE INDEX IF NOT EXISTS "DocumentChunk_embedding_hnsw_idx"
  ON "DocumentChunk" USING hnsw ("embedding" vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

---

## B. Ingestion Pipeline (Chunking & Embeddings)

### B1. Pipeline Architecture

```
PDF upload → NoteAttachment (status: pending)
     ↓
pdf-parse extraction (raw text + per-page metadata)
     ↓
Structural chunking (800 chars, overlap 200, page boundaries respected)
     ↓
DocumentChunk.create (content, chunkIndex, pageNumber, metadata)
     ↓
Batch embeddings (Promise.all in batches of 20)
     ↓
SQL UPDATE of the embedding on each chunk
     ↓
NoteAttachment.update (status: ready)
```
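
The batching step in the pipeline can be sketched as a small generic helper. `toBatches` is a hypothetical name; the batch size of 20 comes from the diagram above:

```typescript
// Splits a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}

// 45 chunks are processed as three batches: 20, 20 and 5
const sizes = toBatches(Array.from({ length: 45 }, (_, i) => i), 20).map(b => b.length)
```

Batches are processed one after another, so at most 20 embedding requests and 20 `UPDATE` statements are in flight at any time.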

### B2. Extraction Service: `document-extraction.service.ts`

```typescript
// lib/ai/services/document-extraction.service.ts

import { readFile } from 'fs/promises'
import pdf from 'pdf-parse'

interface ExtractedPage {
  pageNumber: number
  text: string
}

interface ExtractedDocument {
  pages: ExtractedPage[]
  totalPages: number
  metadata: { title?: string; author?: string }
}

export class DocumentExtractionService {
  async extractPdf(filePath: string): Promise<ExtractedDocument> {
    const dataBuffer = await readFile(filePath)
    const pages: ExtractedPage[] = []

    // pdf-parse does not return pages separately, so a custom `pagerender`
    // callback collects the text of each page. pdf-parse calls it once per
    // page, in page order, with a pdf.js page object.
    const renderPage = async (pageData: any): Promise<string> => {
      const textContent = await pageData.getTextContent()
      let text = ''
      let lastY: number | undefined
      for (const item of textContent.items) {
        // A change in vertical position (transform[5]) marks a new line
        const y = item.transform[5]
        if (lastY !== undefined && y !== lastY) text += '\n'
        text += item.str
        lastY = y
      }
      pages.push({ pageNumber: pages.length + 1, text })
      return text
    }

    // Single pass over the file; max: 0 parses every page.
    // The cast covers typings that declare `pagerender` as synchronous.
    const data = await pdf(dataBuffer, { max: 0, pagerender: renderPage as any })

    return {
      pages,
      totalPages: data.numpages,
      metadata: {
        title: data.info?.Title,
        author: data.info?.Author,
      },
    }
  }
}

export const documentExtractionService = new DocumentExtractionService()
```

### B3. Chunking Strategy: `document-chunking.service.ts`

**Principles:**

1. **Target size**: 800 characters (~200 tokens), with a 200-character overlap
2. **Page boundaries**: a chunk NEVER spans two pages. Cuts are made within a page, and the last chunk of a page ends at the page boundary even if it is shorter than the target size.
3. **Section boundaries**: headings (lines in ALL CAPS or prefixed with `#`, `##`) start a new chunk
4. **Contextual overlap**: the last 200 characters of chunk N are repeated at the start of chunk N+1 on the same page
5. **Tables**: kept whole in a single chunk if under 1500 characters, otherwise split by row with the header repeated
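
To make the size and overlap arithmetic concrete: with an 800-character target and a 200-character overlap, each new chunk advances by 600 characters. A minimal sketch for plain fixed-size windows on a single page, ignoring the heading and table rules (`windowOffsets` and `CharWindow` are hypothetical names and are not part of the service below):

```typescript
interface CharWindow { start: number; end: number }

// Returns [start, end) character ranges covering a page of `length` characters.
function windowOffsets(length: number, size = 800, overlap = 200): CharWindow[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const stride = size - overlap
  const windows: CharWindow[] = []
  for (let start = 0; start < length; start += stride) {
    const end = Math.min(start + size, length)
    windows.push({ start, end })
    if (end === length) break // the last window reached the end of the page
  }
  return windows
}

// A 2000-character page yields [0, 800), [600, 1400) and [1200, 2000)
const example = windowOffsets(2000)
```

The actual service cuts at section boundaries rather than at fixed offsets, so real chunk sizes vary around the target.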

```typescript
// lib/ai/services/document-chunking.service.ts

interface ChunkInput {
  text: string
  pageNumber: number
}

interface DocumentChunkData {
  content: string
  chunkIndex: number
  pageNumber: number
  startChar: number
  endChar: number
  metadata?: string
}

export class DocumentChunkingService {
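  private readonly CHUNK_SIZE = 800       // target chunk size, in characters
  private readonly OVERLAP = 200          // characters repeated between consecutive chunks
  private readonly MAX_CHUNK_SIZE = 1500  // table limit from principle 5; not yet enforced by chunk()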

  chunk(pages: ChunkInput[]): DocumentChunkData[] {
    const chunks: DocumentChunkData[] = []
    let globalIndex = 0
    let previousTail = ''

    for (const page of pages) {
      const text = page.text.trim()
      if (!text) continue

      // Split the page into sections at headings
      const sections = this.splitSections(text)

      // Start each page with an empty buffer so that no chunk spans two pages
      let buffer = ''
      let bufferStart = 0

      for (const section of sections) {
        if (buffer.length + section.length > this.CHUNK_SIZE && buffer.length > 0) {
          // Flush the buffer as a chunk
          chunks.push({
            content: buffer.trim(),
            chunkIndex: globalIndex++,
            pageNumber: page.pageNumber,
            startChar: bufferStart,
            endChar: bufferStart + buffer.length,
          })
          // Overlap: carry the last OVERLAP chars into the next chunk
          previousTail = buffer.slice(-this.OVERLAP)
          // The next chunk starts where the carried-over tail begins
          // (offsets are approximate: separators are added when sections are joined)
          bufferStart += buffer.length - previousTail.length
          buffer = previousTail + '\n' + section
        } else {
          buffer += (buffer ? '\n\n' : '') + section
        }
      }

      // Flush the remainder of the page
      if (buffer.trim()) {
        chunks.push({
          content: buffer.trim(),
          chunkIndex: globalIndex++,
          pageNumber: page.pageNumber,
          startChar: bufferStart,
          endChar: bufferStart + buffer.length,
        })
        previousTail = buffer.slice(-this.OVERLAP)
      }
    }

    return chunks
  }

  private splitSections(text: string): string[] {
    const lines = text.split('\n')
    const sections: string[] = []
    let current = ''

    for (const line of lines) {
      const isHeading = /^(#{1,6}\s|[A-Z][A-Z\s]{5,}$)/.test(line.trim())
      if (isHeading && current.trim()) {
        sections.push(current.trim())
        current = line
      } else {
        current += (current ? '\n' : '') + line
      }
    }
    if (current.trim()) sections.push(current.trim())
    return sections
  }
}

export const documentChunkingService = new DocumentChunkingService()
```
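
Principle 5 (tables) is not covered by the service above. A minimal sketch of that rule, assuming a table has already been detected and is available as a header line plus data rows. `chunkTable` and its input shape are hypothetical, and table detection itself is out of scope here:

```typescript
const MAX_TABLE_CHARS = 1500 // same limit as MAX_CHUNK_SIZE

function chunkTable(header: string, rows: string[], maxChars = MAX_TABLE_CHARS): string[] {
  const whole = [header, ...rows].join('\n')
  if (whole.length < maxChars) return [whole] // small table: keep it in one chunk

  const chunks: string[] = []
  let current = header
  let rowsInCurrent = 0
  for (const row of rows) {
    // Start a new chunk, with the header repeated, when the next row would overflow
    if (rowsInCurrent > 0 && current.length + 1 + row.length > maxChars) {
      chunks.push(current)
      current = header
      rowsInCurrent = 0
    }
    current += '\n' + row
    rowsInCurrent++
  }
  if (rowsInCurrent > 0) chunks.push(current)
  return chunks
}
```

A single row longer than the limit still ends up in its own oversized chunk; splitting inside a row would break the table structure, so this sketch leaves it whole.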

### B4. Ingestion Orchestrator Service: `document-ingestion.service.ts`

```typescript
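// lib/ai/services/document-ingestion.service.ts

// The prisma and embeddingService import paths are assumptions; adjust them to the project layout.
import { prisma } from '@/lib/prisma'
import { embeddingService } from './embedding.service'
import { documentExtractionService } from './document-extraction.service'
import { documentChunkingService } from './document-chunking.service'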

export class DocumentIngestionService {
  async ingest(attachmentId: string): Promise<void> {
    const attachment = await prisma.noteAttachment.findUnique({
      where: { id: attachmentId },
    })
    if (!attachment) throw new Error('Attachment not found')

    await prisma.noteAttachment.update({
      where: { id: attachmentId },
      data: { status: 'processing' },
    })

    try {
      // 1. Extraction
      const extracted = await documentExtractionService.extractPdf(attachment.filePath)

      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { pageCount: extracted.totalPages },
      })

      // 2. Chunking
      const chunkInputs = extracted.pages.map(p => ({
        text: p.text,
        pageNumber: p.pageNumber,
      }))
      const chunks = documentChunkingService.chunk(chunkInputs)

      // 3. Create the chunks in the database (without embeddings)
      const created = await Promise.all(
        chunks.map(c =>
          prisma.documentChunk.create({
            data: {
              attachmentId,
              content: c.content,
              chunkIndex: c.chunkIndex,
              pageNumber: c.pageNumber,
              startChar: c.startChar,
              endChar: c.endChar,
              metadata: c.metadata,
            },
          })
        )
      )

      // 4. Batch embeddings (in batches of 20)
      const BATCH_SIZE = 20
      for (let i = 0; i < created.length; i += BATCH_SIZE) {
        const batch = created.slice(i, i + BATCH_SIZE)
        const texts = batch.map(c => c.content)
        const embeddings = await embeddingService.generateBatchEmbeddings(texts)

        await Promise.all(
          batch.map((chunk, idx) =>
            prisma.$executeRawUnsafe(
              `UPDATE "DocumentChunk" SET embedding = $1::vector WHERE id = $2`,
              embeddingService.toVectorString(embeddings[idx].embedding),
              chunk.id
            )
          )
        )
      }

      // 5. Mark as ready
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'ready' },
      })
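    } catch (error: unknown) {
      // Record the failure on the attachment, then let the caller see the error
      const message = error instanceof Error ? error.message : String(error)
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'failed', error: message },
      })
      throw error
    }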
  }
}

export const documentIngestionService = new DocumentIngestionService()
```

### B5. Upload API Route

```typescript
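// app/api/notes/[noteId]/attachments/route.ts

import { mkdir, writeFile } from 'fs/promises'
import path from 'path'
import { randomUUID } from 'crypto'
import { NextRequest, NextResponse } from 'next/server'
// Project-local imports (auth, prisma, documentIngestionService) are omitted here.

const MAX_FILE_SIZE = 20 * 1024 * 1024 // 20 MB

const fail = (message: string, status: number) =>
  NextResponse.json({ success: false, error: message }, { status })

export async function POST(
  req: NextRequest,
  { params }: { params: Promise<{ noteId: string }> }
) {
  const session = await auth()
  if (!session?.user?.id) return fail('Unauthorized', 401)

  const { noteId } = await params

  // Make sure the note exists and belongs to the caller before touching the disk
  const note = await prisma.note.findFirst({
    where: { id: noteId, userId: session.user.id },
    select: { id: true },
  })
  if (!note) return fail('Note not found', 404)

  const formData = await req.formData()
  const file = formData.get('file')

  // Validation
  if (!(file instanceof File)) return fail('Missing file', 400)
  if (file.size > MAX_FILE_SIZE) return fail('File too large (max 20MB)', 413)
  if (file.type !== 'application/pdf') return fail('Only PDF supported', 415)

  // Save the file
  const dir = path.join('data/uploads/attachments', note.id)
  await mkdir(dir, { recursive: true })
  const filePath = path.join(dir, `${randomUUID()}.pdf`)
  await writeFile(filePath, Buffer.from(await file.arrayBuffer()))

  // Create the attachment
  const attachment = await prisma.noteAttachment.create({
    data: {
      noteId: note.id,
      fileName: file.name,
      fileType: file.type,
      fileSize: file.size,
      filePath,
      mimeType: file.type,
      status: 'pending',
    },
  })

  // Start ingestion in the background. ingest() rethrows after recording the
  // failure, so the rejection is caught here to avoid an unhandled rejection.
  setImmediate(() => {
    documentIngestionService.ingest(attachment.id).catch(err =>
      console.error('Document ingestion failed', err)
    )
  })

  return NextResponse.json({ success: true, data: attachment })
}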
```
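
The validation rules of the route can be isolated into pure functions, which makes them easy to unit-test. A minimal sketch: `validateUpload`, `looksLikePdf` and `UploadCandidate` are hypothetical names, the 20 MB limit and the PDF-only rule come from the route above, and the empty-file and magic-byte checks are additions that this spec does not require:

```typescript
interface UploadCandidate { name: string; size: number; type: string }

const MAX_UPLOAD_BYTES = 20 * 1024 * 1024

// Returns an error message, or null when the file is acceptable.
function validateUpload(file: UploadCandidate | null): string | null {
  if (!file) return 'Missing file'
  if (file.size === 0) return 'Empty file' // assumption: empty uploads are rejected
  if (file.size > MAX_UPLOAD_BYTES) return 'File too large (max 20MB)'
  if (file.type !== 'application/pdf') return 'Only PDF supported'
  return null
}

// The MIME type is declared by the client; PDF files start with the bytes "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d] // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b)
}
```

Checking the first bytes of the uploaded buffer catches files that are merely labelled `application/pdf` before they reach the extraction step.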

---

## C. New Agent Tool Interface: `document_search`

### C1. Registration in the Registry

```typescript
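// lib/ai/tools/document-search.tool.ts

import { tool } from 'ai'
import { z } from 'zod'
// Project-local imports (toolRegistry, prisma, embeddingService, assertSafeId) are omitted here.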

toolRegistry.register({
  name: 'document_search',
  description: 'Search within PDF documents attached to notes. Returns relevant passages with page numbers and source document info.',
  isInternal: true,
  buildTool: (ctx) =>
    tool({
      description: `Search within PDF documents attached to the user's notes.
Returns matching passages with page numbers, chunk content, and the source note/document info.
Use this when the user asks about specific documents, PDFs, or attached files.
Can search across all documents or within a specific note's attachments.`,
      inputSchema: z.object({
        query: z.string().describe('The search query to find relevant passages in documents'),
        noteId: z.string().optional().describe('Optional: restrict search to attachments of a specific note'),
        limit: z.number().optional().describe('Max results to return (default 5)').default(5),
      }),
      execute: async ({ query, noteId, limit = 5 }) => {
        try {
          const queryEmbedding = await embeddingService.generateEmbedding(query)
          const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)

          let noteFilter = ''
          const params: any[] = [vectorStr, limit]

          // Push each value first, then reference it: `$${params.length}` is
          // then the placeholder of the value that was just added.
          if (noteId) {
            assertSafeId(noteId, 'noteId')
            params.push(noteId)
            noteFilter = `AND na."noteId" = $${params.length}`
          } else if (ctx.notebookId) {
            assertSafeId(ctx.notebookId, 'notebookId')
            params.push(ctx.notebookId)
            noteFilter = `AND n."notebookId" = $${params.length}`
          }

          const userId = ctx.userId
          assertSafeId(userId, 'userId')
          params.push(userId)

          // `<=>` is pgvector's cosine distance operator; both operands must be
          // vectors, so the column is compared directly (no ::text cast).
          const results = await prisma.$queryRawUnsafe(
            `SELECT
              dc.id as "chunkId",
              dc.content,
              dc."pageNumber",
              dc."chunkIndex",
              dc.metadata,
              na.id as "attachmentId",
              na."fileName",
              na."pageCount",
              na."noteId",
              n.title as "noteTitle",
              dc.embedding <=> $1::vector as distance
            FROM "DocumentChunk" dc
            JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
            JOIN "Note" n ON n.id = na."noteId"
            WHERE dc.embedding IS NOT NULL
              AND na.status = 'ready'
              AND n."trashedAt" IS NULL
              AND n."userId" = $${params.length}
              ${noteFilter}
            ORDER BY dc.embedding <=> $1::vector
            LIMIT $2`,
            ...params
          ) as any[]

          const threshold = 0.5 // maximum cosine distance for a result to be kept
          const matches = results
            .filter(r => r.distance < threshold)
            .map(r => ({
              content: r.content.substring(0, 600),
              pageNumber: r.pageNumber,
              chunkIndex: r.chunkIndex,
              fileName: r.fileName,
              noteId: r.noteId,
              noteTitle: r.noteTitle || 'Untitled',
              score: Math.max(0, 1 - r.distance),
            }))

          // Always return the same shape, with or without matches
          if (!matches.length) return { results: [], message: 'No matching documents found' }
          return { results: matches }
        } catch (e: any) {
          return { error: `Document search failed: ${e.message}` }
        }
      },
    }),
})
```
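
Placeholder numbering is easy to get wrong when filters are optional. A minimal sketch of a helper that registers a value and returns its `$n` placeholder in one step, so the index cannot drift from the parameter array (`SqlParams` is a hypothetical helper, not an existing Prisma API):

```typescript
class SqlParams {
  readonly values: unknown[] = []

  // Registers a value and returns the matching positional placeholder.
  add(value: unknown): string {
    this.values.push(value)
    return `$${this.values.length}`
  }
}

const params = new SqlParams()
const vector = params.add('[0.1,0.2]') // first value
const limit = params.add(5)            // second value
const owner = params.add('user_123')   // third value
const where = `n."userId" = ${owner}`
```

The collected `params.values` array would then be spread into `prisma.$queryRawUnsafe(sql, ...params.values)`.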

### C2. Auto-registration

Add to `lib/ai/tools/index.ts`:

```typescript
import './document-search.tool'
```

### C3. Enabling the Tool in Chat

Update `buildToolsForChat` in `registry.ts`:

```typescript
buildToolsForChat(ctx: ToolContext): Tool[] {
  const tools: Tool[] = []
  tools.push(this.build('note_search', ctx))
  tools.push(this.build('note_read', ctx))
  tools.push(this.build('document_search', ctx))  // <-- NEW
  if (ctx.webSearch) {
    tools.push(this.build('web_search', ctx))
    tools.push(this.build('web_scrape', ctx))
  }
  return tools
}
```

---

## D. RAG Query Logic

### D1. Extended Hybrid Search: `semantic-search.service.ts`

Add a `searchWithDocuments` method that combines notes AND document chunks:

```typescript
async searchWithDocuments(
  userId: string,
  query: string,
  options?: SearchOptions & { noteId?: string; includeDocuments?: boolean }
): Promise<(SearchResult & { source?: 'note' | 'document'; pageNumber?: number; fileName?: string })[]> {
  const includeDocuments = options?.includeDocuments !== false

  // Phase 1: existing note search (FTS + pgvector + RRF)
  const noteResults = await this.searchAsUser(userId, query, options)

  // Phase 2: document search (pgvector only)
  let documentResults: any[] = []
  if (includeDocuments) {
    const queryEmbedding = await embeddingService.generateEmbedding(query)
    const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)

    const params: any[] = [vectorStr, 50, userId]
    let noteFilter = ''
    if (options?.noteId) {
      assertSafeId(options.noteId, 'noteId')
      noteFilter = `AND na."noteId" = $${params.length + 1}`
      params.push(options.noteId)
    }
    if (options?.notebookId) {
      assertSafeId(options.notebookId, 'notebookId')
      noteFilter += ` AND n."notebookId" = $${params.length + 1}`
      params.push(options.notebookId)
    }

    // `<=>` compares two vectors, so the column is used directly (no ::text cast)
    documentResults = await prisma.$queryRawUnsafe(
      `SELECT
        dc.content,
        dc."pageNumber",
        na."fileName",
        na."noteId",
        n.title as "noteTitle",
        1 - (dc.embedding <=> $1::vector) as score
      FROM "DocumentChunk" dc
      JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
      JOIN "Note" n ON n.id = na."noteId"
      WHERE dc.embedding IS NOT NULL
        AND na.status = 'ready'
        AND n."trashedAt" IS NULL
        AND n."userId" = $3
        ${noteFilter}
      ORDER BY dc.embedding <=> $1::vector
      LIMIT $2`,
      ...params
    ) as any[]
  }

  // Phase 3: RRF fusion of notes and documents
  const K = 60
  const fused = new Map<string, any>()

  for (let i = 0; i < noteResults.length; i++) {
    const r = noteResults[i]
    fused.set(r.noteId, {
      ...r,
      source: 'note',
      rrfScore: 1 / (K + i + 1),
    })
  }

  for (let i = 0; i < documentResults.length; i++) {
    const r = documentResults[i]
    const key = `doc_${r.noteId}_${r.pageNumber}_${i}`
    fused.set(key, {
      noteId: r.noteId,
      title: `${r.noteTitle || 'Untitled'} → ${r.fileName} (p.${r.pageNumber})`,
      content: r.content.substring(0, 500),
      score: r.score,
      matchType: 'related',
      source: 'document',
      pageNumber: r.pageNumber,
      fileName: r.fileName,
      rrfScore: 1 / (K + i + 1),
    })
  }

  return Array.from(fused.values())
    .sort((a, b) => b.rrfScore - a.rrfScore)
    .slice(0, options?.limit || 20)
}
```
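
Reciprocal Rank Fusion scores each item by `1 / (K + rank)` with 1-based ranks, and sums the contributions when the same key appears in several lists. A minimal standalone sketch with K = 60 as above. `rrfFuse` is a hypothetical helper; the method above never needs to sum, because note keys and document keys do not collide:

```typescript
function rrfFuse(rankedLists: string[][], k = 60): { key: string; score: number }[] {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((key, index) => {
      const rank = index + 1 // ranks are 1-based
      scores.set(key, (scores.get(key) ?? 0) + 1 / (k + rank))
    })
  }
  return Array.from(scores, ([key, score]) => ({ key, score }))
    .sort((a, b) => b.score - a.score)
}

// note_b appears in both lists, so it collects 1/62 + 1/61 and ranks first
const fused = rrfFuse([
  ['note_a', 'note_b'],
  ['note_b', 'doc_1'],
])
```

Because only ranks are used, the fusion does not depend on the note and document scores being on the same scale.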

### D2. Prioritization Logic in the RAG Chat

Update `app/api/chat/route.ts`:

```typescript
// In the chat handler, before injecting the context:

let contextNotes = ''

// Does the user mention a document or PDF?
const documentMention = userMessage.match(
  /\b(pdf|document|fichier|pi[eè]ce jointe|attachment|file)\b/i
)
// Does the user refer to the note or document that is currently open?
const specificNote = userMessage.match(
  /(?:dans|sur|de|du|la|le) (?:cette note|ce document|cette page)/i
)

// Targeted mode needs the current note id; otherwise fall back to global mode
const targeted = Boolean(specificNote && currentNoteId)

// TARGETED MODE: search ONLY the documents attached to the current note.
// GLOBAL MODE: search the notebook; documents are included only when mentioned.
const results = await semanticSearchService.searchWithDocuments(
  userId,
  userMessage,
  targeted
    ? { noteId: currentNoteId, includeDocuments: true, limit: 5 }
    : { notebookId, includeDocuments: !!documentMention, limit: 10 }
)

contextNotes = results.map(r =>
  r.source === 'document'
    ? `[DOCUMENT: ${r.fileName} p.${r.pageNumber}]\n${r.content}`
    : `[NOTE: ${r.title}]\n${r.content}`
).join('\n\n---\n\n')
```
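
The mode selection can be extracted into a pure function so the regular expressions are testable without a chat request. A minimal sketch reusing the two patterns from the handler above, assuming targeted mode requires a current note id (`detectSearchMode` and `SearchMode` are hypothetical names):

```typescript
type SearchMode = 'targeted' | 'global'

const DOCUMENT_MENTION = /\b(pdf|document|fichier|pi[eè]ce jointe|attachment|file)\b/i
const SPECIFIC_NOTE = /(?:dans|sur|de|du|la|le) (?:cette note|ce document|cette page)/i

function detectSearchMode(
  message: string,
  currentNoteId?: string
): { mode: SearchMode; includeDocuments: boolean } {
  // Targeted mode only makes sense when a note is currently open
  const targeted = SPECIFIC_NOTE.test(message) && !!currentNoteId
  return {
    mode: targeted ? 'targeted' : 'global',
    includeDocuments: targeted || DOCUMENT_MENTION.test(message),
  }
}
```

Both patterns match French phrasing (plus a few English keywords), so messages written entirely in another language fall through to global mode without documents.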

### D3. Updated System Prompt

```typescript
const systemPrompt = `You are Memento's Note AI, the intelligent note-taking assistant.

AVAILABLE CONTEXTS:
- [NOTE: title] → content of one of the user's notes
- [DOCUMENT: file.pdf p.X] → passage extracted from a PDF attached to a note

RULES FOR DOCUMENTS:
- Always cite the file name and page number when you refer to a document
- If the user asks about "this document" or "the PDF", base your answer only on the [DOCUMENT] passages
- If the passages are insufficient, say so clearly rather than guessing
- For tables and numerical data, reproduce them faithfully

...`
```

### D4. SQL: Debug / Test Query

```sql
-- Test: search the chunks of all ready documents
-- (filter on na.id to restrict the search to one document)
SELECT
  dc.content,
  dc."pageNumber",
  dc."chunkIndex",
  na."fileName",
  n.title as note_title,
  dc.embedding <=> '[0.01, 0.02, ...]'::vector as distance
FROM "DocumentChunk" dc
JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
JOIN "Note" n ON n.id = na."noteId"
WHERE dc.embedding IS NOT NULL
  AND na.status = 'ready'
  AND n."trashedAt" IS NULL
ORDER BY dc.embedding <=> '[0.01, 0.02, ...]'::vector
LIMIT 10;
```
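
pgvector's `<=>` operator returns the cosine distance, that is `1 - cosine similarity`, so lower values are better and the `score` used elsewhere in this spec is `1 - distance`. A minimal sketch that reproduces the computation in TypeScript, useful for sanity-checking the distances returned by the query above (`cosineDistance` is a hypothetical helper):

```typescript
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

const same = cosineDistance([1, 0], [2, 0])       // same direction: distance 0
const orthogonal = cosineDistance([1, 0], [0, 3]) // orthogonal: distance 1
```

With the 0.5 threshold used by `document_search`, a chunk is kept only when its cosine similarity to the query exceeds 0.5.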

---

## Summary of Files to Create/Modify

| Action | File |
|---|---|
| **CREATE** | `prisma/migrations/XXX_add_note_attachment_document_chunk/migration.sql` |
| **MODIFY** | `prisma/schema.prisma`: add NoteAttachment, DocumentChunk, relation on Note |
| **CREATE** | `lib/ai/services/document-extraction.service.ts` |
| **CREATE** | `lib/ai/services/document-chunking.service.ts` |
| **CREATE** | `lib/ai/services/document-ingestion.service.ts` |
| **CREATE** | `lib/ai/tools/document-search.tool.ts` |
| **MODIFY** | `lib/ai/tools/index.ts`: add the document-search import |
| **MODIFY** | `lib/ai/tools/registry.ts`: add document_search to buildToolsForChat |
| **CREATE** | `app/api/notes/[noteId]/attachments/route.ts`: upload |
| **CREATE** | `app/api/notes/[noteId]/attachments/[attachmentId]/route.ts`: GET status, DELETE |
| **MODIFY** | `lib/ai/services/semantic-search.service.ts`: add searchWithDocuments |
| **MODIFY** | `app/api/chat/route.ts`: document context in the RAG |