# Technical Specification: Document Parsing & Q&A (PDF Analysis)
## A. Prisma Schema Updates
### A1. `NoteAttachment` model
Stores the files attached to a note (PDFs, images, documents).
```prisma
model NoteAttachment {
  id        String   @id @default(cuid())
  noteId    String
  fileName  String
  fileType  String   // "application/pdf", "image/png", etc.
  fileSize  Int      // in bytes
  filePath  String   // local path: data/uploads/attachments/{noteId}/{uuid}.pdf
  mimeType  String   // duplicates fileType to keep queries fast
  status    String   @default("pending") // pending → processing → ready | failed
  pageCount Int?     // number of pages (PDF only)
  error     String?  // error message when status is "failed"
  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt
  note      Note     @relation(fields: [noteId], references: [id], onDelete: Cascade)
  chunks    DocumentChunk[]

  @@index([noteId])
  @@index([status])
}
```
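The `status` field moves through a small lifecycle. The following is a minimal sketch of a transition guard, assuming that `ready` and `failed` are terminal (the spec does not say whether a failed attachment can be retried). `AttachmentStatus`, `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical names, not part of the schema.

```typescript
// Hypothetical helper: encodes the lifecycle described in the schema comment.
type AttachmentStatus = 'pending' | 'processing' | 'ready' | 'failed'

// Assumption: "ready" and "failed" are terminal states.
const ALLOWED_TRANSITIONS: Record<AttachmentStatus, AttachmentStatus[]> = {
  pending: ['processing'],
  processing: ['ready', 'failed'],
  ready: [],
  failed: [],
}

// Returns true when moving from `from` to `to` follows the lifecycle.
function canTransition(from: AttachmentStatus, to: AttachmentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1
}
```

A guard like this could run before each `noteAttachment.update` that changes `status`, so that a late background job cannot move a `ready` attachment back to `processing`.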
### A2. `DocumentChunk` model
Vectorized fragments of a document. Each chunk belongs to an attachment and, through it, to a note.
```prisma
model DocumentChunk {
  id           String   @id @default(cuid())
  attachmentId String
  content      String   // fragment text (about 800 characters, roughly 200 tokens, see B3)
  chunkIndex   Int      // ordinal position in the document (0, 1, 2…)
  pageNumber   Int?     // source page (for citations)
  startChar    Int?     // start character offset in the extracted text
  endChar      Int?     // end character offset
  metadata     String?  // JSON: { heading, section, tableCaption… }
  embedding    Unsupported("vector(1536)")?
  createdAt    DateTime @default(now())
  attachment   NoteAttachment @relation(fields: [attachmentId], references: [id], onDelete: Cascade)

  @@index([attachmentId])
  @@index([attachmentId, chunkIndex])
}
```
### A3. Addition to `Note`
```prisma
model Note {
  // … existing fields …
  attachments NoteAttachment[]
}
```
### A4. Raw SQL migration: HNSW index for DocumentChunk
```sql
-- Add to the Prisma migration (migration.sql)
CREATE INDEX IF NOT EXISTS "DocumentChunk_embedding_hnsw_idx"
  ON "DocumentChunk" USING hnsw ("embedding" vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```
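The index is built with `vector_cosine_ops`, so it serves queries that order by pgvector's cosine distance operator `<=>`, which is 1 minus the cosine similarity. As an illustration only, here is the same quantity computed in plain TypeScript (`cosineDistance` is a hypothetical helper; the database does this work in practice):

```typescript
// Cosine distance = 1 - (a · b) / (|a| * |b|)
// 0 means same direction, 1 means orthogonal, 2 means opposite.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

This is why the search code later in this spec can turn a distance into a similarity score with `1 - distance`.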
---
## B. Ingestion Pipeline (Chunking & Embeddings)
### B1. Pipeline architecture
```
PDF upload → NoteAttachment (status: pending)
  → pdf-parse extraction (raw text + page metadata)
  → Structural chunking (800 chars, overlap 200, page boundaries respected)
  → DocumentChunk.create (content, chunkIndex, pageNumber, metadata)
  → Batch embeddings (Promise.all in batches of 20)
  → SQL UPDATE of the embedding on each chunk
  → NoteAttachment.update (status: ready)
```
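The embedding step processes chunks in batches of 20. A minimal sketch of the slicing involved, with `toBatches` as a hypothetical helper name (the ingestion service in B4 inlines the same loop):

```typescript
// Splits a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}
```

With 45 chunks and a batch size of 20, this yields three batches of 20, 20 and 5 items, so the embedding provider receives three requests instead of 45.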
### B2. Extraction service: `document-extraction.service.ts`
```typescript
// lib/ai/services/document-extraction.service.ts
import { readFile } from 'fs/promises'
import pdf from 'pdf-parse'

interface ExtractedPage {
  pageNumber: number
  text: string
}
interface ExtractedDocument {
  pages: ExtractedPage[]
  totalPages: number
  metadata: { title?: string; author?: string }
}
export class DocumentExtractionService {
  async extractPdf(filePath: string): Promise<ExtractedDocument> {
    const dataBuffer = await readFile(filePath)
    const pages: ExtractedPage[] = []
    // pdf-parse does not expose pages directly, so a custom page renderer
    // collects the text of each page. The renderer receives a pdf.js page
    // object: its text is read with getTextContent(), not a `text` property.
    const renderPage = async (pageData: any): Promise<string> => {
      const textContent = await pageData.getTextContent()
      let lastY: number | undefined
      let text = ''
      for (const item of textContent.items) {
        // A change of vertical position (transform[5]) starts a new line
        if (lastY !== undefined && lastY !== item.transform[5]) text += '\n'
        text += item.str
        lastY = item.transform[5]
      }
      // Assumes pdf-parse renders pages sequentially, in document order
      pages.push({ pageNumber: pages.length + 1, text })
      return text
    }
    // Single parse; max: 0 means all pages
    const data = await pdf(dataBuffer, { max: 0, pagerender: renderPage })
    return {
      pages,
      totalPages: data.numpages,
      metadata: {
        title: data.info?.Title,
        author: data.info?.Author,
      },
    }
  }
}
export const documentExtractionService = new DocumentExtractionService()
```
### B3. Chunking strategy: `document-chunking.service.ts`
**Principles:**
1. **Target size**: 800 characters (~200 tokens), with a 200-character overlap
2. **Page boundaries**: a chunk's new content NEVER spans two pages. The current chunk is closed at the end of each page; only the overlap prefix may come from the previous page.
3. **Sections**: headings (lines in UPPERCASE or prefixed with `#`, `##`) start a new chunk
4. **Contextual overlap**: the last 200 characters of chunk N are repeated at the start of chunk N+1
5. **Tables**: kept whole in a single chunk if under 1500 chars, otherwise split by row with the header repeated
```typescript
// lib/ai/services/document-chunking.service.ts
interface ChunkInput {
  text: string
  pageNumber: number
}
interface DocumentChunkData {
  content: string
  chunkIndex: number
  pageNumber: number
  startChar: number
  endChar: number
  metadata?: string
}
export class DocumentChunkingService {
  private readonly CHUNK_SIZE = 800
  private readonly OVERLAP = 200
  private readonly MAX_CHUNK_SIZE = 1500 // table threshold (principle 5), not applied below

  chunk(pages: ChunkInput[]): DocumentChunkData[] {
    const chunks: DocumentChunkData[] = []
    let globalIndex = 0
    let overlap = '' // tail of the previous chunk, repeated for context
    for (const page of pages) {
      const text = page.text.trim()
      if (!text) continue
      // Split into sections (by headings or paragraphs)
      const sections = this.splitSections(text)
      let body = '' // new content of the current chunk, overlap excluded
      let bodyStart = 0 // offsets of `body` within this page's trimmed text
      let bodyEnd = 0
      let cursor = 0
      const flush = () => {
        chunks.push({
          content: ((overlap ? overlap + '\n' : '') + body).trim(),
          chunkIndex: globalIndex++,
          pageNumber: page.pageNumber,
          startChar: bodyStart,
          endChar: bodyEnd,
        })
        // Overlap: keep the last OVERLAP chars for the next chunk
        overlap = body.slice(-this.OVERLAP)
        body = ''
      }
      for (const section of sections) {
        // Each section is a contiguous slice of `text`, so indexOf locates it
        const sectionStart = text.indexOf(section, cursor)
        cursor = sectionStart + section.length
        // Flush first when adding this section would exceed the target size
        if (body && overlap.length + body.length + section.length > this.CHUNK_SIZE) flush()
        if (!body) bodyStart = sectionStart
        body += (body ? '\n\n' : '') + section
        bodyEnd = cursor
      }
      // Flush the remainder so that no chunk continues onto the next page
      if (body) flush()
    }
    return chunks
  }

  private splitSections(text: string): string[] {
    const lines = text.split('\n')
    const sections: string[] = []
    let current = ''
    for (const line of lines) {
      // Heading: Markdown-style prefix, or a line made of uppercase letters and spaces
      const isHeading = /^(#{1,6}\s|[A-Z][A-Z\s]{5,}$)/.test(line.trim())
      if (isHeading && current.trim()) {
        sections.push(current.trim())
        current = line
      } else {
        current += (current ? '\n' : '') + line
      }
    }
    if (current.trim()) sections.push(current.trim())
    return sections
  }
}
export const documentChunkingService = new DocumentChunkingService()
```
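The chunking service does not implement principle 5 (tables). The following is one possible sketch of that rule, assuming the table arrives as plain text with one row per line and the header on the first line. `splitTable` is a hypothetical helper and table detection itself is out of scope here.

```typescript
// Hypothetical helper for principle 5: keep small tables whole, split large
// ones by row and repeat the header line at the top of every piece.
// Assumption: the first line of `tableText` is the header row.
function splitTable(tableText: string, maxChars = 1500): string[] {
  if (tableText.length < maxChars) return [tableText]
  const lines = tableText.split('\n')
  const header = lines[0] ?? ''
  const pieces: string[] = []
  let current = header
  let rowsInCurrent = 0
  for (const row of lines.slice(1)) {
    // Close the current piece when the next row would push it past the limit
    if (rowsInCurrent > 0 && current.length + 1 + row.length > maxChars) {
      pieces.push(current)
      current = header
      rowsInCurrent = 0
    }
    current += '\n' + row
    rowsInCurrent++
  }
  if (rowsInCurrent > 0) pieces.push(current)
  return pieces
}
```

Each piece always carries at least one row, so a single row longer than `maxChars` still produces an oversized piece rather than being cut mid-row.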
### B4. Ingestion orchestrator service: `document-ingestion.service.ts`
```typescript
// lib/ai/services/document-ingestion.service.ts
import { documentExtractionService } from './document-extraction.service'
import { documentChunkingService } from './document-chunking.service'
// `prisma` and `embeddingService` are the project's existing singletons

export class DocumentIngestionService {
  async ingest(attachmentId: string): Promise<void> {
    const attachment = await prisma.noteAttachment.findUnique({
      where: { id: attachmentId },
    })
    if (!attachment) throw new Error('Attachment not found')
    await prisma.noteAttachment.update({
      where: { id: attachmentId },
      data: { status: 'processing' },
    })
    try {
      // 1. Extraction
      const extracted = await documentExtractionService.extractPdf(attachment.filePath)
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { pageCount: extracted.totalPages },
      })
      // 2. Chunking
      const chunkInputs = extracted.pages.map(p => ({
        text: p.text,
        pageNumber: p.pageNumber,
      }))
      const chunks = documentChunkingService.chunk(chunkInputs)
      // 3. Create the chunks in the DB (without embeddings)
      const created = await Promise.all(
        chunks.map(c =>
          prisma.documentChunk.create({
            data: {
              attachmentId,
              content: c.content,
              chunkIndex: c.chunkIndex,
              pageNumber: c.pageNumber,
              startChar: c.startChar,
              endChar: c.endChar,
              metadata: c.metadata,
            },
          })
        )
      )
      // 4. Batch embeddings (in batches of 20)
      const BATCH_SIZE = 20
      for (let i = 0; i < created.length; i += BATCH_SIZE) {
        const batch = created.slice(i, i + BATCH_SIZE)
        const texts = batch.map(c => c.content)
        const embeddings = await embeddingService.generateBatchEmbeddings(texts)
        // Prisma cannot write Unsupported("vector") columns, hence the raw UPDATE
        await Promise.all(
          batch.map((chunk, idx) =>
            prisma.$executeRawUnsafe(
              `UPDATE "DocumentChunk" SET embedding = $1::vector WHERE id = $2`,
              embeddingService.toVectorString(embeddings[idx].embedding),
              chunk.id
            )
          )
        )
      }
      // 5. Mark as ready
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'ready' },
      })
    } catch (error: unknown) {
      // Record the failure so the UI can show it, then rethrow
      const message = error instanceof Error ? error.message : String(error)
      await prisma.noteAttachment.update({
        where: { id: attachmentId },
        data: { status: 'failed', error: message },
      })
      throw error
    }
  }
}
export const documentIngestionService = new DocumentIngestionService()
```
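The ingestion code relies on `embeddingService.toVectorString` to turn an embedding into a value that can be cast with `::vector`. Its implementation is not shown in this spec. The following is a hypothetical stand-in, assuming pgvector's text input format (`[v1,v2,...]`) and the 1536 dimensions declared in the schema:

```typescript
// Hypothetical stand-in for embeddingService.toVectorString.
// Produces pgvector's text representation, e.g. "[0.5,-1,2]".
function toVectorString(embedding: number[], dimensions = 1536): string {
  if (embedding.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${embedding.length}`)
  }
  // Reject NaN and Infinity, which would produce an invalid literal
  if (!embedding.every(value => Number.isFinite(value))) {
    throw new Error('Embedding contains a non-finite value')
  }
  return `[${embedding.join(',')}]`
}
```

Checking the dimension before the `UPDATE` turns a model mismatch into a clear error on the application side rather than a failed cast inside the database.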
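### B5. Upload API route
```typescript
// app/api/notes/[noteId]/attachments/route.ts
import fs from 'fs/promises'
import path from 'path'
import { v4 as uuid } from 'uuid'
import { NextResponse } from 'next/server'
// `auth`, `prisma`, `unauthorized`, `error` and `documentIngestionService`
// are the project's existing helpers

export async function POST(req: Request, { params }: { params: Promise<{ noteId: string }> }) {
  const session = await auth()
  if (!session?.user?.id) return unauthorized()
  const { noteId } = await params
  // Make sure the note exists and belongs to the caller
  const note = await prisma.note.findFirst({
    where: { id: noteId, userId: session.user.id },
  })
  if (!note) return error('Note not found')
  const formData = await req.formData()
  const file = formData.get('file')
  // Validation
  if (!(file instanceof File)) return error('Missing file')
  if (file.size > 20 * 1024 * 1024) return error('File too large (max 20MB)')
  if (file.type !== 'application/pdf') return error('Only PDF supported')
  // Save the file
  const dir = `data/uploads/attachments/${noteId}`
  await fs.mkdir(dir, { recursive: true })
  const filePath = path.join(dir, `${uuid()}.pdf`)
  await fs.writeFile(filePath, Buffer.from(await file.arrayBuffer()))
  // Create the attachment
  const attachment = await prisma.noteAttachment.create({
    data: {
      noteId,
      fileName: file.name,
      fileType: file.type,
      fileSize: file.size,
      filePath,
      mimeType: file.type,
      status: 'pending',
    },
  })
  // Start ingestion in the background (setImmediate). ingest() already stores
  // the failure on the attachment; the catch avoids an unhandled rejection.
  setImmediate(() => {
    documentIngestionService.ingest(attachment.id).catch(err => {
      console.error('Document ingestion failed', attachment.id, err)
    })
  })
  return NextResponse.json({ success: true, data: attachment })
}
```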
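The validation rules of the upload route (a file is present, at most 20 MB, PDF only) can be isolated in a pure function so they are easy to unit-test. This is a sketch only; `validateUpload`, `UploadCandidate` and `MAX_UPLOAD_BYTES` are hypothetical names, and the error strings are the ones used in the route.

```typescript
const MAX_UPLOAD_BYTES = 20 * 1024 * 1024 // 20 MB, as in the route

// Minimal shape needed for validation; a browser/Node `File` satisfies it.
interface UploadCandidate {
  size: number
  type: string
}

// Returns an error message, or null when the upload is acceptable.
function validateUpload(file: UploadCandidate | null): string | null {
  if (!file) return 'Missing file'
  if (file.size > MAX_UPLOAD_BYTES) return 'File too large (max 20MB)'
  if (file.type !== 'application/pdf') return 'Only PDF supported'
  return null
}
```

Note that `file.type` is supplied by the client, so this check filters honest mistakes rather than proving the content is a PDF.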
---
## C. New Agent Tool Interface: `document_search`
### C1. Registration in the registry
```typescript
// lib/ai/tools/document-search.tool.ts
import { tool } from 'ai'
import { z } from 'zod'
// `toolRegistry`, `prisma`, `embeddingService` and `assertSafeId` come from the existing codebase

toolRegistry.register({
  name: 'document_search',
  description: 'Search within PDF documents attached to notes. Returns relevant passages with page numbers and source document info.',
  isInternal: true,
  buildTool: (ctx) =>
    tool({
      description: `Search within PDF documents attached to the user's notes.
Returns matching passages with page numbers, chunk content, and the source note/document info.
Use this when the user asks about specific documents, PDFs, or attached files.
Can search across all documents or within a specific note's attachments.`,
      inputSchema: z.object({
        query: z.string().describe('The search query to find relevant passages in documents'),
        noteId: z.string().optional().describe('Optional: restrict search to attachments of a specific note'),
        limit: z.number().optional().describe('Max results to return (default 5)').default(5),
      }),
      execute: async ({ query, noteId, limit = 5 }) => {
        try {
          const queryEmbedding = await embeddingService.generateEmbedding(query)
          const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)
          const userId = ctx.userId
          assertSafeId(userId, 'userId')
          // $1 = query vector, $2 = limit, $3 = user id
          const params: any[] = [vectorStr, limit, userId]
          let noteFilter = ''
          if (noteId) {
            assertSafeId(noteId, 'noteId')
            // Push first so that $N points at the value just added
            params.push(noteId)
            noteFilter = `AND na."noteId" = $${params.length}`
          } else if (ctx.notebookId) {
            assertSafeId(ctx.notebookId, 'notebookId')
            params.push(ctx.notebookId)
            noteFilter = `AND n."notebookId" = $${params.length}`
          }
          // The embedding column is compared as a vector (no ::text cast),
          // which is also what lets the HNSW index be used
          const results = (await prisma.$queryRawUnsafe(
            `SELECT
               dc.id AS "chunkId",
               dc.content,
               dc."pageNumber",
               dc."chunkIndex",
               dc.metadata,
               na.id AS "attachmentId",
               na."fileName",
               na."pageCount",
               na."noteId",
               n.title AS "noteTitle",
               dc.embedding <=> $1::vector AS distance
             FROM "DocumentChunk" dc
             JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
             JOIN "Note" n ON n.id = na."noteId"
             WHERE dc.embedding IS NOT NULL
               AND na.status = 'ready'
               AND n."trashedAt" IS NULL
               AND n."userId" = $3
               ${noteFilter}
             ORDER BY dc.embedding <=> $1::vector
             LIMIT $2`,
            ...params
          )) as any[]
          const threshold = 0.5 // maximum cosine distance kept
          const matches = results
            .filter(r => r.distance < threshold)
            .map(r => ({
              content: r.content.substring(0, 600),
              pageNumber: r.pageNumber,
              chunkIndex: r.chunkIndex,
              fileName: r.fileName,
              noteId: r.noteId,
              noteTitle: r.noteTitle || 'Untitled',
              score: Math.max(0, 1 - r.distance),
            }))
          // Same return shape whether or not something matched
          if (!matches.length) return { results: [], message: 'No matching documents found' }
          return { results: matches }
        } catch (e: any) {
          return { error: `Document search failed: ${e.message}` }
        }
      },
    }),
})
```
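The optional filters in this tool depend on positional placeholders (`$1`, `$2`, ...) staying aligned with the parameter array, which is easy to get wrong when a placeholder is computed before the value is pushed. One way to make that mistake impossible is a small helper that returns the placeholder at the moment a value is added. `SqlParams` is a hypothetical name, not an existing project or Prisma API.

```typescript
// Hypothetical helper: keeps values and their $N placeholders in sync.
class SqlParams {
  private readonly values: unknown[] = []

  // Stores the value and returns the placeholder that refers to it.
  add(value: unknown): string {
    this.values.push(value)
    return `$${this.values.length}`
  }

  // Values in placeholder order, ready to spread after the SQL string.
  toArray(): unknown[] {
    return [...this.values]
  }
}

// Usage sketch: build an optional filter without counting by hand.
function buildNoteFilter(params: SqlParams, noteId?: string): string {
  return noteId ? `AND na."noteId" = ${params.add(noteId)}` : ''
}
```

The values still travel as bound parameters; only the placeholder text is interpolated into the SQL string.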
### C2. Self-registration
Added to `lib/ai/tools/index.ts`:
```typescript
import './document-search.tool'
```
### C3. Enabling the tool in chat
Update `buildToolsForChat` in `registry.ts`:
```typescript
buildToolsForChat(ctx: ToolContext): Tool[] {
  const tools: Tool[] = []
  tools.push(this.build('note_search', ctx))
  tools.push(this.build('note_read', ctx))
  tools.push(this.build('document_search', ctx)) // <-- NEW
  if (ctx.webSearch) {
    tools.push(this.build('web_search', ctx))
    tools.push(this.build('web_scrape', ctx))
  }
  return tools
}
```
---
## D. RAG Query Logic
### D1. Extended hybrid search: `semantic-search.service.ts`
Adds a `searchWithDocuments` method that combines notes AND document chunks:
```typescript
async searchWithDocuments(
  userId: string,
  query: string,
  options?: SearchOptions & { noteId?: string; includeDocuments?: boolean }
): Promise<(SearchResult & { source?: 'note' | 'document'; pageNumber?: number; fileName?: string })[]> {
  const includeDocuments = options?.includeDocuments !== false
  // Phase 1: existing note search (FTS + pgvector + RRF)
  const noteResults = await this.searchAsUser(userId, query, options)
  // Phase 2: document search (pgvector only)
  let documentResults: any[] = []
  if (includeDocuments) {
    const queryEmbedding = await embeddingService.generateEmbedding(query)
    const vectorStr = embeddingService.toVectorString(queryEmbedding.embedding)
    // $1 = query vector, $2 = candidate limit, $3 = user id
    const params: any[] = [vectorStr, 50, userId]
    let noteFilter = ''
    if (options?.noteId) {
      assertSafeId(options.noteId, 'noteId')
      noteFilter = `AND na."noteId" = $${params.length + 1}`
      params.push(options.noteId)
    }
    if (options?.notebookId) {
      assertSafeId(options.notebookId, 'notebookId')
      noteFilter += ` AND n."notebookId" = $${params.length + 1}`
      params.push(options.notebookId)
    }
    // The embedding column is compared as a vector (no ::text cast)
    documentResults = (await prisma.$queryRawUnsafe(
      `SELECT
         dc.content,
         dc."pageNumber",
         na."fileName",
         na."noteId",
         n.title AS "noteTitle",
         1 - (dc.embedding <=> $1::vector) AS score
       FROM "DocumentChunk" dc
       JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
       JOIN "Note" n ON n.id = na."noteId"
       WHERE dc.embedding IS NOT NULL
         AND na.status = 'ready'
         AND n."trashedAt" IS NULL
         AND n."userId" = $3
         ${noteFilter}
       ORDER BY dc.embedding <=> $1::vector
       LIMIT $2`,
      ...params
    )) as any[]
  }
  // Phase 3: RRF fusion of notes and documents (score = 1 / (K + rank))
  const K = 60
  const fused = new Map<string, any>()
  for (let i = 0; i < noteResults.length; i++) {
    const r = noteResults[i]
    fused.set(r.noteId, {
      ...r,
      source: 'note',
      rrfScore: 1 / (K + i + 1),
    })
  }
  for (let i = 0; i < documentResults.length; i++) {
    const r = documentResults[i]
    // One entry per chunk: the key never collides with a note id
    const key = `doc_${r.noteId}_${r.pageNumber}_${i}`
    fused.set(key, {
      noteId: r.noteId,
      title: `${r.noteTitle || 'Untitled'} / ${r.fileName} (p.${r.pageNumber})`,
      content: r.content.substring(0, 500),
      score: r.score,
      matchType: 'related',
      source: 'document',
      pageNumber: r.pageNumber,
      fileName: r.fileName,
      rrfScore: 1 / (K + i + 1),
    })
  }
  return Array.from(fused.values())
    .sort((a, b) => b.rrfScore - a.rrfScore)
    .slice(0, options?.limit || 20)
}
```
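Phase 3 uses Reciprocal Rank Fusion: each result receives `1 / (K + rank)` from the list it appears in, with `K = 60` and 1-based ranks. A self-contained sketch of the general technique follows; `rrfFuse` is a hypothetical helper. Unlike `searchWithDocuments`, where note and document keys never collide, this version adds up the contributions of an item that appears in several lists, which is the usual formulation.

```typescript
// Reciprocal Rank Fusion over any number of ranked lists.
// Each list must be ordered from best to worst.
function rrfFuse<T extends { id: string }>(
  lists: T[][],
  k = 60
): (T & { rrfScore: number })[] {
  const fused = new Map<string, T & { rrfScore: number }>()
  for (const list of lists) {
    list.forEach((item, index) => {
      const contribution = 1 / (k + index + 1) // index 0 is rank 1
      const existing = fused.get(item.id)
      if (existing) {
        existing.rrfScore += contribution
      } else {
        fused.set(item.id, { ...item, rrfScore: contribution })
      }
    })
  }
  // Highest fused score first
  return Array.from(fused.values()).sort((a, b) => b.rrfScore - a.rrfScore)
}
```

Because only ranks are used, the note scores (FTS + vector) and the document scores (vector only) do not need to be on the same scale.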
### D2. Prioritization logic in the RAG chat
Update `app/api/chat/route.ts`:
```typescript
// In the chat handler, before injecting the context:
let contextNotes = ''
// Does the user mention a document/PDF? (French and English keywords)
const documentMention = userMessage.match(
  /\b(pdf|document|fichier|pi[eè]ce jointe|attachment|file)\b/i
)
// Does the user refer to the note or document currently open?
const specificNote = userMessage.match(
  /(?:dans|sur|de|du|la|le) (?:cette note|ce document|cette page)/i
)
// Shared formatting: one labeled block per result
const formatResult = (r: { source?: string; fileName?: string; pageNumber?: number; title: string; content: string }) =>
  r.source === 'document'
    ? `[DOCUMENT: ${r.fileName} p.${r.pageNumber}]\n${r.content}`
    : `[NOTE: ${r.title}]\n${r.content}`
if (specificNote && currentNoteId) {
  // TARGETED MODE: restrict the document search to the current note
  const docResults = await semanticSearchService.searchWithDocuments(
    userId, userMessage, { noteId: currentNoteId, includeDocuments: true, limit: 5 }
  )
  contextNotes = docResults.map(formatResult).join('\n\n---\n\n')
} else {
  // GLOBAL MODE: wider search across notes, plus documents when one is mentioned
  const results = await semanticSearchService.searchWithDocuments(
    userId, userMessage, { notebookId, includeDocuments: !!documentMention, limit: 10 }
  )
  contextNotes = results.map(formatResult).join('\n\n---\n\n')
}
```
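The mode selection above is a pure decision on the message text, so it can be tested without the search service. A sketch, reusing the two regular expressions from the handler; `detectSearchMode` and `SearchMode` are hypothetical names.

```typescript
interface SearchMode {
  mode: 'targeted' | 'global'
  includeDocuments: boolean
}

// Same patterns as in the chat handler
const DOCUMENT_MENTION = /\b(pdf|document|fichier|pi[eè]ce jointe|attachment|file)\b/i
const SPECIFIC_NOTE = /(?:dans|sur|de|du|la|le) (?:cette note|ce document|cette page)/i

// Targeted mode needs both a reference to "this note" and an open note.
function detectSearchMode(message: string, hasCurrentNote: boolean): SearchMode {
  if (SPECIFIC_NOTE.test(message) && hasCurrentNote) {
    return { mode: 'targeted', includeDocuments: true }
  }
  return { mode: 'global', includeDocuments: DOCUMENT_MENTION.test(message) }
}
```

For example, "Résume les points clés dans cette note" selects targeted mode when a note is open, while "What does the PDF say about pricing?" stays global but includes documents. The `SPECIFIC_NOTE` pattern only covers French phrasing, as in the handler.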
### D3. Updated system prompt
```typescript
const systemPrompt = `You are Memento's Note AI, the intelligent note-taking assistant.
AVAILABLE CONTEXTS:
- [NOTE: title] → content of one of the user's notes
- [DOCUMENT: file.pdf p.X] → passage extracted from a PDF attached to a note
RULES FOR DOCUMENTS:
- Always cite the file name and page number when you refer to a document
- If the user asks about "this document" or "the PDF", base your answer only on the [DOCUMENT] passages
- If the passages are insufficient, say so clearly rather than guessing
- For tables and numerical data, reproduce them faithfully
...`
```
### D4. SQL: debugging / test query
```sql
-- Test: nearest chunks across all ready documents
SELECT
  dc.content,
  dc."pageNumber",
  dc."chunkIndex",
  na."fileName",
  n.title AS note_title,
  dc.embedding <=> '[0.01, 0.02, ...]'::vector AS distance
FROM "DocumentChunk" dc
JOIN "NoteAttachment" na ON na.id = dc."attachmentId"
JOIN "Note" n ON n.id = na."noteId"
WHERE na.status = 'ready'
  AND n."trashedAt" IS NULL
ORDER BY dc.embedding <=> '[0.01, 0.02, ...]'::vector
LIMIT 10;
```
---
## Summary of files to create/modify
| Action | File |
|---|---|
| **CREATE** | `prisma/migrations/XXX_add_note_attachment_document_chunk/migration.sql` |
| **MODIFY** | `prisma/schema.prisma`: add NoteAttachment, DocumentChunk, relation on Note |
| **CREATE** | `lib/ai/services/document-extraction.service.ts` |
| **CREATE** | `lib/ai/services/document-chunking.service.ts` |
| **CREATE** | `lib/ai/services/document-ingestion.service.ts` |
| **CREATE** | `lib/ai/tools/document-search.tool.ts` |
| **MODIFY** | `lib/ai/tools/index.ts`: add the document-search import |
| **MODIFY** | `lib/ai/tools/registry.ts`: add document_search to buildToolsForChat |
| **CREATE** | `app/api/notes/[noteId]/attachments/route.ts`: upload |
| **CREATE** | `app/api/notes/[noteId]/attachments/[attachmentId]/route.ts`: GET status, DELETE |
| **MODIFY** | `lib/ai/services/semantic-search.service.ts`: add searchWithDocuments |
| **MODIFY** | `app/api/chat/route.ts`: document context in the RAG |