feat: migrate semantic search to pgvector + full-text search
All checks were successful
Deploy to Production / Build and Deploy (push) Successful in 2m12s
Replace JSON-string embeddings with native pgvector(1536) storage and add PostgreSQL full-text search (tsvector/GIN) with Reciprocal Rank Fusion for hybrid keyword + semantic ranking.

Changes:
- NoteEmbedding.embedding: String → vector(1536) via pgvector
- NoteEmbedding: added updatedAt for reindex tracking
- Note: added tsv (tsvector) with auto-update trigger for FTS
- semantic-search.service: hybrid FTS + vector search with RRF fusion
- embedding.service: toVectorString() for pgvector SQL literals
- Removed JS-side cosine similarity loops (now DB-side via <=>)
- Added HNSW index on NoteEmbedding.embedding (cosine distance)
- Added GIN index on Note.tsv for FTS queries

Schema migration in: prisma/migrations/20260512120000_pgvector_and_fts_search/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
133
MIGRATION.md
Normal file
@@ -0,0 +1,133 @@
# Semantic Search Migration to pgvector + Full-Text Search

## Overview

This migration moves the semantic search infrastructure from JSON-string embeddings to **native pgvector** storage and adds **PostgreSQL full-text search (FTS)** with Reciprocal Rank Fusion (RRF) for hybrid ranking.

## What Changed

### Schema Changes

**`NoteEmbedding` table:**
- Changed `embedding String` → `embedding Unsupported("vector(1536)")`, stored as a native pgvector column
- Added `updatedAt` column for tracking reindex freshness

**`Note` table:**
- Added `tsv Unsupported("tsvector")`, auto-updated via trigger for FTS

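
Because Prisma treats `Unsupported("vector(1536)")` columns as opaque, embeddings have to be passed to raw SQL as pgvector text literals. The commit message mentions a `toVectorString()` helper in `embedding.service` for this; its implementation is not shown here, so the following is only a minimal sketch of what such a helper might look like. The `dims` parameter and the validation rules are assumptions.

```typescript
// Hypothetical sketch of a toVectorString() helper. The real helper in
// embedding.service is not shown in this document and may differ.
function toVectorString(values: number[], dims: number = 1536): string {
  // Reject vectors that would not fit the vector(1536) column.
  if (values.length !== dims) {
    throw new Error(`expected ${dims} dimensions, got ${values.length}`);
  }
  // NaN and Infinity cannot be stored in a pgvector column.
  if (!values.every((v) => Number.isFinite(v))) {
    throw new Error("embedding contains a non-finite value");
  }
  // pgvector's text format is a bracketed, comma-separated list.
  return `[${values.join(",")}]`;
}
```

The resulting string would typically be bound as a query parameter and cast with `::vector` rather than concatenated into the SQL text.
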
### Search Architecture

| Before | After |
|--------|-------|
| JS-side cosine similarity loops | DB-side `<=>` (cosine distance) via pgvector |
| Embeddings stored as JSON strings | Native `vector(1536)` pgvector type |
| Pure vector-only search | Hybrid FTS + vector with RRF fusion |
| No full-text capability | `tsvector` + GIN index for keyword matching |

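
Reciprocal Rank Fusion combines the keyword and vector result lists by rank rather than by raw score: each note receives `1 / (k + rank)` from every list it appears in, and the contributions are summed. The sketch below illustrates the idea on plain ID lists. The function name `rrfFuse` and the constant `k = 60` (a commonly used default) are assumptions; the fusion in `semantic-search.service` may be implemented differently, for example directly in SQL.

```typescript
// Minimal RRF sketch. Each input list holds note IDs ordered best-first.
// k = 60 is an assumed constant, not a value taken from the service.
function rrfFuse(lists: string[][], k: number = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: fuse a keyword (FTS) ranking with a vector ranking.
const ftsRanking = ["a", "b", "c"];
const vectorRanking = ["b", "d", "a"];
const fused = rrfFuse([ftsRanking, vectorRanking]);
```

Notes that appear in both lists ("a" and "b" here) outrank notes found by only one method, which is the main reason to fuse by rank: FTS and cosine-distance scores are not on comparable scales.
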
### New Indexes

- `NoteEmbedding_embedding_hnsw_idx`: HNSW index on `embedding` column (cosine distance)
- `Note_tsv_gin_idx`: GIN index on `tsv` column for FTS

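
pgvector's `<=>` operator returns cosine distance, that is, 1 minus cosine similarity, so smaller values mean closer vectors. For spot-checking database results against a known pair of vectors, the same quantity can be computed in TypeScript. This is an illustrative sketch (the function name is hypothetical), not the loop that was removed from the service.

```typescript
// Cosine distance as computed by pgvector's <=> operator:
// 1 - (a . b) / (|a| * |b|). Returns NaN if either vector is all zeros.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions give 0, orthogonal vectors give 1, and opposite directions give 2.
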
## Deployment Steps

### 1. Enable pgvector Extension

pgvector must be enabled before the schema migration runs:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

If deploying via the migration file, this runs automatically as Phase 1.

### 2. Run Database Migration

The migration file (`prisma/migrations/20260512120000_pgvector_and_fts_search/migration.sql`) applies in three phases:

**Phase 1:** Enable pgvector extension
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Phase 2:** Convert NoteEmbedding to native vector
```sql
-- Build the native column alongside the old one, then swap them.
ALTER TABLE "NoteEmbedding" ADD COLUMN "vec" vector(1536);
-- The stored JSON array text is cast to pgvector's text input format.
UPDATE "NoteEmbedding" SET "vec" = ("embedding"::jsonb)::text::vector(1536)
  WHERE "embedding" IS NOT NULL;
ALTER TABLE "NoteEmbedding" DROP COLUMN "embedding";
ALTER TABLE "NoteEmbedding" RENAME COLUMN "vec" TO "embedding";
ALTER TABLE "NoteEmbedding" ADD COLUMN "updatedAt" TIMESTAMP NOT NULL DEFAULT now();
-- HNSW index for cosine-distance (<=>) nearest-neighbor queries.
CREATE INDEX "NoteEmbedding_embedding_hnsw_idx" ON "NoteEmbedding"
  USING hnsw ("embedding" vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

**Phase 3:** Add FTS tsvector to Note
```sql
ALTER TABLE "Note" ADD COLUMN "tsv" tsvector;
-- Backfill existing rows: title gets weight A, content gets weight B.
UPDATE "Note" SET "tsv" =
  setweight(to_tsvector('simple', COALESCE("title", '')), 'A') ||
  setweight(to_tsvector('simple', COALESCE("content", '')), 'B');
CREATE INDEX "Note_tsv_gin_idx" ON "Note" USING gin ("tsv");
-- Keep tsv in sync on insert and whenever title or content changes.
CREATE OR REPLACE FUNCTION "note_tsv_trigger"() RETURNS trigger AS $$
BEGIN
  NEW."tsv" :=
    setweight(to_tsvector('simple', COALESCE(NEW."title", '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(NEW."content", '')), 'B');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER "note_tsv_update"
  BEFORE INSERT OR UPDATE OF "title", "content" ON "Note"
  FOR EACH ROW EXECUTE FUNCTION "note_tsv_trigger"();
```

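
Keyword queries against `tsv` should use the same `'simple'` text search configuration as the trigger, otherwise query terms and indexed terms may be normalized differently. The following sketch shows one way a parameterized FTS query could be built in TypeScript. The helper name `buildFtsQuery`, the `"id"` column on `Note`, and the default limit are assumptions; the query actually used by `semantic-search.service` is not shown in this document.

```typescript
// Hypothetical helper: builds a parameterized keyword query against Note.tsv.
// Assumes the Note table has an "id" column.
function buildFtsQuery(
  term: string,
  limit: number = 20
): { text: string; values: (string | number)[] } {
  const text = `
    SELECT n."id", ts_rank(n."tsv", q) AS rank
    FROM "Note" n, plainto_tsquery('simple', $1) AS q
    WHERE n."tsv" @@ q
    ORDER BY rank DESC
    LIMIT $2`;
  // The search term is passed as a bound parameter, never interpolated.
  return { text, values: [term, limit] };
}
```

The `@@` match can use `Note_tsv_gin_idx`, and `ts_rank` takes the A/B weights assigned to title and content into account.
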
### 3. Regenerate Embeddings for Existing Notes

The migration converts existing `NoteEmbedding` rows from the old JSON strings to the native vector format automatically via the Phase 2 `UPDATE` statement, so no manual conversion is needed.

To reindex all notes programmatically:
```
POST /api/notes/reindex
```

### 4. Verify Deployment

**Validate embeddings:**
```
POST /api/admin/embeddings/validate
```

**Test semantic search** via the note search tool or:
```
POST /api/notes/search?q=<query>
```

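
The two verification requests above can be scripted. The sketch below only builds the request descriptions so that the query string is encoded correctly. The base URL `http://localhost:3000` is an assumption (port 3000 is the `memento-note` port listed in this document), the helper name is hypothetical, and any authentication the admin endpoint may require is not covered.

```typescript
type VerificationCall = { method: "POST"; url: string };

// Hypothetical helper: lists the verification requests from this step.
// baseUrl is an assumption; adjust it to the deployed host.
function verificationCalls(
  query: string,
  baseUrl: string = "http://localhost:3000"
): VerificationCall[] {
  return [
    { method: "POST", url: `${baseUrl}/api/admin/embeddings/validate` },
    // encodeURIComponent keeps spaces and special characters valid in ?q=
    { method: "POST", url: `${baseUrl}/api/notes/search?q=${encodeURIComponent(query)}` },
  ];
}
```

Each entry can then be sent with `fetch(call.url, { method: call.method })` and checked for a successful status code.
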
## Docker Deployment

The `docker-compose.yml` runs PostgreSQL 16-alpine with the following configuration:

- **PostgreSQL port:** 5433 (host) → 5432 (container)
- **Database:** `memento` (user: `memento`, password: `memento` by default)
- **Health check:** `pg_isready`

Services that depend on the database (`memento-note`, `mcp-server`) wait for PostgreSQL to be healthy before starting.

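
With the defaults listed above, a connection string for tools running on the host can be assembled as follows. This is a sketch: the helper and type names are hypothetical, and the hostname `postgres` used in the second example is an assumed Compose service name that does not appear in this document.

```typescript
type DbConfig = {
  user: string;
  password: string;
  host: string;
  port: number;
  database: string;
};

// Defaults from the Docker section; host-side port 5433 maps to 5432 in the container.
const composeDefaults: DbConfig = {
  user: "memento",
  password: "memento",
  host: "localhost",
  port: 5433,
  database: "memento",
};

// Hypothetical helper: builds a PostgreSQL connection URL from the config.
function databaseUrl(overrides: Partial<DbConfig> = {}): string {
  const c = { ...composeDefaults, ...overrides };
  const user = encodeURIComponent(c.user);
  const password = encodeURIComponent(c.password);
  return `postgresql://${user}:${password}@${c.host}:${c.port}/${c.database}`;
}

// From the host machine:
const hostUrl = databaseUrl();
// From another container on the Compose network (service name assumed):
const containerUrl = databaseUrl({ host: "postgres", port: 5432 });
```

Containers on the same Compose network reach the database on the container port 5432, not the host-mapped 5433.
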
## Rollback

To roll back to the pre-migration state:

1. Drop the HNSW index: `DROP INDEX "NoteEmbedding_embedding_hnsw_idx";`
2. Drop the GIN index: `DROP INDEX "Note_tsv_gin_idx";`
3. Drop the trigger: `DROP TRIGGER "note_tsv_update" ON "Note";`
4. Drop the function: `DROP FUNCTION "note_tsv_trigger"();`
5. Revert the schema via `prisma migrate reset` (requires restoring the old `NoteEmbedding.embedding` column type). Because `prisma migrate reset` drops and recreates the database, back up any data that must be kept before running it.

## Affected Services

| Service | Container | Port |
|---------|-----------|------|
| PostgreSQL | `memento-postgres` | 5433 |
| memento-note (Next.js) | `memento-web` | 3000 |
| mcp-server | `memento-mcp` | 3001 |
| Ollama (optional) | `memento-ollama` | 11434 |