# Semantic Search Migration to pgvector + Full-Text Search

## Overview

This migration moves the semantic search infrastructure from JSON-string embeddings to **native pgvector** storage and adds **PostgreSQL full-text search (FTS)**, combined with vector search through Reciprocal Rank Fusion (RRF) for hybrid ranking.

## What Changed

### Schema Changes

**`NoteEmbedding` table:**
- Changed `embedding String` → `embedding Unsupported("vector(1536)")`, stored as a native pgvector column
- Added `updatedAt` column for tracking reindex freshness

**`Note` table:**
- Added `tsv Unsupported("tsvector")`, auto-updated via trigger for FTS

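Because Prisma does not expose `Unsupported` fields through the generated client, writes to the `embedding` column have to go through raw SQL, with the embedding serialized as a pgvector text literal such as `[0.1,0.2,...]`. The following is a minimal sketch of such a serializer. The name `toVectorLiteral` and its validation rules are assumptions made for illustration, not the project's actual helper.

```typescript
// Hypothetical helper (not taken from the codebase): serializes a numeric
// embedding into the text form pgvector accepts, e.g. "[0.1,0.2,0.3]".
const EMBEDDING_DIMENSIONS = 1536; // matches vector(1536) in the schema

function toVectorLiteral(
  values: number[],
  dimensions: number = EMBEDDING_DIMENSIONS,
): string {
  if (values.length !== dimensions) {
    throw new Error(
      `Expected ${dimensions} dimensions, received ${values.length}`,
    );
  }
  for (const value of values) {
    // NaN and Infinity cannot be stored in a vector column.
    if (!Number.isFinite(value)) {
      throw new Error(`Embedding contains a non-finite value: ${value}`);
    }
  }
  return `[${values.join(",")}]`;
}
```

The resulting string would typically be passed as a bound parameter to a raw query and cast with `::vector` on the SQL side, rather than concatenated into the statement.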
### Search Architecture

| Before | After |
|--------|-------|
| JS-side cosine similarity loops | DB-side `<=>` (cosine distance) via pgvector |
| Embeddings stored as JSON strings | Native `vector(1536)` pgvector type |
| Pure vector-only search | Hybrid FTS + vector with RRF fusion |
| No full-text capability | `tsvector` + GIN index for keyword matching |

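To illustrate how RRF merges the keyword and vector result lists, here is a small sketch of the standard formula, in which each document scores the sum of `1 / (k + rank)` over every list it appears in. The function name, the use of note IDs as plain strings, and the constant `k = 60` (a commonly used default) are assumptions. The service may implement the fusion differently, for example directly in SQL.

```typescript
// Sketch of Reciprocal Rank Fusion over ranked lists of note IDs.
// Each input list is ordered best-first. k = 60 is an assumed default,
// not a value taken from the project.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: "b" ranks well in both lists, so it comes out on top.
const ftsResults = ["a", "b", "c"];
const vectorResults = ["b", "d", "a"];
const fused = reciprocalRankFusion([ftsResults, vectorResults]);
```

Because RRF uses only ranks, the FTS relevance scores and the cosine distances never need to be normalized onto a common scale.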
### New Indexes

- `NoteEmbedding_embedding_hnsw_idx`: HNSW index on the `embedding` column (cosine distance)
- `Note_tsv_gin_idx`: GIN index on the `tsv` column for FTS

## Deployment Steps

### 1. Enable pgvector Extension

pgvector must be enabled before the schema migration runs. The extension files must already be installed on the PostgreSQL server, otherwise `CREATE EXTENSION` fails:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

If deploying via the migration file, this runs automatically as Phase 1.

### 2. Run Database Migration

The migration file (`prisma/migrations/20260512120000_pgvector_and_fts_search/migration.sql`) applies in three phases:

**Phase 1:** Enable pgvector extension
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Phase 2:** Convert NoteEmbedding to native vector
```sql
-- Add a temporary native vector column
ALTER TABLE "NoteEmbedding" ADD COLUMN "vec" vector(1536);

-- Convert the existing JSON-string embeddings into vectors
UPDATE "NoteEmbedding" SET "vec" = ("embedding"::jsonb)::text::vector(1536)
  WHERE "embedding" IS NOT NULL;

-- Swap the new column in under the original name
ALTER TABLE "NoteEmbedding" DROP COLUMN "embedding";
ALTER TABLE "NoteEmbedding" RENAME COLUMN "vec" TO "embedding";

ALTER TABLE "NoteEmbedding" ADD COLUMN "updatedAt" TIMESTAMP NOT NULL DEFAULT now();

-- HNSW index for cosine-distance (<=>) queries
CREATE INDEX "NoteEmbedding_embedding_hnsw_idx" ON "NoteEmbedding"
  USING hnsw ("embedding" vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

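The `vector_cosine_ops` index above serves pgvector's `<=>` operator, which returns cosine distance (1 minus cosine similarity), so smaller values mean closer matches. As a reference for spot-checking database results, the sketch below computes the same quantity in TypeScript. The function name is an assumption, and the code is intended only for verification, not as a replacement for the DB-side search.

```typescript
// Reference implementation of cosine distance, the quantity pgvector's
// <=> operator computes. Intended for spot checks only.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Evaluates to NaN when either vector has zero norm (division by zero).
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A distance of 0 means the vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.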
**Phase 3:** Add FTS tsvector to Note
```sql
ALTER TABLE "Note" ADD COLUMN "tsv" tsvector;

-- Backfill existing rows: title gets weight A, content gets weight B
UPDATE "Note" SET "tsv" =
  setweight(to_tsvector('simple', COALESCE("title", '')), 'A') ||
  setweight(to_tsvector('simple', COALESCE("content", '')), 'B');

CREATE INDEX "Note_tsv_gin_idx" ON "Note" USING gin ("tsv");

-- Keep tsv in sync whenever title or content changes
CREATE OR REPLACE FUNCTION "note_tsv_trigger"() RETURNS trigger AS $$
BEGIN
  NEW."tsv" :=
    setweight(to_tsvector('simple', COALESCE(NEW."title", '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(NEW."content", '')), 'B');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER "note_tsv_update"
  BEFORE INSERT OR UPDATE OF "title", "content" ON "Note"
  FOR EACH ROW EXECUTE FUNCTION "note_tsv_trigger"();
```

### 3. Regenerate Embeddings for Existing Notes

The migration converts existing `NoteEmbedding` rows from the old JSON strings to the native vector format automatically, via the `UPDATE` statement in Phase 2, so no manual conversion is required.

To reindex all notes programmatically:
```
POST /api/notes/reindex
```

### 4. Verify Deployment

**Validate embeddings:**
```
POST /api/admin/embeddings/validate
```

**Test semantic search** via the note search tool or:
```
POST /api/notes/search?q=<query>
```

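When scripting the search check, the query has to be URL-encoded before it is placed in the `q` parameter. The sketch below builds the request for the endpoint above. The helper name and the default base URL `http://localhost:3000` (taken from the web container's port) are assumptions, and any authentication the endpoint may require is not shown.

```typescript
// Hypothetical helper: builds the verification request for the search
// endpoint. The base URL is an assumption for a local deployment.
function buildSearchRequest(
  query: string,
  baseUrl: string = "http://localhost:3000",
): { method: "POST"; url: string } {
  const url = new URL("/api/notes/search", baseUrl);
  // searchParams handles encoding of spaces and reserved characters.
  url.searchParams.set("q", query);
  return { method: "POST", url: url.toString() };
}

const request = buildSearchRequest("pgvector hnsw");
```

The result can be passed to `fetch(request.url, { method: request.method })` from a script or test.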
## Docker Deployment

The `docker-compose.yml` runs PostgreSQL 16-alpine with the following configuration:

- **PostgreSQL port:** 5433 (host) → 5432 (container)
- **Database:** `memento` (user: `memento`, password: `memento` by default)
- **Health check:** `pg_isready`

Services that depend on the database (`memento-note`, `mcp-server`) wait for PostgreSQL to be healthy before starting.

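For tools that run on the host, such as a local Prisma CLI, the settings above translate into a connection string. The following sketch assembles one from the documented defaults. The helper name and the `localhost` host are assumptions. Containers on the Compose network would connect to the database service on container port 5432 instead.

```typescript
// Hypothetical helper: assembles a PostgreSQL connection URL from the
// documented defaults, as seen from the host machine.
interface PostgresConfig {
  user: string;
  password: string;
  host: string;
  port: number;
  database: string;
}

const hostDefaults: PostgresConfig = {
  user: "memento",
  password: "memento",
  host: "localhost", // assumption: connecting from the Docker host
  port: 5433, // host-side port mapped to 5432 in the container
  database: "memento",
};

function buildDatabaseUrl(config: PostgresConfig = hostDefaults): string {
  // Credentials are percent-encoded so characters like "@" stay valid.
  const user = encodeURIComponent(config.user);
  const password = encodeURIComponent(config.password);
  return `postgresql://${user}:${password}@${config.host}:${config.port}/${config.database}`;
}
```

The default password should be overridden outside local development.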
## Rollback

To roll back to the pre-migration state:

1. Drop the HNSW index: `DROP INDEX "NoteEmbedding_embedding_hnsw_idx";`
2. Drop the GIN index: `DROP INDEX "Note_tsv_gin_idx";`
3. Drop the trigger: `DROP TRIGGER "note_tsv_update" ON "Note";`
4. Drop the function: `DROP FUNCTION "note_tsv_trigger"();`
5. Revert the schema with `prisma migrate reset` (requires restoring the old `NoteEmbedding.embedding` column type). Note that `prisma migrate reset` drops and recreates the database, so back up any data that must be preserved first.

## Affected Services

| Service | Container | Port |
|---------|-----------|------|
| PostgreSQL | `memento-postgres` | 5433 |
| memento-note (Next.js) | `memento-web` | 3000 |
| mcp-server | `memento-mcp` | 3001 |
| Ollama (optional) | `memento-ollama` | 11434 |