Real-time Streaming Search Engine
December 12, 2025
Perplexity.ai-style search with token-by-token streaming answers. Hybrid search combining keyword, semantic, and BM25 full-text signals, with clickable source citations. Total cost: ~$20 on GCP.
What is it?
A Perplexity.ai-style search engine that streams answers token by token, each with clickable source citations. Hybrid search fuses keyword, vector, and full-text signals via Reciprocal Rank Fusion (RRF). The backend's evolution is documented in five numbered step folders, each one solving a problem the previous approach hit.
The step-by-step build (what actually happened)
Step 1: Ingest Wikipedia dump to Typesense, generate embeddings with SentenceTransformers (all-MiniLM-L6-v2, 384-dim) on GPU.
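Step 1's encode loop can be sketched as below. The helper names and batch size are hypothetical, but the model name and its 384-dim output match the build:

```python
from typing import Iterator

def chunks(items: list, size: int) -> Iterator[list]:
    """Yield fixed-size batches so a large dump streams through the GPU."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_corpus(texts: list, batch_size: int = 256) -> list:
    """Encode all texts with all-MiniLM-L6-v2 (384-dim vectors)."""
    from sentence_transformers import SentenceTransformer  # heavy import, kept lazy
    model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
    vectors = []
    for batch in chunks(texts, batch_size):
        # Normalize so a dot product equals cosine similarity at query time.
        vectors.extend(model.encode(batch, normalize_embeddings=True).tolist())
    return vectors
```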
Step 2 — NOT WORKING: Tried Vertex AI gemini-embedding-001 (3072-dim) with ChromaDB. Hit quota limits on the GPU VM. The folder is literally named step2/not-working-gemini-issue/. Logs document exactly why it failed.
Step 2 — WORKING: Create a GPU VM → run embeddings → save to a GCS bucket → delete the VM → import back. This separates the expensive GPU step from persistent storage, so the VM only lives as long as the batch job.
Step 3: Re-embedding pass to patch gaps from step 2, logged in vector_patcher.log.
Step 4: Rebuild the full Typesense collection as wiki_docs_v2 (384-dim), replacing the earlier wiki_search schema. Extensive debugging subfolder: check_batch.py, check_schema.py, inspect_first_doc.py. Ends with swap_collection.py doing an atomic live cutover.
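The cutover in swap_collection.py hinges on Typesense's alias endpoint; once queries go through an alias, re-pointing it is a single call. A stdlib-only sketch (host, key, and alias name are placeholders; PUT /aliases/&lt;alias&gt; is the real route):

```python
import json
import urllib.request

TYPESENSE = "http://localhost:8108"  # placeholder host

def alias_swap_request(alias: str, collection: str, api_key: str) -> urllib.request.Request:
    """Build the PUT /aliases/<alias> call that re-points the alias atomically."""
    return urllib.request.Request(
        f"{TYPESENSE}/aliases/{alias}",
        data=json.dumps({"collection_name": collection}).encode(),
        method="PUT",
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "application/json"},
    )

# urllib.request.urlopen(alias_swap_request("wiki", "wiki_docs_v2", key))
# flips every reader of the alias to the new schema with zero downtime.
```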
Step 5: Production API with hybrid search, SSE streaming, and SQLite FTS5 deep mode.
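Deep mode leans on SQLite's built-in FTS5 extension with BM25 ranking; a toy sketch with hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("Ada Lovelace", "regarded as the first computer programmer"),
    ("Alan Turing", "computability and the Turing machine"),
])
# bm25() returns lower-is-better relevance, so order ascending.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("computer",),
).fetchall()
```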
Hybrid search implementation
Two Typesense queries fire in parallel: a lexical query (q=query, query_by=title,text) and a vector query (q=*, vector_query=embedding:[...]) that uses the locally computed SentenceTransformer embedding.
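Both legs can ride Typesense's multi_search endpoint in one round-trip. A sketch of the request body (the collection name comes from the build; k and per_page are illustrative):

```python
def hybrid_searches(query: str, embedding: list, k: int = 20) -> dict:
    """Body for POST /multi_search: one lexical leg, one vector leg."""
    vec = ",".join(str(round(x, 6)) for x in embedding)
    return {
        "searches": [
            {  # lexical leg: keyword match over title and text
                "collection": "wiki_docs_v2",
                "q": query,
                "query_by": "title,text",
                "per_page": k,
            },
            {  # vector leg: q=* skips keyword matching, pure nearest-neighbor
                "collection": "wiki_docs_v2",
                "q": "*",
                "vector_query": f"embedding:([{vec}], k:{k})",
                "per_page": k,
            },
        ]
    }
```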
Manual RRF merge with alpha=0.3 (vector weight) and k=60. For each document: score = (1 - alpha) × 1/(lexical_rank + k) + alpha × 1/(vector_rank + k). Documents consistently appearing in both ranked lists surface to the top. The match type is labeled text, vector, or text+vector so users can see which signal found each result.
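The merge described above, as a self-contained function (document ids stand in for full hits):

```python
def rrf_merge(lexical: list, vector: list, alpha: float = 0.3, k: int = 60) -> list:
    """Reciprocal Rank Fusion over two ranked lists of document ids.

    Returns (doc, score, match_type) tuples, best first.
    """
    lex_rank = {doc: r for r, doc in enumerate(lexical, start=1)}
    vec_rank = {doc: r for r, doc in enumerate(vector, start=1)}
    merged = []
    for doc in set(lex_rank) | set(vec_rank):
        score = 0.0
        if doc in lex_rank:
            score += (1 - alpha) / (lex_rank[doc] + k)
        if doc in vec_rank:
            score += alpha / (vec_rank[doc] + k)
        # Label which signal(s) surfaced the document.
        if doc in lex_rank and doc in vec_rank:
            match = "text+vector"
        elif doc in lex_rank:
            match = "text"
        else:
            match = "vector"
        merged.append((doc, score, match))
    return sorted(merged, key=lambda t: t[1], reverse=True)
```

A document in both lists collects both terms, which is why agreement between the signals beats a high rank in either one alone.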
SSE streaming
The answer generation uses sse-starlette. The FastAPI endpoint yields chunks from Gemini 2.5 Flash (primary) or Groq (fallback: llama-3.1-8b-instant, llama-3.3-70b-versatile, with newer gpt-oss and qwen models on Groq also available). Each yielded chunk is a data: {token}\n\n SSE frame.
The browser's EventSource API fires onmessage for each frame. No WebSocket upgrade needed: SSE is a plain long-lived HTTP GET (served over HTTP/2, which also sidesteps the browser's per-origin connection cap). EventSource reconnects automatically on disconnect; the server side uses sse-starlette's built-in disconnect detection to stop generating.
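On the wire, each token becomes a `data:` line terminated by a blank line. sse-starlette does this framing for you; a minimal sketch of the format itself:

```python
from typing import Iterable, Iterator

def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each LLM token in the frame format EventSource parses."""
    for token in tokens:
        yield f"data: {token}\n\n"  # the blank line terminates the event
```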
Vector infrastructure and abandoned approaches
The final stack uses Typesense for both lexical and vector search — one service, one client. Before landing here: ChromaDB was tried and abandoned (quota issue), LanceDB was set up (lancedb_data/wiki.lance/ is still in the repo), Pinecone was used for ImageSearchEngine, Qdrant for video search. The decision to unify on Typesense avoided running a separate vector DB service.
Key takeaways
- Hybrid search: RRF formula with alpha weighting, lexical vs semantic signal tradeoffs
- GPU cost management: create VM for batch embeddings, delete when done, re-import from GCS
- Typesense multi-search API: running lexical and vector queries in one round-trip
- SentenceTransformers all-MiniLM-L6-v2: 384-dim, fast CPU inference, good quality tradeoff
- Atomic collection swap in Typesense: alias pointing, zero-downtime schema migration