#1

Search Engine

Late November 2025

Python · FastAPI · Typesense · React

My very first project — a full search engine from the ground up. Keyword indexing, ranking algorithms, and a clean frontend.

What is it?

A Wikipedia RAG search engine — users submit a query, the backend searches ~100K Wikipedia articles stored in Typesense, pulls context, and streams a natural language answer from Gemini 2.5 Flash or Groq Llama 3.3-70B. Two modes: instant keyword search (no AI) and full RAG ask mode.

How it works

FastAPI exposes two endpoints: GET /search returns raw Typesense hits instantly for autocomplete-style search. GET /ask retrieves the top 3 context snippets from Typesense (each truncated to 1200 bytes), builds a RAG prompt, and calls Gemini first, falling back to Groq if Gemini is unavailable. Deployed as a systemd service on a GCP VM — not Cloud Run.
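The split between the two endpoints can be sketched in plain Python (the `typesense_search` helper and `llm_answer` callable here are hypothetical stand-ins; the real app wires Typesense and the LLM clients into FastAPI route handlers):

```python
# Sketch of the /search vs /ask split. typesense_search and llm_answer
# are stand-ins for the real Typesense client and LLM calls.

def typesense_search(query, limit=10):
    """Stand-in for a Typesense query; returns matching docs from a toy corpus."""
    corpus = [
        {"title": "Alan Turing", "content": "Alan Turing was a mathematician."},
        {"title": "Turing machine", "content": "A Turing machine is a model of computation."},
        {"title": "Python", "content": "Python is a programming language."},
    ]
    hits = [d for d in corpus if query.lower() in (d["title"] + d["content"]).lower()]
    return hits[:limit]

def search(q):
    """GET /search: return raw hits instantly, no LLM involved."""
    return typesense_search(q)

def ask(q, llm_answer=lambda prompt: "(LLM answer)"):
    """GET /ask: retrieve top-3 snippets, build a RAG prompt, call the LLM."""
    context = "\n\n".join(d["content"][:1200] for d in typesense_search(q, limit=3))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {q}"
    return llm_answer(prompt)
```

Keeping `/search` LLM-free is what makes it cheap enough to call on every keystroke.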

The dataset: from 1M docs to 100K

The original Wikipedia dump had ~1 million articles. Indexing all of them into Typesense on a standard GCP VM hit memory limits — Typesense couldn't hold the full index in RAM. The solution: rebuild the collection using only the top 100K most-linked articles.

This involved: wiki_indexer.py to ingest, reduce_typesense_export.py and reduce_to_100k.sh to filter, run_typesense_lowmem.sh to start Typesense with constrained memory, and a rebuild from a v1 collection (wiki_search) to a new v2 collection (wiki_docs_v2). The old indexer.log is still in the repo from the original failed full-index run.
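The core of the reduction step ("keep the top 100K most-linked articles") can be sketched as an inlink count over the dump; this is an illustration of the idea, not the actual logic of `reduce_typesense_export.py`, and it assumes each record carries a `links` list of target titles:

```python
# Sketch: count inlinks across the dump, then keep only the n most-linked
# articles. Assumes each doc record has a "links" list of target titles.
from collections import Counter

def top_n_most_linked(docs, n):
    inlinks = Counter()
    for doc in docs:
        for target in doc.get("links", []):
            inlinks[target] += 1
    keep = {title for title, _ in inlinks.most_common(n)}
    # Preserve the original dump order among the survivors
    return [d for d in docs if d["title"] in keep]
```

The same filter run with n=100_000 over the full export shrinks the index enough to fit in VM RAM.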

RAG architecture

RAG (Retrieval-Augmented Generation) grounds the LLM's answer in real documents instead of training data alone. The flow: search Typesense for the query → take top 3 results → inject their content into the prompt as context → ask the LLM to answer using only that context.

Context is limited to 1200 bytes per document (3 docs = ~3600 bytes of context total). This keeps prompts short and cheap while covering the most relevant content. Gemini 2.5 Flash is the primary model — large context window, fast, low cost. Groq (llama-3.3-70b-versatile) is the fallback.
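Truncating by bytes (rather than characters) has one subtlety worth showing: a cut can land mid-way through a multi-byte UTF-8 character, which matters for non-ASCII article text. A minimal sketch of a safe per-document budget, assuming the service truncates on byte length:

```python
# Sketch: cap each context snippet at a byte budget without splitting
# a multi-byte UTF-8 character at the cut point.

def truncate_bytes(text, limit=1200):
    raw = text.encode("utf-8")[:limit]
    # errors="ignore" drops a partial trailing character instead of raising
    return raw.decode("utf-8", errors="ignore")

def build_context(docs, per_doc=1200):
    """Join the top-3 snippets into ~3600 bytes of prompt context."""
    return "\n\n".join(truncate_bytes(d["content"], per_doc) for d in docs[:3])
```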

Why systemd instead of Cloud Run?

The Typesense instance runs on the same GCP VM as the FastAPI app. Cloud Run is stateless, so it can't host a local database, and moving Typesense out to a separate managed service would add both cost and latency. A GCP VM with systemd gives persistent storage, a local Typesense connection, and always-on availability via Restart=always. The tradeoff: manual scaling and no zero-cost idle.
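A representative unit file for this setup (the paths, service names, and port below are illustrative, not taken from the repo):

```ini
# /etc/systemd/system/wiki-api.service  -- illustrative path and name
[Unit]
Description=FastAPI RAG search backend
After=network.target typesense.service
Requires=typesense.service

[Service]
WorkingDirectory=/opt/wiki-search
ExecStart=/opt/wiki-search/.venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=3
EnvironmentFile=/opt/wiki-search/.env

[Install]
WantedBy=multi-user.target
```

`Requires=typesense.service` plus `After=` ensures the search backend only starts once the local Typesense service is up.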

Key takeaways

  • RAG pipeline: how to combine keyword retrieval with LLM generation for grounded answers
  • Typesense collection management: schema definition, memory constraints, rebuilding collections
  • Dataset size vs infrastructure limits — reducing 1M Wikipedia docs to fit in VM RAM
  • Systemd service deployment: unit files, Restart=always, running on GCP VMs
  • Gemini + Groq failover pattern for LLM availability
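The failover pattern in the last bullet boils down to a try/except chain; the provider call functions here are hypothetical stand-ins for the Gemini and Groq SDK calls:

```python
# Sketch: primary/fallback LLM call. call_gemini and call_groq are
# stand-ins for the real SDK calls; a production version would catch
# the providers' specific error types, not bare Exception.

def ask_with_failover(prompt, call_gemini, call_groq):
    try:
        return call_gemini(prompt)
    except Exception:
        return call_groq(prompt)
```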