Wilson Lin’s Vector Embedding Search Engine
Web search engine from scratch, built using embeddings
The write-up describes building a web search engine from scratch in two months: a cluster of 200 GPUs generated 3 billion SBERT embeddings, and the index covers 280 million web pages.
Hundreds of crawlers processed up to 50,000 pages per second. Data is stored in distributed RocksDB and a sharded HNSW index spanning 4 TB of RAM and 82 TB of SSD, with query latencies of around 500 milliseconds. The system aggressively preprocesses HTML to extract only semantically meaningful content, stripping irrelevant page elements to improve ranking quality and search relevance.
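To make the HTML preprocessing idea concrete, here is a minimal sketch using only Python's standard library. It is not the project's actual pipeline: the `SKIP_TAGS` list and the `extract_main_text` helper are my own assumptions about what "irrelevant page elements" might mean.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are treated as page chrome, not content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "noscript"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_main_text(html: str) -> str:
    """Hypothetical helper: return the page text with chrome elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_main_text` on a page drops the contents of navigation bars, footers, scripts, and styles, leaving the body text that would then be chunked and embedded. A real pipeline would need far more care (malformed markup, boilerplate detection inside ordinary `div` elements, and so on).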
This was an intriguing find, but I have some concerns about relying solely on vector embeddings. I think a hybrid approach would provide much better results.
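As a rough illustration of what I mean by hybrid, assuming it means merging a keyword ranking (for example from BM25) with an embedding-based ranking, here is a small sketch of reciprocal rank fusion. The function name, the document IDs, and the `k=60` constant (a commonly used default) are my choices for illustration, not anything from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Ties are broken by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword index and a vector index.
keyword_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still makes it into the results. Because the fusion uses ranks rather than raw scores, the keyword and vector scores never have to be put on a common scale.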