Hacker News vector search dataset using ClickHouse
Dataset containing 28+ million Hacker News posts and their vector embeddings
The X "For You" feed algorithm combines posts from accounts you follow (in-network via Thunder) and discovered posts (out-of-network via Phoenix retrieval), then ranks them using a Grok-based transformer model that predicts engagement probabilities based on your interaction history.
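As a rough illustration of the merge-then-rank step, the sketch below pools in-network and out-of-network candidates and orders them by a score derived from predicted engagement probabilities. The `Candidate` type, the `WEIGHTS` values, and the weighted-sum scoring rule are assumptions made for this example; the text does not say how the probabilities are combined, and the transformer itself is replaced here by precomputed probabilities.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    post_id: str
    author_id: str
    source: str  # "in_network" or "out_of_network"
    # Predicted engagement probabilities; in the real system these would come
    # from the ranking model, here they are supplied by hand.
    probs: dict = field(default_factory=dict)


# Hypothetical per-action weights, chosen only for illustration.
WEIGHTS = {"like": 1.0, "reply": 2.0, "repost": 1.5}


def score(candidate: Candidate) -> float:
    """Weighted sum of predicted engagement probabilities (assumed rule)."""
    return sum(w * candidate.probs.get(action, 0.0) for action, w in WEIGHTS.items())


def rank(in_network: list, out_of_network: list) -> list:
    """Pool both candidate sources and sort by descending score."""
    return sorted(in_network + out_of_network, key=score, reverse=True)
```

Both sources are scored by the same function, so an out-of-network post can outrank an in-network one whenever its predicted engagement is higher.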
The system uses no hand-engineered features; instead, the transformer infers content relevance from your engagement patterns (likes, replies, reposts). The pipeline filters out duplicates, blocked content, and policy violations, then selects the top-ranked posts while enforcing author diversity so that no single source dominates the feed.
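The filtering and selection stage can be sketched as follows. This is a minimal example under stated assumptions: posts are plain `(post_id, author_id, score)` tuples, blocked content is modeled as a set of blocked author IDs, and author diversity is enforced with a hard per-author cap (`max_per_author`). These names and the cap are hypothetical; the text does not specify how diversity is enforced, and policy-violation checks are omitted.

```python
def select_feed(posts, blocked_authors, k, max_per_author=2):
    """Return up to k post IDs, highest score first.

    posts: iterable of (post_id, author_id, score) tuples.
    Skips duplicate post IDs and blocked authors, and caps the number of
    posts any one author can contribute (an assumed diversity rule).
    """
    feed = []
    if k <= 0:
        return feed

    seen_posts = set()
    per_author = {}

    # Walk candidates from highest to lowest score.
    for post_id, author_id, _score in sorted(posts, key=lambda p: p[2], reverse=True):
        if post_id in seen_posts or author_id in blocked_authors:
            continue
        seen_posts.add(post_id)

        # Author-diversity cap: skip once this author has enough posts.
        if per_author.get(author_id, 0) >= max_per_author:
            continue

        per_author[author_id] = per_author.get(author_id, 0) + 1
        feed.append(post_id)
        if len(feed) == k:
            break
    return feed
```

A hard cap is the simplest way to keep one author from filling the feed; a score penalty that grows with each repeated author would be a softer alternative.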