The article walks through setting up a local Qwen 2.5 LLM on a MacBook as a pairwise judge of search result relevance: given a query and two products from the WANDS furniture dataset, the model picks the more relevant one. Using product names alone, it reaches 75% precision.
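As a rough illustration of what a pairwise judgment involves, the sketch below builds a prompt from a query and two product names and parses a one-word reply. The prompt wording, the `LHS`/`RHS` labels, and the names `Product`, `build_pairwise_prompt`, and `parse_preference` are all assumptions made for this example, not the article's actual code; the call to the local model is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """Hypothetical product record; only the name is used here."""
    name: str


def build_pairwise_prompt(query: str, lhs: Product, rhs: Product) -> str:
    """Build a prompt asking which of two products better matches the query.

    The wording is an assumption for illustration, not the article's prompt.
    """
    return (
        "Which product is more relevant to the search query?\n"
        f"Query: {query}\n"
        f"Product LHS name: {lhs.name}\n"
        f"Product RHS name: {rhs.name}\n"
        "Answer with exactly one word: LHS or RHS."
    )


def parse_preference(reply: str) -> Optional[str]:
    """Map a raw model reply to 'LHS' or 'RHS'; return None if unparseable."""
    token = reply.strip().upper()
    if token.startswith("LHS"):
        return "LHS"
    if token.startswith("RHS"):
        return "RHS"
    return None
```

The prompt string would be sent to whatever local inference setup is in use, and the reply passed to `parse_preference`. Returning `None` for unparseable replies keeps malformed output from being counted as a judgment.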
Allowing the LLM to answer "Neither" when it is not confident raises precision to 85-87% across the name, category, description, and class fields, but drops recall to 10-18%. The setup can evaluate hundreds of pairs per minute, which makes it practical for comparing search algorithms; it supplements human and clickstream labels rather than replacing them.
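The precision/recall trade-off from abstaining can be made concrete with a small sketch. It assumes precision is measured only over pairs where the judge picked a side, and that recall means the share of all pairs where it picked a side at all; the summary does not define either term, so treat both definitions as assumptions. The function name and the toy data are made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def precision_and_recall(
    predictions: Sequence[Optional[str]],
    labels: Sequence[str],
) -> Tuple[float, float]:
    """Score a pairwise judge that may abstain (prediction is None for "Neither").

    Assumed definitions (not confirmed by the source):
      precision = correct decisions / decisions made
      recall    = decisions made / all pairs
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    # Keep only the pairs where the judge committed to an answer.
    decided = [(p, l) for p, l in zip(predictions, labels) if p is not None]
    correct = sum(1 for p, l in decided if p == l)

    precision = correct / len(decided) if decided else 0.0
    recall = len(decided) / len(labels) if labels else 0.0
    return precision, recall


# Toy data, invented for this example: the judge abstains on 3 of 5 pairs.
toy_predictions = ["LHS", None, None, "RHS", None]
toy_labels = ["LHS", "RHS", "LHS", "LHS", "RHS"]
toy_precision, toy_recall = precision_and_recall(toy_predictions, toy_labels)
```

Under these definitions, a judge that abstains often can score high precision on the few pairs it does decide while covering only a small fraction of the data, which matches the pattern the summary reports.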