Netflix's Ranker service optimized its serendipity scoring feature, which had consumed 7.5% of CPU, by converting inefficient nested loops into batched matrix multiplication operations.
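To make the loop-to-matrix conversion concrete, here is a minimal Java sketch. It assumes, purely for illustration, that scoring reduces to pairwise dot products between candidate vectors and history vectors; the class name `PairwiseScores` and its methods are hypothetical and are not taken from the Ranker codebase. The batched form does the same arithmetic as the nested loops, but expresses it as one product C = A · Bᵀ over contiguous row-major arrays, which is the shape an optimized kernel can take over.

```java
final class PairwiseScores {
    // Baseline: one dot product per (candidate, history) pair over arrays of arrays.
    static double[][] nestedLoops(double[][] candidates, double[][] history) {
        double[][] out = new double[candidates.length][history.length];
        for (int i = 0; i < candidates.length; i++) {
            for (int j = 0; j < history.length; j++) {
                double dot = 0.0;
                for (int k = 0; k < candidates[i].length; k++) {
                    dot += candidates[i][k] * history[j][k];
                }
                out[i][j] = dot;
            }
        }
        return out;
    }

    // Copies m rows of length d into one row-major array of length m * d.
    static double[] flatten(double[][] rows, int d) {
        double[] flat = new double[rows.length * d];
        for (int i = 0; i < rows.length; i++) {
            System.arraycopy(rows[i], 0, flat, i * d, d);
        }
        return flat;
    }

    // Batched form: C (m x n) = A (m x d) * B^T, where B is (n x d).
    // All three matrices are row-major flat arrays, so the inner loop
    // reads two contiguous runs of memory.
    static double[] batched(double[] a, double[] b, int m, int n, int d) {
        double[] c = new double[m * n];
        for (int i = 0; i < m; i++) {
            int aRow = i * d;
            for (int j = 0; j < n; j++) {
                int bRow = j * d;
                double dot = 0.0;
                for (int k = 0; k < d; k++) {
                    dot += a[aRow + k] * b[bRow + k];
                }
                c[i * n + j] = dot;
            }
        }
        return c;
    }
}
```

Entry `c[i * n + j]` of the batched result corresponds to `out[i][j]` of the nested-loop result. Because the whole request is now a single matrix product, the plain-Java kernel in `batched` could be swapped for a tuned one without changing its callers.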
The team restructured memory layout using flat buffers and ThreadLocal reuse to improve cache locality and reduce garbage collection pressure, then evaluated several compute kernels, including BLAS, to lower CPU usage per request. The optimization shows that an algorithmic improvement yields real performance gains at scale only when implementation details such as memory layout and allocation strategy are handled with care.
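The flat-buffer and ThreadLocal reuse idea can be sketched as follows. This is an illustrative assumption about how such reuse might look, not the team's actual code: the class `ScratchBuffers`, its growth policy, and the method names are all hypothetical. Each thread keeps one growable `double[]` and packs a request's rows into it contiguously, so steady-state requests allocate nothing new.

```java
final class ScratchBuffers {
    // One scratch array per thread, reused across requests (hypothetical helper).
    private static final ThreadLocal<double[]> SCRATCH =
        ThreadLocal.withInitial(() -> new double[0]);

    // Returns this thread's buffer, growing it only when it is too small.
    // Contents are stale from the previous use and must be overwritten.
    static double[] acquire(int minLength) {
        double[] buf = SCRATCH.get();
        if (buf.length < minLength) {
            buf = new double[Math.max(minLength, buf.length * 2)];
            SCRATCH.set(buf);
        }
        return buf;
    }

    // Packs rows of length d into the thread's buffer in row-major order.
    // Only the first rows.length * d entries are valid afterwards.
    static double[] packRows(double[][] rows, int d) {
        double[] buf = acquire(rows.length * d);
        for (int i = 0; i < rows.length; i++) {
            System.arraycopy(rows[i], 0, buf, i * d, d);
        }
        return buf;
    }
}
```

The trade-off in this sketch is that the returned array may be longer than the data it holds and is overwritten by the next call on the same thread, so callers have to carry the valid length alongside the buffer and must not retain it past the request.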