Factory Research evaluated context compression for AI agents using probe-based tests, finding that structured summarization preserves more detail during long sessions than OpenAI's or Anthropic's compression methods.
Highlights
Traditional metrics like ROUGE fail to measure functional context preservation.
Probe-based tests verify whether agents can recall specific details after compression.
Structured summarization outperformed OpenAI and Anthropic in debugging tasks.
Optimizing for tokens per task, rather than tokens per request, improves agent productivity.
Testing covered debugging, code review, and ML research scenarios.
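The probe idea above can be sketched in a few lines: plant a specific detail early in a long session, compress the history, then check whether the detail can still be recalled. This is a minimal toy sketch, not Factory's actual harness; `compress` and `ask_agent` are hypothetical stand-ins (a recency-window compressor and a lookup-based agent) chosen only to make the probe mechanic concrete.

```python
# Minimal sketch of a probe-based compression test.
# All names here (compress, ask_agent) are hypothetical stand-ins,
# not Factory's, OpenAI's, or Anthropic's APIs.

def compress(history: list[str], max_items: int = 3) -> list[str]:
    """Toy compressor: keep only the most recent turns
    (a stand-in for a real summarization method)."""
    return history[-max_items:]

def ask_agent(context: list[str], question: str) -> str:
    """Toy agent: answers only from facts that survive in its context."""
    for turn in context:
        if question in turn:
            return turn.split(": ", 1)[1]
    return "unknown"

# A long session with one planted detail (the "probe") early on.
session = [f"turn-{i}: filler" for i in range(10)]
session.insert(1, "db_port: 5432")  # detail the agent must recall later

compressed = compress(session, max_items=3)
recalled = ask_agent(compressed, "db_port")

# Probe score: did the planted detail survive compression?
print("recalled:", recalled)               # "unknown" — the detail was dropped
print("probe passed:", recalled == "5432")
```

Running the probe against the uncompressed session succeeds, while the recency-window compressor fails it, which is exactly the kind of functional loss that surface metrics like ROUGE do not capture.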