Why we don't benchmark

We don’t empirically validate whether XML-tagged prompts outperform pure Markdown. Here’s why: our goal is enduring guidance, meaning advice that remains valid as models improve. The only durable foundation is what frontier labs recommend, which also reflects what they use in training. Validating against today’s SOTA models would mean making decisions about the future based on the past. When Anthropic, OpenAI, or Google update their guidance, we update ours.
