APIEval-20
An open benchmark for AI agents that test APIs
APIEval-20 is a black-box benchmark for evaluating API testing agents. Each agent receives only a JSON schema and a single sample payload, and from these it must generate a complete test suite.
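As a rough illustration of what an agent does with that input, the sketch below derives a few negative tests from a schema and one sample payload. It is not taken from APIEval-20: the function `generate_tests`, the `WRONG_TYPE` table, the test-case dictionary layout, and the example payment schema are all hypothetical, and a real agent would need to cover far more than these three mutation kinds.

```python
import copy

# Hypothetical wrong-type substitutes, keyed by JSON Schema type name.
WRONG_TYPE = {
    "string": 12345,
    "integer": "not-a-number",
    "number": "not-a-number",
    "boolean": "maybe",
    "array": {"unexpected": "object"},
    "object": ["unexpected", "array"],
}


def generate_tests(schema: dict, sample: dict) -> list[dict]:
    """Derive simple test cases by mutating one field of the sample at a time."""
    # Start with the unmodified sample, which the API should accept.
    tests = [{"name": "baseline_valid",
              "payload": copy.deepcopy(sample),
              "expect": "accept"}]

    # Mutation 1: drop each required field that the sample contains.
    for field in schema.get("required", []):
        if field in sample:
            mutated = copy.deepcopy(sample)
            del mutated[field]
            tests.append({"name": f"missing_{field}",
                          "payload": mutated,
                          "expect": "reject"})

    for field, spec in schema.get("properties", {}).items():
        ftype = spec.get("type")

        # Mutation 2: replace the value with one of the wrong JSON type.
        if isinstance(ftype, str) and ftype in WRONG_TYPE:
            mutated = copy.deepcopy(sample)
            mutated[field] = WRONG_TYPE[ftype]
            tests.append({"name": f"wrong_type_{field}",
                          "payload": mutated,
                          "expect": "reject"})

        # Mutation 3: go one below a declared numeric minimum.
        if ftype in ("integer", "number") and "minimum" in spec:
            mutated = copy.deepcopy(sample)
            mutated[field] = spec["minimum"] - 1
            tests.append({"name": f"below_min_{field}",
                          "payload": mutated,
                          "expect": "reject"})
    return tests


# Hypothetical input, shaped like "a JSON schema plus one sample payload".
EXAMPLE_SCHEMA = {
    "type": "object",
    "required": ["amount", "currency"],
    "properties": {
        "amount": {"type": "integer", "minimum": 1},
        "currency": {"type": "string"},
    },
}
EXAMPLE_SAMPLE = {"amount": 100, "currency": "USD"}

if __name__ == "__main__":
    for case in generate_tests(EXAMPLE_SCHEMA, EXAMPLE_SAMPLE):
        print(case["name"], case["payload"], case["expect"])
```

Each mutation changes exactly one field, so a failing test points at a single validation rule. That is a design choice of this sketch, not a requirement stated by the benchmark.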
The benchmark then executes the generated tests against live reference APIs that contain intentionally planted bugs. Scoring is objective: it measures bug detection, API coverage, and efficiency rather than relying on subjective model-as-judge assessments, so each planted bug is unambiguously either caught or missed. Tasks span authentication, error handling, pagination, schema adherence, and multi-step workflows. The benchmark is openly available on Hugging Face.
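The text above names three measured quantities but does not give their formulas. The sketch below shows one plausible way such metrics could be computed from per-test outcomes. Everything in it is an assumption made for illustration: the `CaseResult` record, the `score` function, the bug IDs, the endpoint names, and the three formulas themselves (fraction of planted bugs caught, fraction of endpoints exercised, bugs caught per test). The benchmark's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical record of one executed test."""
    name: str
    endpoint: str                     # endpoint the test exercised
    bugs_triggered: frozenset         # IDs of planted bugs this test exposed


def score(results: list, planted_bugs: set, endpoints: set) -> dict:
    """Compute illustrative detection, coverage, and efficiency scores."""
    # A bug counts as caught if at least one test exposed it (binary per bug).
    caught = set()
    for r in results:
        caught |= r.bugs_triggered
    caught &= planted_bugs

    # Only endpoints that belong to the reference API count toward coverage.
    exercised = {r.endpoint for r in results} & endpoints

    return {
        "bug_detection": len(caught) / len(planted_bugs) if planted_bugs else 0.0,
        "coverage": len(exercised) / len(endpoints) if endpoints else 0.0,
        # Assumed efficiency measure: distinct bugs caught per test run.
        "efficiency": len(caught) / len(results) if results else 0.0,
    }


# Hypothetical run: three tests, four planted bugs, three endpoints.
EXAMPLE_RESULTS = [
    CaseResult("t1", "POST /orders", frozenset({"B1"})),
    CaseResult("t2", "POST /orders", frozenset()),
    CaseResult("t3", "GET /orders", frozenset({"B1", "B2"})),
]
EXAMPLE_BUGS = {"B1", "B2", "B3", "B4"}
EXAMPLE_ENDPOINTS = {"POST /orders", "GET /orders", "DELETE /orders"}

if __name__ == "__main__":
    print(score(EXAMPLE_RESULTS, EXAMPLE_BUGS, EXAMPLE_ENDPOINTS))
```

Because bug B1 is exposed by two tests but counted once, redundant tests lower the assumed efficiency score without raising detection, which is the kind of trade-off an efficiency metric is meant to capture.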
APIEval-20 evaluates end-to-end agent behavior on a task, not just the underlying model. Agents must reason about the entire API surface, design targeted tests, and uncover realistic bugs without access to source code, documentation, or prior knowledge of the API. This black-box setup reflects a common development scenario in which teams must quickly build meaningful tests from limited context. The benchmark comprises 20 API scenarios across 7 application domains, including e-commerce, payments, and user management, and is designed to cover a broad range of validation patterns, business logic, and security considerations.
The benchmark is intended for developers and researchers building tools that automate API quality assurance. It helps them check whether a testing agent can validate an API rigorously and find the bugs and vulnerabilities it contains.