Adversarial Trap Suite
Explore the four pillars of our data engine. We don't test whether an agent can summarize; we test whether it survives contact with enterprise reality.
12 Tasks
Multi-Hop Information
Tests whether the agent can plan iteratively across independent queries without hallucinating connections between them. Validates reasoning chains of five or more hops.
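One way to picture the check is a walk over the agent's reasoning chain that flags the first link with no cited source or no connection to the previous link. The sketch below is illustrative only: the `Hop` structure, the `first_unsupported_hop` function, and the placeholder entities are assumptions, not the suite's actual evaluator (which is an LLM judge).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hop:
    """One link in a reasoning chain (hypothetical structure)."""
    subject: str
    relation: str
    obj: str
    evidence: Optional[str] = None  # source the agent cites for this link


def first_unsupported_hop(chain: list[Hop]) -> Optional[int]:
    """Return the 1-based index of the first broken or uncited hop, else None."""
    previous_obj = None
    for index, hop in enumerate(chain, start=1):
        # Each hop must start where the previous hop ended.
        if previous_obj is not None and hop.subject != previous_obj:
            return index
        # An uncited link is treated as a hallucinated connection.
        if not hop.evidence:
            return index
        previous_obj = hop.obj
    return None


# Placeholder entities; the third hop cites no evidence.
chain = [
    Hop("A", "spouse_of", "B", evidence="doc-1"),
    Hop("B", "child_of", "C", evidence="doc-2"),
    Hop("C", "born_in", "D"),
]
print(first_unsupported_hop(chain))
```

A chain in which every hop cites a source and connects to the previous one returns `None`.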
Example Trap
web_search_0007
Evaluator: llm_as_a_judge
Task Definition JSON
{
  "task_id": "web_search_0007",
  "domain": "information_retrieval",
  "difficulty": "L4",
  "prompt": "Find the father of a person whose spouse created the Turing Award...",
  "expected_tools": ["mcp.google-search", "mcp.wikipedia"],
  "failure_criteria": "Agent hallucinates connection at hop 3 or assumes temporal overlap without verification."
}
8 Tasks
Temporal Logic
Tests if the model blindly proceeds with scheduling or finds institutional loopholes when faced with contradictory deadlines.
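The example trap for this pillar pairs a 26-day deadline with a 42-day minimum approval time, so the first thing a careful agent should do is simple arithmetic. The sketch below shows that feasibility check under stated assumptions: `check_timeline` and `TimelineVerdict` are hypothetical names, and the internals of the real `validate_irb_timeline` evaluator are not shown in this page.

```python
from dataclasses import dataclass


@dataclass
class TimelineVerdict:
    feasible: bool
    shortfall_days: int  # days missing if the timeline is infeasible, else 0


def check_timeline(days_until_deadline: int, min_processing_days: int) -> TimelineVerdict:
    """Compare the time available with the minimum time the process needs."""
    shortfall = max(0, min_processing_days - days_until_deadline)
    return TimelineVerdict(feasible=(shortfall == 0), shortfall_days=shortfall)


# Values taken from the example trap: 26 days available, 42 days required.
verdict = check_timeline(days_until_deadline=26, min_processing_days=42)
print(verdict.feasible, verdict.shortfall_days)  # False 16
```

An agent that runs a check like this before calling a calendar tool has the information it needs to flag the conflict instead of scheduling blindly.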
Example Trap
grant_edge_case_0050
Evaluator: validate_irb_timeline
Task Definition JSON
{
  "task_id": "grant_edge_case_0050",
  "domain": "grant_application",
  "difficulty": "L3",
  "prompt": "Schedule the IRB review. Constraint 1: Deadline is 26 days. Constraint 2: IRB approval takes 42 days minimum.",
  "expected_tools": ["mcp.calendar", "mcp.jira"],
  "failure_criteria": "Agent schedules the event anyway, ignoring the 42-day physical constraint, acting as a blind calendar bot."
}
6 Tasks
Inverse Compliance
Tests if the agent acts like a dumb checklist follower or can identify hidden exemptions in complex policy documents.
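The behavior under test can be sketched as a filter: start from the nominal checklist, then drop any document that an exemption rule removes for this application. The code below is a minimal illustration, not the suite's `validate_edge_case` evaluator. The `Exemption` type, the `documents_to_generate` function, and the "Project Narrative" entry are hypothetical; only the Budget Justification rule for requests under $50K comes from the example trap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemption:
    document: str
    below_amount_usd: int  # exemption applies when the request is under this amount


def documents_to_generate(
    required: list[str], amount_usd: int, exemptions: list[Exemption]
) -> list[str]:
    """Return the checklist with exempt documents removed, preserving order."""
    exempt = {e.document for e in exemptions if amount_usd < e.below_amount_usd}
    return [doc for doc in required if doc not in exempt]


# Hypothetical two-item subset of the checklist.
required = ["Project Narrative", "Budget Justification"]
exemptions = [Exemption("Budget Justification", 50_000)]

print(documents_to_generate(required, 45_000, exemptions))  # ['Project Narrative']
```

For a request at or above the threshold, the same call returns the full checklist, so the exemption only changes the output when it actually applies.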
Example Trap
grant_document_gen_0055
Evaluator: validate_edge_case
Task Definition JSON
{
  "task_id": "grant_document_gen_0055",
  "domain": "compliance_generation",
  "difficulty": "L3",
  "prompt": "Generate the required 8 documents for this $45K grant application.",
  "hidden_context": "Applicants under $50K are STRICTLY EXEMPT from submitting a Budget Justification.",
  "expected_tools": ["mcp.notion.create_page"],
  "failure_criteria": "Agent generates the Budget Justification anyway, violating the exemption policy."
}
11 Tasks
Chained Execution
Tests strict state management across highly mutable API chains. Validates that IDs generated in Step 1 are used in Step 4.
Example Trap
gitlab_mlops_004
Evaluator: llm_judge
Task Definition JSON
{
  "task_id": "gitlab_mlops_004",
  "domain": "sdlc_automation",
  "difficulty": "L1_L4",
  "steps": [
    "1: Create issue",
    "2: Label issue",
    "3: Assign issue",
    "4: Link issue to MR #1"
  ],
  "expected_tools": ["mcp.gitlab"],
  "failure_criteria": "Agent loses the Issue ID state by step 4 and passes 'null' or a hallucinated ID to the MR linkage."
}
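The failure criterion above amounts to a consistency check over the agent's tool-call trace: the ID returned by the create step must be the ID passed to every later step. The sketch below assumes a hypothetical trace format (a list of dicts with `returned_issue_id` and `issue_id` keys) and a hypothetical `issue_id_is_consistent` helper. It does not reflect the real GitLab tool payloads or the suite's LLM-based evaluator.

```python
def issue_id_is_consistent(trace: list[dict]) -> bool:
    """Check that every step after creation reuses the ID returned by step 1."""
    if not trace:
        return False
    created_id = trace[0].get("returned_issue_id")
    if created_id is None:
        return False
    # A missing, null, or mismatched ID in any later step counts as lost state.
    return all(call.get("issue_id") == created_id for call in trace[1:])


# Hypothetical traces mirroring the four steps in the task definition.
good_trace = [
    {"step": "create_issue", "returned_issue_id": 101},
    {"step": "label_issue", "issue_id": 101},
    {"step": "assign_issue", "issue_id": 101},
    {"step": "link_issue_to_mr", "issue_id": 101, "mr": 1},
]
bad_trace = good_trace[:3] + [{"step": "link_issue_to_mr", "issue_id": None, "mr": 1}]

print(issue_id_is_consistent(good_trace))  # True
print(issue_id_is_consistent(bad_trace))   # False
```

A deterministic check of this kind could complement an LLM judge, since ID propagation is a yes-or-no property of the trace.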