Adversarial Trap Suite

Explore the 4 pillars of our data engine. We don't test whether an agent can summarize; we test whether it survives contact with enterprise reality.

12 Tasks

Multi-Hop Information

Tests whether the agent can plan iteratively across independent queries without hallucinating connections between them. Validates chains of five or more hops.

Example Trap

web_search_0007

Evaluator: llm_as_a_judge

Task Definition JSON

{
  "task_id": "web_search_0007",
  "domain": "information_retrieval",
  "difficulty": "L4",
  "prompt": "Find the father of a person whose spouse created the Turing Award...",
  "expected_tools": ["mcp.google-search", "mcp.wikipedia"],
  "failure_criteria": "Agent hallucinates connection at hop 3 or assumes temporal overlap without verification."
}
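A rule-based pre-check can complement the LLM judge here. The sketch below is illustrative only: the `Hop` trace format and the `first_unsupported_hop` helper are assumptions, not part of the suite. It flags the first hop that has no supporting source or that does not connect to the previous hop.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hop:
    """One link in the agent's reasoning chain (hypothetical trace format)."""
    subject: str
    relation: str
    obj: str
    source: Optional[str]  # tool call or URL backing this link; None if unsupported


def first_unsupported_hop(chain: Sequence[Hop]) -> Optional[int]:
    """Return the 1-based index of the first hop that lacks a source or
    does not start where the previous hop ended; None if the chain holds."""
    previous_obj: Optional[str] = None
    for index, hop in enumerate(chain, start=1):
        if not hop.source:
            return index  # a claim made without evidence
        if previous_obj is not None and hop.subject != previous_obj:
            return index  # the chain jumps to an unrelated entity
        previous_obj = hop.obj
    return None
```

With placeholder entities, a chain whose third hop has `source=None` is reported as failing at hop 3, which mirrors the failure criterion in the task definition.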

8 Tasks

Temporal Logic

Tests whether the model blindly proceeds with scheduling, or flags the conflict and looks for institutional loopholes, when faced with contradictory deadlines.

Example Trap

grant_edge_case_0050

Evaluator: validate_irb_timeline

Task Definition JSON

{
  "task_id": "grant_edge_case_0050",
  "domain": "grant_application",
  "difficulty": "L3",
  "prompt": "Schedule the IRB review. Constraint 1: Deadline is 26 days. Constraint 2: IRB approval takes 42 days minimum.",
  "expected_tools": ["mcp.calendar", "mcp.jira"],
  "failure_criteria": "Agent schedules the event anyway, ignoring the 42-day physical constraint, acting as a blind calendar bot."
}
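The arithmetic behind this trap is simple: 42 days of review cannot fit into a 26-day window, which leaves a 16-day shortfall. The sketch below shows one way such a check might look. It is not the actual `validate_irb_timeline` evaluator; the function names and the two boolean signals taken from the agent's run are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineVerdict:
    feasible: bool
    shortfall_days: int


def check_timeline(deadline_days: int, min_approval_days: int) -> TimelineVerdict:
    """Compare the available window with the minimum approval time."""
    shortfall = max(0, min_approval_days - deadline_days)
    return TimelineVerdict(feasible=shortfall == 0, shortfall_days=shortfall)


def agent_passes(verdict: TimelineVerdict, scheduled: bool, flagged_conflict: bool) -> bool:
    """Hypothetical pass rule: on a feasible timeline the agent should schedule;
    on an infeasible one it must at least flag the conflict."""
    if verdict.feasible:
        return scheduled
    return flagged_conflict


verdict = check_timeline(deadline_days=26, min_approval_days=42)
print(verdict)  # TimelineVerdict(feasible=False, shortfall_days=16)
```

Under this rule, an agent that schedules the review without mentioning the conflict fails, while one that flags the conflict passes whether or not it also proposes a workaround.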

6 Tasks

Inverse Compliance

Tests whether the agent follows a checklist mechanically or identifies hidden exemptions in complex policy documents.

Example Trap

grant_document_gen_0055

Evaluator: validate_edge_case

Task Definition JSON

{
  "task_id": "grant_document_gen_0055",
  "domain": "compliance_generation",
  "difficulty": "L3",
  "prompt": "Generate the required 8 documents for this $45K grant application.",
  "hidden_context": "Applicants under $50K are STRICTLY EXEMPT from submitting a Budget Justification.",
  "expected_tools": ["mcp.notion.create_page"],
  "failure_criteria": "Agent generates the Budget Justification anyway, violating the exemption policy."
}
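The exemption rule in `hidden_context` can be expressed as a small check over the titles of the documents the agent produced. This is a minimal sketch, not the actual `validate_edge_case` evaluator; the function name, the constants, and the example title "Cover Letter" are assumptions for illustration.

```python
from typing import Iterable

# Values taken from the hidden_context of the task definition.
EXEMPTION_THRESHOLD_USD = 50_000
EXEMPT_DOCUMENT = "Budget Justification"


def violates_exemption(grant_amount_usd: int, generated_titles: Iterable[str]) -> bool:
    """Return True if the agent produced the exempt document for a grant
    below the threshold (the failure case described in the task)."""
    if grant_amount_usd >= EXEMPTION_THRESHOLD_USD:
        return False  # exemption does not apply at or above the threshold
    normalized = {title.strip().casefold() for title in generated_titles}
    return EXEMPT_DOCUMENT.casefold() in normalized


print(violates_exemption(45_000, ["Cover Letter", "Budget Justification"]))  # True
```

Matching on exact titles is deliberately strict. A real evaluator would probably also need to catch renamed or embedded budget sections.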

11 Tasks

Chained Execution

Tests strict state management across highly mutable API chains. Validates that IDs generated in Step 1 are used in Step 4.

Example Trap

gitlab_mlops_004

Evaluator: llm_judge

Task Definition JSON

{
  "task_id": "gitlab_mlops_004",
  "domain": "sdlc_automation",
  "difficulty": "L1_L4",
  "steps": [
    "1: Create issue",
    "2: Label issue",
    "3: Assign issue",
    "4: Link issue to MR #1"
  ],
  "expected_tools": ["mcp.gitlab"],
  "failure_criteria": "Agent loses the Issue ID state by step 4 and passes 'null' or a hallucinated ID to the MR linkage."
}
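State loss of this kind can also be detected deterministically from a tool-call trace. The sketch below assumes a hypothetical trace format: an ordered list of calls, each with a `step` number, the `args` sent, and the `result` returned. It checks that the issue ID returned in step 1 is the one passed in steps 2 through 4. None of these field names come from the suite or from the GitLab API.

```python
from typing import Any, Mapping, Sequence


def issue_id_preserved(trace: Sequence[Mapping[str, Any]]) -> bool:
    """Return True only if the ID created in step 1 is reused unchanged
    in steps 2, 3 and 4 of the (hypothetical) tool-call trace."""
    by_step = {call["step"]: call for call in trace}
    created_id = by_step.get(1, {}).get("result", {}).get("issue_id")
    if created_id is None:
        return False  # step 1 never produced an ID
    for step in (2, 3, 4):
        call = by_step.get(step)
        if call is None:
            return False  # a step is missing from the trace
        if call.get("args", {}).get("issue_id") != created_id:
            return False  # null, the string 'null', or an invented ID
    return True
```

A trace where step 4 passes `None`, the string `"null"`, or any ID other than the one from step 1 returns `False`, which corresponds to the failure criterion above.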
