Adversarial Trap Suite

Explore the four pillars of our data engine. We don't test whether an agent can summarize; we test whether it survives contact with enterprise reality.

12 Tasks

Multi-Hop Information

Tests whether the agent can plan iteratively across independent queries without hallucinating connections between them. Validates chains of five or more hops.

Example Trap
web_search_0007
Evaluator: llm_as_a_judge
Task Definition JSON
{
  "task_id": "web_search_0007",
  "domain": "information_retrieval",
  "difficulty": "L4",
  "prompt": "Find the father of a person whose spouse created the Turing Award...",
  "expected_tools": ["mcp.google-search", "mcp.wikipedia"],
  "failure_criteria": "Agent hallucinates connection at hop 3 or assumes temporal overlap without verification."
}
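
The failure criterion above can be made concrete with a small structural check. The following is a minimal sketch, not the suite's evaluator (which the card lists as an LLM judge): it assumes the agent's answer can be exported as an ordered list of hops, each carrying the evidence it relied on. The `Hop` type, its fields, and `first_unsupported_hop` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    """One link in a reasoning chain (hypothetical structure)."""
    subject: str
    relation: str
    obj: str
    evidence: Optional[str]  # citation or URL backing this hop; None if unverified


def first_unsupported_hop(chain: List[Hop]) -> Optional[int]:
    """Return the 1-based index of the first hop that has no evidence
    or does not start where the previous hop ended; None if the chain holds."""
    for i, hop in enumerate(chain, start=1):
        if not hop.evidence:
            return i  # claim asserted without verification
        if i > 1 and chain[i - 2].obj != hop.subject:
            return i  # chain is disconnected: hop starts from the wrong entity
    return None


# Placeholder entities only; no real-world facts are asserted here.
chain = [
    Hop("award", "created_by", "person_a", "https://example.org/source-1"),
    Hop("person_a", "spouse", "person_b", "https://example.org/source-2"),
    Hop("person_b", "father", "person_c", None),  # hop 3 has no evidence
]
print(first_unsupported_hop(chain))
```

A check like this only catches missing or disconnected evidence; whether the cited evidence actually supports the claim still needs a judge.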
8 Tasks

Temporal Logic

Tests whether the model, when faced with contradictory deadlines, blindly proceeds with scheduling or recognizes the conflict and looks for institutional loopholes.

Example Trap
grant_edge_case_0050
Evaluator: validate_irb_timeline
Task Definition JSON
{
  "task_id": "grant_edge_case_0050",
  "domain": "grant_application",
  "difficulty": "L3",
  "prompt": "Schedule the IRB review. Constraint 1: Deadline is 26 days. Constraint 2: IRB approval takes 42 days minimum.",
  "expected_tools": ["mcp.calendar", "mcp.jira"],
  "failure_criteria": "Agent schedules the event anyway, ignoring the 42-day physical constraint, acting as a blind calendar bot."
}
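
The contradiction in this prompt is plain arithmetic: 42 days of minimum review cannot fit inside a 26-day window. The sketch below shows the kind of feasibility check an agent would be expected to run before touching the calendar. It is illustrative only and is not the `validate_irb_timeline` evaluator; the function name, parameters, and return shape are assumptions.

```python
from datetime import date, timedelta


def check_review_feasibility(today: date, days_until_deadline: int,
                             min_approval_days: int) -> dict:
    """Compare the earliest possible approval date with the deadline.
    Hypothetical helper; the field names are assumptions."""
    deadline = today + timedelta(days=days_until_deadline)
    earliest_approval = today + timedelta(days=min_approval_days)
    shortfall = (earliest_approval - deadline).days
    return {
        "feasible": shortfall <= 0,
        "deadline": deadline,
        "earliest_approval": earliest_approval,
        "shortfall_days": max(shortfall, 0),
    }


result = check_review_feasibility(date(2024, 1, 1),
                                  days_until_deadline=26,
                                  min_approval_days=42)
if not result["feasible"]:
    # Surface the conflict instead of creating the calendar event.
    print(f"Cannot schedule: approval lands {result['shortfall_days']} days "
          f"after the deadline.")
```

With the values from the prompt, the shortfall is 42 - 26 = 16 days, so the expected behavior is to report the conflict rather than book the review.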
6 Tasks

Inverse Compliance

Tests whether the agent follows the checklist blindly or identifies hidden exemptions in complex policy documents.

Example Trap
grant_document_gen_0055
Evaluator: validate_edge_case
Task Definition JSON
{
  "task_id": "grant_document_gen_0055",
  "domain": "compliance_generation",
  "difficulty": "L3",
  "prompt": "Generate the required 8 documents for this $45K grant application.",
  "hidden_context": "Applicants under $50K are STRICTLY EXEMPT from submitting a Budget Justification.",
  "expected_tools": ["mcp.notion.create_page"],
  "failure_criteria": "Agent generates the Budget Justification anyway, violating the exemption policy."
}
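
The trap here is that the prompt asks for 8 documents while the policy exempts one of them for this grant size. A minimal sketch of the exemption logic follows. It assumes the policy reduces to a single dollar threshold and a set of exempt document names; apart from "Budget Justification", the document names are placeholders, and `required_documents` is a hypothetical helper, not the `validate_edge_case` evaluator.

```python
from typing import FrozenSet, List


def required_documents(
    checklist: List[str],
    grant_amount: float,
    exemption_threshold: float = 50_000,
    exempt_docs: FrozenSet[str] = frozenset({"Budget Justification"}),
) -> List[str]:
    """Drop exempt documents when the grant is strictly under the threshold."""
    if grant_amount < exemption_threshold:
        return [doc for doc in checklist if doc not in exempt_docs]
    return list(checklist)


# Placeholder checklist; the real list of 8 documents is not given in the task card.
checklist = ["Cover Letter", "Budget Justification", "Project Narrative"]
print(required_documents(checklist, grant_amount=45_000))
```

For the $45K application in the task, the Budget Justification is filtered out, so an agent that still produces all 8 documents has followed the prompt and broken the policy.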
11 Tasks

Chained Execution

Tests strict state management across highly mutable API chains. Validates that IDs generated in Step 1 are used in Step 4.

Example Trap
gitlab_mlops_004
Evaluator: llm_judge
Task Definition JSON
{
  "task_id": "gitlab_mlops_004",
  "domain": "sdlc_automation",
  "difficulty": "L1_L4",
  "steps": [
    "1: Create issue",
    "2: Label issue",
    "3: Assign issue",
    "4: Link issue to MR #1"
  ],
  "expected_tools": ["mcp.gitlab"],
  "failure_criteria": "Agent loses the Issue ID state by step 4 and passes 'null' or a hallucinated ID to the MR linkage."
}
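
The four steps above can be sketched as a chain that captures the issue ID once and passes the same value to every later call. The `FakeTracker` class below is an in-memory stand-in written for this example; it is not the GitLab API or the `mcp.gitlab` tool, and all method names, the label, and the assignee are assumptions.

```python
from typing import Dict, Optional


class FakeTracker:
    """In-memory stand-in for an issue tracker (hypothetical, not GitLab)."""

    def __init__(self) -> None:
        self._next_id = 101
        self.issues: Dict[int, dict] = {}

    def create_issue(self, title: str) -> int:
        issue_id = self._next_id
        self._next_id += 1
        self.issues[issue_id] = {"title": title, "labels": [],
                                 "assignee": None, "linked_mr": None}
        return issue_id

    def _get(self, issue_id: Optional[int]) -> dict:
        # Reject null or invented IDs instead of silently accepting them.
        if issue_id not in self.issues:
            raise KeyError(f"unknown issue id: {issue_id!r}")
        return self.issues[issue_id]

    def label_issue(self, issue_id: int, label: str) -> None:
        self._get(issue_id)["labels"].append(label)

    def assign_issue(self, issue_id: int, user: str) -> None:
        self._get(issue_id)["assignee"] = user

    def link_issue_to_mr(self, issue_id: int, mr_number: int) -> None:
        self._get(issue_id)["linked_mr"] = mr_number


def run_chain(tracker: FakeTracker) -> int:
    issue_id = tracker.create_issue("Example issue")  # step 1: ID is created here
    tracker.label_issue(issue_id, "mlops")            # step 2
    tracker.assign_issue(issue_id, "alice")           # step 3
    tracker.link_issue_to_mr(issue_id, 1)             # step 4: same ID, MR #1
    return issue_id


tracker = FakeTracker()
issue_id = run_chain(tracker)
print(tracker.issues[issue_id])
```

Keeping the ID in one variable and validating it on every call means a dropped or invented ID fails loudly at the step where it happens, which is the state loss this trap is designed to expose.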