A/B Test Setup Skills Review: Forcing Statistical Rigor in Experiment Design
bestskills rank team
2026-04-15

A structured teardown of the ab-test-setup skill. We analyze how this skill in the openclaw/hermes agent helps growth teams avoid common testing pitfalls through forced structured hypotheses, a single-variable testing principle, and preset sample sizes.



Skill Quality Assessment Report: ab-test-setup

Assessment Time: 2026-04-15
Assessment Mode: Line-by-line Review

Overall Score

| Dimension | Score | Status |
| --- | --- | --- |
| Standards (20%) | 12/20 | WARN |
| Effectiveness (40%) | 25/40 | WARN |
| Safety (30%) | 30/30 | PASS |
| Conciseness (10%) | 4/10 | FAIL |
| **Total Score** | **71/100** | **Good** |

Grade Scale:

  • 70-89: Good — Usable but has room for targeted improvements.

Strengths

  1. [Effectiveness] The use of progressive disclosure is excellent. The skill separates detailed data tables and templates into independent reference files (e.g., references/sample-size-guide.md), keeping the main document clean. — Reference: For detailed sample size tables and duration calculations: See references/sample-size-guide.md
  2. [Effectiveness] It provides clear initial assessment guidelines and task-specific questions. By requiring the agent to gather context before acting, it prevents blind, low-quality outputs. — Reference: ## Initial Assessment and ## Task-Specific Questions
  3. [Safety] The content is completely safe. There are no risky operations or system-level destructive commands.

Areas for Improvement

  1. [Standards] The YAML frontmatter is incomplete. It’s missing author, license, and metadata.hermes.tags. Also, the skill name doesn’t use the recommended verb-ing format. — Reference: The YAML block at the beginning. Impact: Reduces discoverability and doesn’t follow strict metadata standards.
  2. [Effectiveness] The skill lacks a structured, executable workflow for the agent. It reads more like a wiki article about A/B testing rather than an operational guide. — Reference: The overall document structure. Impact: Agents might output inconsistent formats because they don’t have step-by-step instructions.
  3. [Conciseness] The document wastes tokens on basic concepts. Explaining what an A/B test is or defining statistical significance (p-value < 0.05) is redundant since LLMs already know this. — Reference: Sections like ## Test Types and ## Analyzing Results. Impact: Burns context window space and increases response latency without adding actionable value.

Key Takeaways

  1. Context-First Strategy: The Initial Assessment section explicitly tells the agent to read .agents/product-marketing-context.md before asking questions. — Application: Useful for any skill that relies heavily on project-specific business context.
  2. Structured Question Checklists: Grouping necessary user inputs under a dedicated Task-Specific Questions section is a smart pattern. — Application: Any interactive skill that requires multi-turn dialogue to gather requirements.

Detailed Issue List

[Medium] Standards — Missing Metadata and Naming Convention

  • Location: YAML Frontmatter
  • Description: Missing author, license, and tag fields. The name ab-test-setup is a noun phrase rather than the recommended gerund (verb-ing) form.
  • Recommendation: Add the complete metadata.hermes fields and rename the skill to setting-up-ab-tests or designing-ab-tests.
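As an illustration of this recommendation, the completed frontmatter might look like the sketch below. The exact field layout under metadata.hermes and the license value are assumptions, not taken from the skill itself:

```yaml
---
name: designing-ab-tests
description: Guides growth teams through designing statistically rigorous A/B tests.
author: openclaw            # assumed value; use the actual maintainer
license: MIT                # placeholder; use the project's actual license
metadata:
  hermes:
    tags: [experimentation, growth, statistics]   # illustrative tag values
---
```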

[Medium] Effectiveness — Lacks Structured Workflow

  • Location: Global
  • Description: The text provides principles but no concrete execution steps or output templates.
  • Recommendation: Add a ## Workflow section with an ordered list so the agent knows exactly how to process a request (e.g., 1. Ask questions -> 2. Frame hypothesis -> 3. Calculate sample size -> 4. Output plan).
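The "Calculate sample size" step in that workflow can be sketched as a standard two-proportion power calculation using only the Python standard library. The function name and default parameters below are illustrative, not part of the skill:

```python
import math
from statistics import NormalDist

def required_sample_size(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test of two proportions.

    baseline_rate: control conversion rate, e.g. 0.10
    mde_abs: minimum detectable effect in absolute terms, e.g. 0.02
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * pooled_var / (p2 - p1) ** 2
    return math.ceil(n)

# 10% baseline, +2pp absolute lift: roughly 3.8k users per variant
print(required_sample_size(0.10, 0.02))
```

Embedding a table of precomputed values like this in references/sample-size-guide.md (as the skill already does) is cheaper than having the agent re-derive the math each run.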

[Severe] Conciseness — Heavy on Basic Explanations

  • Location: ## Test Types, ## Sample Size, ## Analyzing Results
  • Description: Spends too much space explaining basic A/B testing concepts and math that the LLM inherently understands.
  • Recommendation: Strip out the textbook definitions. Keep only the specific decision criteria, templates, and guardrails relevant to the task.

Improvement Recommendations (By Priority)

  1. [Required] Remove basic textbook explanations to heavily reduce token consumption.
  2. [Required] Add a concrete Workflow section to turn knowledge into actionable agent steps.
  3. [Recommended] Complete the YAML metadata and standardize the file name.
  4. [Recommended] Provide an explicit output template so the generated test plans have a consistent structure.
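As an illustration of recommendation 4, an output template could look like the sketch below. The section names and fields are assumptions, not drawn from the skill:

```markdown
## Test Plan: <experiment name>
- **Hypothesis:** If we <change>, then <metric> will <direction> because <rationale>.
- **Variable under test:** <single variable changed>
- **Primary metric:** <metric> (current baseline: <value>)
- **Sample size:** <n per variant> (alpha = 0.05, power = 0.8, MDE = <value>)
- **Duration:** <days>, based on <traffic estimate>
- **Decision rule:** Ship if <criteria>; roll back if <criteria>.
```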
