This is Part 4 of 5 in the Autoresearch Week series.
This week I broke down the autoresearch pattern (Monday), ported it to SageMaker (Tuesday), and generalized it beyond ML training with binary criteria and seven principles (Wednesday). Each step moved the pattern forward — but each required a different setup: GPU fleet, Kiro Power, specific IDE.
Today I’m shipping the tool that makes the pattern accessible to anyone with a terminal: autoresearchctl.
It’s a pip-installable CLI that bakes the seven principles from Day 3, the dual eval harness (command + LLM judge), and the six mutation operators into a single tool. Six verbs: init, eval, run, log, diff, rollback. Works standalone or as the backend for any AI agent.
Repo: github.com/dgallitelli/autoresearchctl
Why a CLI?
Every implementation this week required a different setup. The SageMaker version needs AWS credentials and a GPU fleet. The Kiro Power needs Kiro. If you just want to try the pattern on a docs folder or a system prompt, you need something simpler.
pip install git+https://github.com/dgallitelli/autoresearchctl.git
autoresearchctl init
# edit .autoresearch/config.yaml
autoresearchctl run --cycles 10
That’s it. No cloud resources, no IDE plugin, no agent session. Define your target, define your criteria, run the loop.
The Six Verbs
I spent time stress-testing each command to make sure it earns its place. No forced kubectl metaphors — every verb does something the others can’t.
autoresearchctl init # Scaffold config template
autoresearchctl eval # Measure current state
autoresearchctl run # Execute the optimization loop
autoresearchctl log # Score trends + operator analysis
autoresearchctl diff 0 5 # Compare artifacts between runs
autoresearchctl rollback 3 # Restore to a previous state
eval is the one worth explaining. It runs all criteria against the current state without mutating anything — a dry measurement. Use it to validate that your criteria work before committing to a full run, or to measure your manual edits against the same bar the loop uses.
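A command eval reduces to a shell invocation whose exit code is the verdict. Here is a minimal Python sketch of that dry measurement; the function name, the criteria dicts, and the `{file}` substitution are modeled on the config format shown below, not lifted from the tool's actual source:

```python
import glob
import subprocess

def eval_criteria(target_glob, criteria):
    """Run every command criterion against every target file; mutate nothing."""
    results = {}
    for path in sorted(glob.glob(target_glob)):
        for crit in criteria:
            # Substitute the file path into the check, then use the
            # shell exit code as the pass/fail signal (0 = PASS).
            cmd = crit["check"].replace("{file}", path)
            proc = subprocess.run(cmd, shell=True)
            results[(path, crit["name"])] = proc.returncode == 0
    passed = sum(results.values())
    return results, f"{passed}/{len(results)}"
```

Because nothing is written back, you can run this measurement as often as you like, before, during, or after manual edits.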
rollback exists because criteria are imperfect. The ratchet only moves forward during a run, but your human judgment might say run 12 was better than run 18 even if the score disagrees. Rollback lets your taste override the metrics.
Config as Code
Everything lives in .autoresearch/config.yaml:
target: "docs/*.md"
criteria:
  - name: title_length
    type: command
    check: '[ $(wc -c < {file}) -lt 60 ]'
  - name: has_clear_purpose
    type: llm_judge
    check: "Does the doc clearly state its purpose?"
mutator: rule-based    # or "llm"
evaluator: local       # or "bedrock" or "anthropic"
model_id: us.anthropic.claude-sonnet-4-6
max_cycles: 20
plateau_threshold: 3
Two eval types in one config. Command evals are shell commands — deterministic, free, fast. LLM judge evals call Bedrock (or Anthropic, when implemented) — for criteria that require understanding. The evaluator field controls the backend; model_id is shared across the LLM judge and the LLM mutator.
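For the LLM judge side, here is a sketch of what the Bedrock call could look like, using the real `bedrock-runtime` `converse` API. The prompt framing and the YES/NO parsing are my assumptions about how a judge might be wired, not the tool's actual implementation (the `boto3` import is deferred so the parsing helper works without AWS installed):

```python
def parse_verdict(resp) -> bool:
    """Extract the judge's YES/NO answer from a Bedrock converse response."""
    text = resp["output"]["message"]["content"][0]["text"]
    return text.strip().upper().startswith("YES")

def llm_judge(check: str, path: str, model_id: str) -> bool:
    """Ask the judge model a yes/no question about one file's content."""
    import boto3  # only needed when the bedrock evaluator is selected
    doc = open(path, encoding="utf-8").read()
    prompt = (
        f"{check}\n\n--- document ---\n{doc}\n--- end ---\n"
        "Answer with exactly YES or NO."
    )
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return parse_verdict(resp)
```

The key design point stands regardless of the exact wiring: the judge answers a binary question per criterion, so LLM evals slot into the same PASS/FAIL scoring as shell commands.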
The --model, --mutator, and --evaluator CLI flags override config values, so you can experiment without editing YAML:
# Full LLM mode: Bedrock judges + LLM-powered mutations
autoresearchctl run \
--mutator llm \
--evaluator bedrock \
--model us.anthropic.claude-sonnet-4-6
A Real Run
Here’s autoresearchctl optimizing 5 documentation pages for SEO compliance — the same test case from Day 3, now running through the CLI:
$ autoresearchctl eval
File title_length meta_desc single_h1 links
---------------------------------------------------------------
advanced-usage.md PASS FAIL PASS FAIL
api-reference.md FAIL FAIL FAIL FAIL
configuration.md FAIL FAIL FAIL FAIL
getting-started.md PASS PASS PASS PASS
troubleshooting.md PASS PASS FAIL PASS
---------------------------------------------------------------
Score: 9/20 (45%)
Nine of twenty checks pass. Now the loop (non-improving cycles omitted):
$ autoresearchctl run --cycles 6
BASELINE: 9/20 (45%)
Cycle 1 [add_constraint]: 12/20 IMPROVED
+ added meta descriptions to 3 docs
Cycle 3 [add_negative_example]: 15/20 IMPROVED
+ demoted extra H1s to H2 in 3 docs
Cycle 6 [add_counterexample]: 18/20 IMPROVED
+ added internal links to 3 docs
Final score: 18/20 (90%)
45% to 90% in 6 cycles. Zero LLM calls — all four criteria are shell commands. Then I checked what actually changed:
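The loop behind those cycles is a greedy ratchet: mutate, re-score, keep the change only if the score went up. A toy sketch, with operator selection and state handling simplified to guesses (the real tool snapshots files; here state is just a value):

```python
from itertools import cycle

def run_loop(state, score_fn, operators, max_cycles):
    """Greedy ratchet: apply a mutation, keep it only if the score improves."""
    best_score = score_fn(state)
    history = [("baseline", best_score)]
    ops = cycle(operators)
    for _ in range(max_cycles):
        op = next(ops)
        candidate = op(state)      # mutated copy; current state left untouched
        score = score_fn(candidate)
        if score > best_score:     # the ratchet: only forward moves stick
            state, best_score = candidate, score
            history.append((op.__name__, score))
    return state, best_score, history
```

Non-improving cycles simply leave the state alone, which is why the run output above can omit them without losing information.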
$ autoresearchctl diff 0 best
--- run-0000/docs/api-reference.md
+++ run-0006/docs/api-reference.md
+---
+description: "Runs all criteria against..."
+---
-# API Reference
+## API Reference
-# Core Module
+## Core Module
The diff makes the changes concrete. Every score improvement maps to a specific edit.
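Since each run archives a snapshot, `diff` can be little more than `difflib.unified_diff` over two snapshot directories. A sketch: the `run-NNNN` naming is taken from the output above, but the runs root directory and the walk logic are my assumptions:

```python
import difflib
from pathlib import Path

def diff_runs(run_a: str, run_b: str, root: str = ".autoresearch/runs") -> str:
    """Unified diff of every file shared between two archived run snapshots."""
    a_dir, b_dir = Path(root) / run_a, Path(root) / run_b
    chunks = []
    for a_file in sorted(a_dir.rglob("*")):
        if not a_file.is_file():
            continue
        rel = a_file.relative_to(a_dir)
        b_file = b_dir / rel
        if not b_file.exists():
            continue  # file only present in one run; skipped in this sketch
        chunks.extend(difflib.unified_diff(
            a_file.read_text().splitlines(keepends=True),
            b_file.read_text().splitlines(keepends=True),
            fromfile=f"{run_a}/{rel}", tofile=f"{run_b}/{rel}",
        ))
    return "".join(chunks)
```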
Governance Is Built In
The loop wants to run forever. Left unconstrained, that’s great for metrics and dangerous for everything else. autoresearchctl has three structural safeguards:
max_cycles — hard stop after N iterations
plateau_threshold — stop after N cycles with no improvement (default: 3)
rollback — revert to any previous state when criteria don’t capture what you actually care about
These are enforced in the loop logic, not written in a prompt the agent might ignore. The loop stops structurally, not advisorily.
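"Enforced in the loop logic" can be as small as a guard function the loop consults every cycle. A sketch with illustrative names, not the tool's actual code:

```python
def should_stop(cycle, cycles_since_improvement, max_cycles, plateau_threshold=3):
    """Structural stop: code, not a prompt, so the loop cannot talk its way past it."""
    if cycle >= max_cycles:
        return "max_cycles reached"
    if cycles_since_improvement >= plateau_threshold:
        return "plateau: no improvement in recent cycles"
    return None  # keep going
```

Whatever the return value, the decision is made by code the agent cannot rewrite mid-run, which is the whole point.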
One thing the CLI doesn’t do: parallel hypothesis testing. Day 2’s SageMaker orchestrator launches N experiments simultaneously; autoresearchctl runs sequential cycles. For local optimization (docs, prompts, configs), sequential is fine — each cycle takes seconds. For GPU-bound ML training, use the SageMaker version.
What’s Next
autoresearchctl gives you the pattern locally. But what about running it across 10 projects simultaneously? What about team visibility, cost governance, and audit trails? What about non-technical users who want a dashboard instead of a terminal?
Tomorrow: I’ll share the architecture for autoresearch-as-a-service — a serverless platform on AWS that takes this CLI to enterprise scale with Step Functions orchestration, budget controls, and a React dashboard. We’ll also step back and ask the bigger question: what happens when optimization loops become the default way teams improve their artifacts?
This is Part 4 of 5 in the Autoresearch Week series. Part 1 | Part 2 | Part 3
Code: autoresearchctl