pi-autoresearch: Autonomous experiment loop extension for pi

pi-autoresearch is an automated experimentation loop tool built for pi, adaptable to a variety of optimization goals including test speed, bundle size, LLM training performance, build time, Lighthouse scores, and more.

| Type | Description |
| --- | --- |
| Extension | Toolset + live component + /autoresearch dashboard |
| Skill | Define optimization goal, generate session files, start the experiment loop |

| Tool | Description |
| --- | --- |
| init_experiment | One-time session configuration: sets experiment name, metric, unit, and optimization direction (min/max). |
| run_experiment | Executes any command, measures wall-clock time, and captures output. |
| log_experiment | Logs experiment results, automatically commits code, and updates the component and dashboard. |

Status Component

A status component is always visible at the top of the editor. Example:

🔬 autoresearch 12 runs 8 kept │ best: 42.3s

It shows the number of runs, kept runs, and the best result so far.

Full Results Dashboard

Type /autoresearch to open the full‑featured results dashboard.

  • Ctrl+X toggles the display.
  • Escape closes it. All experiment data is aggregated here for easy review.

Skill Core Functionality

When you invoke the autoresearch-create skill, the tool will ask (or infer from context) for the goal, command, metric, and relevant file scope. It then generates two core files and immediately starts the experiment loop. Optionally, a check script can be created:

| File | Purpose |
| --- | --- |
| autoresearch.md | Session document: records the goal, metric, relevant file scope, and tried approaches. A new agent can resume the session using only this file. |
| autoresearch.sh | Benchmark script: contains prerequisite checks and the task execution logic. Outputs a metric line in the format `METRIC name=number`. |
| autoresearch.checks.sh | (Optional) Reverse-pressure check script: runs tests, type checks, lints, etc. after each successful benchmark. If this script fails, the "keep" operation is prevented. |
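To make the `METRIC name=number` contract concrete, here is a minimal benchmark-script sketch. Only the output format comes from the documentation above; the timing logic and the placeholder `sleep` task are illustrative assumptions (swap in your real command, e.g. `pnpm test`):

```shell
#!/bin/bash
# Hypothetical autoresearch.sh sketch; only the METRIC output line is
# specified by the docs, everything else is an illustrative assumption.
set -euo pipefail

# Prerequisite check: fail fast if the task's command is missing.
command -v sleep >/dev/null || { echo "sleep not found" >&2; exit 1; }

# Time the task under test (placeholder 'sleep'; use your real command here).
start=$(date +%s.%N)
sleep 0.1
end=$(date +%s.%N)

# Emit the metric line the extension parses: METRIC name=number
echo "METRIC seconds=$(echo "$end $start" | awk '{printf "%.2f", $1 - $2}')"
```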

Installing pi-autoresearch

Quick Install

Run the following command:

pi install https://github.com/davebcn87/pi-autoresearch

Manual Install

  1. Copy the extension and skill files to the appropriate directories:

     cp -r extensions/pi-autoresearch ~/.pi/agent/extensions/
     cp -r skills/autoresearch-create ~/.pi/agent/skills/

  2. In pi, execute /reload to load the newly added extension and skill.

Using pi-autoresearch

1. Start an Automated Research Session

Type the following command to start the skill:

/skill:autoresearch-create

The agent will ask for the optimization goal, command, metric, and relevant file scope (or infer them from context). It then creates a branch, generates autoresearch.md and autoresearch.sh, runs the baseline benchmark, and immediately begins the experiment loop.

2. Automated Experiment Loop

The agent autonomously executes the loop:

  • Edit code → commit changes → run run_experiment → run log_experiment → keep effective changes or roll back ineffective ones → repeat.

The loop continues indefinitely unless manually interrupted.

Every experiment result is appended to autoresearch.jsonl in the project, one line per run. This log is:

  • Resume‑friendly: The agent can read this file to recover an interrupted session.
  • Context‑reset‑friendly: autoresearch.md records all attempted approaches, so a new agent can obtain the full context.
  • Human‑readable: You can open the file anytime to see the complete experiment history.
  • Branch‑aware: Each branch has its own independent experiment session.
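The exact JSONL schema isn't documented here; as an illustration only, a single line might look like the following, where every field name is a guess:

```json
{"run": 12, "status": "kept", "metric": 42.3, "commit": "abc1234", "description": "enable vitest thread pool"}
```

Because the file is append-only, resuming a session is just a matter of reading the last lines.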

3. Monitor Experiment Progress

  • Status component: Real‑time core data is shown at the top of the editor.
  • Dashboard: Type /autoresearch to open the full dashboard, view results tables, and see the best run.
  • Interruption summary: Press Escape at any time to stop the loop; the agent will provide a summary report of the experiments.

Example Use Cases

| Scenario | Optimization Metric | Command |
| --- | --- | --- |
| Test speed optimization | seconds (lower is better) | pnpm test |
| Bundle size optimization | kilobytes (lower is better) | pnpm build && du -sb dist |
| LLM training optimization | bits per byte on validation set (lower is better) | uv run train.py |
| Build speed optimization | seconds (lower is better) | pnpm build |
| Lighthouse score optimization | performance score (higher is better) | lighthouse http://localhost:3000 --output=json |
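For the bundle-size row, the benchmark script would need to turn `du -sb` output into a metric line. A minimal sketch, assuming a `dist/` output directory (the scratch-dir setup and fake bundle file exist only to make the example self-contained):

```shell
#!/bin/bash
# Hypothetical bundle-size benchmark sketch; 'dist' and the build step come
# from the use-case table above, the rest is an illustrative assumption.
set -euo pipefail

cd "$(mktemp -d)"   # scratch dir so this sketch doesn't touch a real project

# pnpm build                                  # real build step goes here
mkdir -p dist && printf 'x%.0s' {1..2048} > dist/bundle.js   # fake bundle

# Measure the output directory in bytes, convert to whole kilobytes (rounded up).
bytes=$(du -sb dist | awk '{print $1}')
kb=$(( (bytes + 1023) / 1024 ))
echo "METRIC kilobytes=$kb"
```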

How It Works

The extension provides domain‑agnostic infrastructure, while the skill holds domain‑specific knowledge. This separation lets one extension support an unlimited number of application domains.

┌──────────────────────┐ ┌──────────────────────────┐
│ Extension (global) │ │ Skill (per domain) │
│ │ │ │
│ run_experiment │◄────│ Command: pnpm test │
│ log_experiment │ │ Metric: seconds (min) │
│ component + dashboard│ │ Scope: vitest config │
│ │ │ Ideas: pooling, parallel│
└──────────────────────┘ └──────────────────────────┘

Two core files ensure that sessions can survive restarts and context resets:

  • autoresearch.jsonl – Append‑only log file, recording per‑run metrics, status, commit hashes, and descriptions.
  • autoresearch.md – Dynamically updated document containing the goal, tried approaches, dead ends, and key achievements.
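As a rough illustration (the real layout is generated by the skill and may differ), autoresearch.md could look like this, with every specific value invented for the example:

```markdown
# autoresearch session

Goal: make `pnpm test` faster (metric: seconds, minimize)
Scope: vitest config, test setup files

## Tried
- Enable thread pool: 48.1s → 42.3s (kept)
- Disable coverage during tests: broke CI checks (rolled back, dead end)
```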

Optional Reverse‑Pressure Checks

Create an autoresearch.checks.sh file to add correctness checks (e.g., tests, type checks, linters). This ensures optimizations don’t break existing functionality. Example script:

#!/bin/bash
set -euo pipefail
pnpm test --run
pnpm typecheck

How Checks Work

  • If the file does not exist, the experiment loop runs normally, with no extra effect.
  • If the file exists, it is automatically executed after every successful benchmark (exit code 0).
  • The check execution time does not affect the core metric timing.
  • If a check fails, the experiment is marked as checks_failed (behavior identical to a crash: no code commit, changes are rolled back).
  • The dashboard separately displays the checks_failed status, distinguishing correctness failures from benchmark crashes.
  • Checks have their own timeout (default 300 seconds, configurable via checks_timeout_seconds in run_experiment).
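The keep/roll-back decision the bullets describe can be sketched as follows. This is NOT the extension's actual implementation; the function and variable names are assumptions, and only the "missing file means checks pass" and "failing checks behave like a crash" rules come from the text above:

```shell
#!/bin/bash
# Illustrative sketch of the reverse-pressure check flow; names are invented.
set -euo pipefail

cd "$(mktemp -d)"   # scratch dir for this sketch

run_checks() {
  # Documented behavior: if autoresearch.checks.sh is absent, proceed as if
  # the checks passed.
  [ -f autoresearch.checks.sh ] || return 0
  bash autoresearch.checks.sh
}

benchmark_ok=true   # stand-in for "the benchmark exited with code 0"

if $benchmark_ok && run_checks; then
  status="kept"            # commit the change
else
  status="checks_failed"   # roll back, same as a crash: no commit
fi
echo "status=$status"
```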

Visit davebcn87/pi-autoresearch for the source code and more information.