The AI-ChatPharo-Benchmarks package gives you a complete, reproducible
benchmarking pipeline for code generation (and related tasks) that runs
inside the same Pharo image as the rest of ChatPharo.
It answers questions like: does the full ChatPharo ecosystem actually
produce better code than the raw model on its own? (Run the same benchmark
under #ecosystem and #rawLLM and compare.)

It is dynamic: benchmarks are extracted from a live Pharo package, so when
the package changes the gold data follows automatically — no hand-curated
dataset to maintain.
"1. Build a benchmark from a live package."
bench := ChatPharoBenchmarksAPI buildFromPackage: 'AI-ChatPharo-Tools'.
ChatPharoBenchmarksAPI save: bench.
"2. Run it through the full ChatPharo ecosystem (tools, browser env, history)."
agent := ChatPharoSettings default agent.
report := ChatPharoBenchmarksAPI
run: bench
withAgent: agent
strategy: #ecosystem.
ChatPharoBenchmarksAPI save: report.
"3. Compare ecosystem vs raw LLM head-to-head."
comparison := ChatPharoBenchmarksAPI
compare: bench
ecosystemAgent: agent
rawAgent: agent.
comparison asDictionary.
Files land under `~/Documents/chatpharo-benchmarks/` by default — change the
location with `ChatPharoBenchmarkStore root: aFileReference`.
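For example, a project-local store (a sketch; any FileReference works):

```smalltalk
"Redirect persistence away from the default Documents folder."
ChatPharoBenchmarkStore root: FileLocator imageDirectory / 'chatpharo-benchmarks'.
```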
Architecture

```
ChatPharoBenchmarksAPI            ← single-class facade, what most users call
│
├── ChatPharoBenchmarkBuilder     ← package → benchmark (gold data)
├── ChatPharoBenchmarkRunner     ← benchmark + config → report
├── ChatPharoBenchmarkComparison  ← N reports → diff
└── ChatPharoBenchmarkStore       ← JSON persistence
                                    (default: ~/Documents/chatpharo-benchmarks)
```
Entities

- `ChatPharoBenchmark` — collection of cases + metadata
- `ChatPharoBenchmarkCase` — one atomic task (method to generate, …)
- `ChatPharoBenchmarkResult` — outcome of one case under one configuration
- `ChatPharoBenchmarkReport` — aggregated results of one full run

Configurability

- `ChatPharoBenchmarkConfiguration` — strategy + agent + scorers + threshold
- `ChatPharoBenchmarkPromptTemplate` — how cases are rendered into prompts
- `ChatPharoBenchmarkCodeExtractor` — pulls Smalltalk out of LLM responses

Scorers (all subclasses of `ChatPharoBenchmarkScorer`)

- `ChatPharoBenchmarkExactMatchScorer` — strict string equality
- `ChatPharoBenchmarkNormalizedMatchScorer` — whitespace-insensitive equality
- `ChatPharoBenchmarkASTMatchScorer` — structural AST equality
- `ChatPharoBenchmarkCompilationScorer` — does it parse?
- `ChatPharoBenchmarkTokenOverlapScorer` — Jaccard over identifiers
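A configuration assembled by hand, using only selectors that appear
elsewhere in this document (`ecosystemWith:`, `scorers:`,
`primaryScorerName:`, `passThreshold:`):

```smalltalk
"Score with AST structure and compilability only; pass at >= 0.8 on astMatch."
config := ChatPharoBenchmarkConfiguration ecosystemWith: agent.
config
    scorers: { ChatPharoBenchmarkASTMatchScorer new.
        ChatPharoBenchmarkCompilationScorer new };
    primaryScorerName: 'astMatch';
    passThreshold: 0.8.
```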
ChatPharoBenchmarkBuilder walks a Pharo package and emits one case per
method that passes the filters.
```smalltalk
bench := ChatPharoBenchmarkBuilder new
    packageNamed: 'AI-ChatPharo-Tools';
    kind: #methodGeneration; "or #methodExplanation, #classGeneration, #snippetCompletion"
    excludeProtocol: 'initialization';
    excludeProtocol: 'tests';
    minSourceSize: 60; "skip trivial getters/setters"
    maxSourceSize: 4000; "skip giant methods"
    maxCases: 100; "cap the run"
    selectorFilter: [ :sel | (sel beginsWith: 'private') not ];
    build.
```
Every case is given a deterministic id of the form
`<package>/<Class>/<selector>/<kind>`, so reports stay comparable across
package versions and diffs are meaningful when the package does change.
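For instance (`cases` and `id` are assumed accessor names; the concrete id
shown is illustrative):

```smalltalk
"Peek at a case id after building the benchmark above."
bench cases first id.
"=> 'AI-ChatPharo-Tools/SomeClass/someSelector/methodGeneration'"
```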
Two strategies are first-class citizens:

#ecosystem — full ChatPharo. The agent is used verbatim through
`agent copyForChat`, which means ChatPharoBrowserEnvironment tools are
attached. Each ChatPharoBenchmarkResult records:

- `agentClassName`, `modelName`, `apiBaseURL` — which API was used
- `toolsEnabled = true`, `toolCallCount`, `toolCallNames` — whether and how
  the model used tools
- `promptTokens`, `completionTokens`, `totalTokens`, `latencyMs` — cost

#rawLLM — bypass the ecosystem. A single-shot call goes directly to the
vendor (with `tools: #()`), so you measure the underlying model on its own.
The result records `toolsEnabled = false` and `toolCallCount = 0`.
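A sketch of reading those fields back from a run, assuming
ChatPharoBenchmarkReport exposes its per-case results via `results` and the
fields above as plain accessors (both assumptions, not confirmed API):

```smalltalk
"Total tool calls across every case in the report."
(report results collect: [ :r | r toolCallCount ]) sum.
```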
Pointing the same agent at both strategies is the canonical apples-to-apples comparison:
```smalltalk
ChatPharoBenchmarksAPI
    compare: bench
    ecosystemAgent: agent
    rawAgent: agent.
```
Any number of configurations can be compared in a single run:

```smalltalk
configs := {
    ChatPharoBenchmarkConfiguration ecosystemWith: claudeAgent.
    ChatPharoBenchmarkConfiguration rawLLMWith: claudeAgent.
    ChatPharoBenchmarkConfiguration ecosystemWith: ollamaAgent.
    ChatPharoBenchmarkConfiguration rawLLMWith: ollamaAgent.
}.
comparison := ChatPharoBenchmarksAPI compare: bench withConfigurations: configs.
```
By default each result is scored by all built-in scorers; the
`primaryScorerName` (default `'normalizedMatch'`) decides pass/fail via
`passThreshold` (default 0.9).
| Scorer name | What it measures |
|---|---|
| `exactMatch` | Strict string equality with the gold source. |
| `normalizedMatch` | Equality after collapsing whitespace. |
| `astMatch` | Structural equality of the parsed Pharo AST. |
| `compilation` | 1.0 iff the generated code parses as a method/expression. |
| `tokenOverlap` | Jaccard similarity over identifier tokens. |
Add your own by subclassing ChatPharoBenchmarkScorer and overriding
#scoreName and #score:against:. Configure with:
```smalltalk
config scorers: { MyScorer new. ChatPharoBenchmarkASTMatchScorer new }.
config primaryScorerName: 'mySemanticScore'.
config passThreshold: 0.75.
```
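A minimal sketch of such a scorer. The superclass and the two overridden
selectors come from the text above; the word-overlap heuristic, the class
name MyScorer, and the package name are illustrative:

```smalltalk
ChatPharoBenchmarkScorer subclass: #MyScorer
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'MyBenchmarks'

MyScorer >> scoreName
    ^ 'mySemanticScore'

MyScorer >> score: generated against: gold
    "Toy heuristic: fraction of the gold source's whitespace-separated
    words that also occur in the generated source. Illustrative only."
    | goldWords |
    goldWords := gold substrings asSet.
    goldWords isEmpty ifTrue: [ ^ 0.0 ].
    ^ (goldWords count: [ :w | generated includesSubstring: w ])
        / goldWords size asFloat
```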
ChatPharoBenchmarkStore saves benchmarks and reports as JSON via
STONJSON. Layout:
```
<root>/
  benchmarks/
    AI-ChatPharo-Tools-methodGeneration-2026-04-15T….json
  reports/
    AI-ChatPharo-Tools-methodGeneration/
      ecosystem-claude-3-5-sonnet-2026-04-15T….json
      rawLLM-claude-3-5-sonnet-2026-04-15T….json
```
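Because the files are plain STONJSON, they can also be read without the
API; a sketch using the stock STONJSON parser (the '…' path segment is a
placeholder, as above):

```smalltalk
"Parse a saved report straight into dictionaries and arrays."
json := STONJSON fromString:
    (FileLocator documents / 'chatpharo-benchmarks' / 'reports' / '…')
        contents.
```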
Every report includes a top-level summary block (pass rate, average
latency, total tokens, total tool calls, mean of every scorer) so a
dashboard can read it without reprocessing the per-case results.
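A dashboard-side sketch, assuming the summary block is reachable through a
`summary` accessor returning a dictionary keyed by the names above
(assumptions, not confirmed API):

```smalltalk
"Read aggregate numbers without touching per-case results."
report summary at: 'passRate'.
report summary at: 'totalTokens'.
```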
Reload anytime:

```smalltalk
report := ChatPharoBenchmarksAPI loadReportFrom:
    (FileLocator documents / 'chatpharo-benchmarks' / 'reports' / '…').
report passRate.
```
The #ecosystem strategy invokes the exact same `getResponseForPrompt:` that
the chat UI uses, including tools, history, and the multi-iteration loop —
no parallel test harness to keep in sync.

The #rawLLM strategy strips the ecosystem entirely and talks to the same
vendor with `tools: #()` and a neutral system prompt, so the comparison is
meaningful.