Benchmarks

The AI-ChatPharo-Benchmarks package gives you a complete, reproducible benchmarking pipeline for code generation (and related tasks) that runs inside the same Pharo image as the rest of ChatPharo.

It answers questions like: how well does a given agent generate code for a real package, and does the full ChatPharo ecosystem (tools, browser environment, history) actually beat a raw single-shot call to the same model?

It is dynamic: benchmarks are extracted from a live Pharo package, so when the package changes the gold data follows automatically — no hand-curated dataset to maintain.


Quick start

"1. Build a benchmark from a live package."
bench := ChatPharoBenchmarksAPI buildFromPackage: 'AI-ChatPharo-Tools'.
ChatPharoBenchmarksAPI save: bench.

"2. Run it through the full ChatPharo ecosystem (tools, browser env, history)."
agent  := ChatPharoSettings default agent.
report := ChatPharoBenchmarksAPI
              run: bench
              withAgent: agent
              strategy: #ecosystem.
ChatPharoBenchmarksAPI save: report.

"3. Compare ecosystem vs raw LLM head-to-head."
comparison := ChatPharoBenchmarksAPI
                  compare: bench
                  ecosystemAgent: agent
                  rawAgent: agent.
comparison asDictionary.

Files land under ~/Documents/chatpharo-benchmarks/ by default — change the location with ChatPharoBenchmarkStore root: aFileReference.
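
For example, to relocate the store (the target path here is purely illustrative):

ChatPharoBenchmarkStore root: FileLocator home / 'bench-data'.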


Architecture

ChatPharoBenchmarksAPI       ← single-class facade, what most users call
        │
        ├── ChatPharoBenchmarkBuilder         ← package → benchmark (gold data)
        ├── ChatPharoBenchmarkRunner          ← benchmark + config → report
        ├── ChatPharoBenchmarkComparison      ← N reports → diff
        └── ChatPharoBenchmarkStore           ← JSON persistence
                                              (default: ~/Documents/chatpharo-benchmarks)

Entities
  ChatPharoBenchmark        — collection of cases + metadata
  ChatPharoBenchmarkCase    — one atomic task (method to generate, …)
  ChatPharoBenchmarkResult  — outcome of one case under one configuration
  ChatPharoBenchmarkReport  — aggregated results of one full run

Configurability
  ChatPharoBenchmarkConfiguration   — strategy + agent + scorers + threshold
  ChatPharoBenchmarkPromptTemplate  — how cases are rendered into prompts
  ChatPharoBenchmarkCodeExtractor   — pulls Smalltalk out of LLM responses

Scorers (all subclasses of ChatPharoBenchmarkScorer)
  ChatPharoBenchmarkExactMatchScorer        — strict string equality
  ChatPharoBenchmarkNormalizedMatchScorer   — whitespace-insensitive equality
  ChatPharoBenchmarkASTMatchScorer          — structural AST equality
  ChatPharoBenchmarkCompilationScorer       — does it parse?
  ChatPharoBenchmarkTokenOverlapScorer      — Jaccard over identifiers

Building a benchmark from a package

ChatPharoBenchmarkBuilder walks a Pharo package and emits one case per method that passes the filters.

bench := ChatPharoBenchmarkBuilder new
    packageNamed: 'AI-ChatPharo-Tools';
    kind: #methodGeneration;          "or #methodExplanation, #classGeneration, #snippetCompletion"
    excludeProtocol: 'initialization';
    excludeProtocol: 'tests';
    minSourceSize: 60;                 "skip trivial getters/setters"
    maxSourceSize: 4000;               "skip giant methods"
    maxCases: 100;                     "cap the run"
    selectorFilter: [ :sel | (sel beginsWith: 'private') not ];
    build.

Every case is given a deterministic id of the form <package>/<Class>/<selector>/<kind> so reports stay comparable across package versions, and diffs are meaningful when the package does change.
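
For example, an id might look like this (class and selector are illustrative):

AI-ChatPharo-Tools/ChatPharoCodeTool/runOn:/methodGeneration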


Running a benchmark

Two strategies are first-class citizens:

#ecosystem — full ChatPharo

The agent is used verbatim through agent copyForChat, which means each case runs exactly as an interactive chat would: tools enabled, browser environment attached, conversation history in place.

Each ChatPharoBenchmarkResult records the generated code plus run metadata: latency, token usage, tool call count, toolsEnabled = true, and one score per scorer.

#rawLLM — bypass the ecosystem

A single-shot call goes directly to the vendor, with no tools, no browser environment, and no history, so you measure the underlying model on its own. The result records toolsEnabled = false and toolCallCount = 0.
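
To run it standalone, reuse the quick-start call with the other strategy symbol:

report := ChatPharoBenchmarksAPI
              run: bench
              withAgent: agent
              strategy: #rawLLM.
ChatPharoBenchmarksAPI save: report.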

Same agent, both strategies

Pointing the same agent at both strategies is the canonical apples-to-apples comparison:

ChatPharoBenchmarksAPI
    compare: bench
    ecosystemAgent: agent
    rawAgent:       agent.

Multi-config matrix

configs := {
    ChatPharoBenchmarkConfiguration ecosystemWith: claudeAgent.
    ChatPharoBenchmarkConfiguration rawLLMWith:    claudeAgent.
    ChatPharoBenchmarkConfiguration ecosystemWith: ollamaAgent.
    ChatPharoBenchmarkConfiguration rawLLMWith:    ollamaAgent.
}.
comparison := ChatPharoBenchmarksAPI compare: bench withConfigurations: configs.

Scoring

By default each result is scored by all built-in scorers; the primaryScorerName (default 'normalizedMatch') decides pass/fail via passThreshold (default 0.9).

Scorer name       What it measures
exactMatch        Strict string equality with the gold source.
normalizedMatch   Equality after collapsing whitespace.
astMatch          Structural equality of the parsed Pharo AST.
compilation       1.0 iff the generated code parses as a method/expression.
tokenOverlap      Jaccard similarity over identifier tokens.
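
Every scorer answers a number between 0 and 1 through the #score:against: protocol mentioned below. A quick sanity check (a sketch, assuming the first argument is the generated source and the second the gold source):

ChatPharoBenchmarkNormalizedMatchScorer new
    score: 'answer ^ 42'
    against: 'answer    ^    42'.
"Expected: 1.0, since the sources are equal once whitespace is collapsed."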

Add your own by subclassing ChatPharoBenchmarkScorer and overriding #scoreName and #score:against:. Configure with:

config scorers: { MyScorer new. ChatPharoBenchmarkASTMatchScorer new }.
config primaryScorerName: 'mySemanticScore'.
config passThreshold: 0.75.
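
A minimal scorer sketch (only the two overridden selectors come from the package; the metric itself is purely illustrative):

ChatPharoBenchmarkScorer subclass: #MySemanticScorer
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'My-Benchmarks'

MySemanticScorer >> scoreName
    ^ 'mySemanticScore'

MySemanticScorer >> score: generated against: gold
    "Toy metric: fraction of non-empty gold lines found verbatim in the generation."
    | goldLines |
    goldLines := gold lines reject: [ :line | line trimBoth isEmpty ].
    goldLines ifEmpty: [ ^ 1.0 ].
    ^ (goldLines count: [ :line | generated includesSubstring: line ]) / goldLines size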

Persistence

ChatPharoBenchmarkStore saves benchmarks and reports as JSON via STONJSON. Layout:

<root>/
  benchmarks/
    AI-ChatPharo-Tools-methodGeneration-2026-04-15T….json
  reports/
    AI-ChatPharo-Tools-methodGeneration/
      ecosystem-claude-3-5-sonnet-2026-04-15T….json
      rawLLM-claude-3-5-sonnet-2026-04-15T….json

Every report includes a top-level summary block (pass rate, average latency, total tokens, total tool calls, mean of every scorer) so a dashboard can read it without reprocessing the per-case results.
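
An illustrative shape (field names and values are assumptions based on the description above, not the exact serialization):

"summary" : {
    "passRate" : 0.82,
    "averageLatencyMs" : 2140.5,
    "totalTokens" : 184233,
    "totalToolCalls" : 57,
    "scorerMeans" : { "normalizedMatch" : 0.87, "astMatch" : 0.84 }
}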

Reload anytime:

report := ChatPharoBenchmarksAPI loadReportFrom:
    (FileLocator documents / 'chatpharo-benchmarks' / 'reports' / '…').
report passRate.

Why this design