PharoInfer

Pharo 13 & 14 · License: MIT · PRs Welcome · Status: Active

PharoInfer is a fully in-image inference engine for Pharo Smalltalk. It loads a GGUF model file directly from disk and drives llama.cpp through UFFI — there is no HTTP server, no Ollama bridge, and no subprocess. Talk to the model straight from the image.

Requirements

Point PharoInfer at your libllama

Pharo will look for libllama.so (or the platform equivalent) on the default library search path. To override, pin it from the image:

AILlamaLibrary libraryPath: '/home/me/llama.cpp/build/libllama.so'.
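
On macOS the library is typically named libllama.dylib, and on Windows llama.dll. Here is a hedged sketch that picks the name per platform before pinning it; the build directory is an assumption, so point it at your own llama.cpp checkout:

| dir libName |
dir := FileLocator home / 'llama.cpp' / 'build'. "assumed build location"
libName := OSPlatform current isWindows
    ifTrue: [ 'llama.dll' ]
    ifFalse: [ OSPlatform current isMacOSX
        ifTrue: [ 'libllama.dylib' ]
        ifFalse: [ 'libllama.so' ] ].
AILlamaLibrary libraryPath: (dir / libName) fullName.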

Installation

Metacello new
  githubUser: 'pharo-llm' project: 'pharo-infer' commitish: 'main' path: 'src';
  baseline: 'AIPharoInfer';
  load.

Quick Start

Text completion, in-image

| manager engine model |
"Select the in-process llama.cpp backend."
manager := AIModelManager default.
manager currentBackend: AILocalBackend new.

"Load a GGUF model from disk."
model := manager loadModel:
    (FileLocator home / 'models' / 'tiny.gguf') fullName.

"Point the engine at the backend and run a completion."
engine := AIInferenceEngine default.
engine backend: manager currentBackend.
engine complete: 'Hello from Pharo!' model: model name.

Streaming

engine
    stream: 'Tell me a joke about Smalltalk'
    model: model name
    onToken: [ :piece | Transcript show: piece ].
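
To keep the full response as well as the live Transcript output, collect the pieces in a WriteStream. This sketch assumes stream:model:onToken: blocks until generation finishes:

| buffer |
buffer := WriteStream on: String new.
engine
    stream: 'Tell me a joke about Smalltalk'
    model: model name
    onToken: [ :piece |
        buffer nextPutAll: piece.
        Transcript show: piece ].
buffer contents. "the complete response"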

Chat

| request |
request := AIChatCompletionRequest
    model: model name
    messages: {
        AIChatMessage system: 'You are a helpful AI assistant.'.
        AIChatMessage user: 'What is Smalltalk?' }.
AIChatAPI default complete: request.
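
Multi-turn chat is just a longer message history. The sketch below replays the model's previous answer as an assistant message; the #assistant: constructor is an assumption, mirroring #system: and #user:, so verify the selector in your image:

| request |
request := AIChatCompletionRequest
    model: model name
    messages: {
        AIChatMessage system: 'You are a helpful AI assistant.'.
        AIChatMessage user: 'What is Smalltalk?'.
        AIChatMessage assistant: 'Smalltalk is a purely object-oriented language.'. "assumed #assistant: selector"
        AIChatMessage user: 'Who created it?' }.
AIChatAPI default complete: request.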

GPU offload and threads

AILocalBackend new
    nGpuLayers: 999; "offload all layers"
    nThreads: 8;
    contextSize: 4096.
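
To put a tuned backend to work, hand it to the model manager from Quick Start before loading a model. The trailing yourself keeps the cascade answering the backend itself rather than the last setter's return value:

| backend manager |
backend := AILocalBackend new
    nGpuLayers: 999;
    nThreads: 8;
    contextSize: 4096;
    yourself.
manager := AIModelManager default.
manager currentBackend: backend.
manager loadModel: (FileLocator home / 'models' / 'tiny.gguf') fullName.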

Architecture