Design Decisions

This page summarizes the key architectural decisions in skelegent and the reasoning behind each one.

Why `#[async_trait]` instead of native async traits

Decision: All Layer 0 protocol traits use #[async_trait] (heap-allocated futures). Internal traits in Layer 1 (like Provider) use RPITIT (native async, zero-cost).

Reasoning: Rust stabilized async fn in traits, but async fn in dyn Trait is still not available natively. Layer 0 traits must be object-safe because the entire composition model depends on Box<dyn Operator>, Arc<dyn StateStore>, etc. The async_trait macro provides this by boxing the returned future.

Internal traits like Provider are never used behind dyn – they appear as generic type parameters in operator wrappers around react_loop (e.g., struct MyOperator<P: Provider>). These can use RPITIT for zero-cost abstraction. The object-safe boundary is the Operator trait, which is the protocol boundary.

Future: When Rust stabilizes async fn in dyn Trait with Send bounds, the Layer 0 traits will migrate to native async. This will be a breaking change in a minor version before v1.0.

Why `serde_json::Value` for state values

Decision: StateStore stores serde_json::Value, not generic T: Serialize.

Reasoning: A generic T would destroy object safety. StateStore must work as dyn StateStore because orchestrators, environments, and operators all share a state store through trait objects. Making the trait generic over the value type would require callers to agree on concrete types at compile time, defeating the purpose of dynamic composition.

serde_json::Value is the universal interchange format for agentic systems. Every LLM API speaks JSON. Every tool accepts and returns JSON. The cost (no compile-time schema checking) is acceptable because state data crosses process boundaries, is persisted to disk, and may be read by different versions of the code.

Why `rust_decimal::Decimal` for cost tracking

Decision: All monetary values (OperatorMetadata.cost, OperatorConfig.max_cost) use rust_decimal::Decimal.

Reasoning: Floating-point accumulation errors are real when tracking spend across thousands of LLM calls. f64 introduces rounding errors that compound over time. A system that runs 10,000 model calls per day, each costing fractions of a cent, needs exact arithmetic to produce accurate cost reports and enforce budgets precisely.

Decimal adds one dependency to Layer 0 but eliminates an entire class of bugs.

Why four protocol concerns plus middleware interfaces

Decision: The architecture has four Layer 0 protocol traits (Operator, Dispatcher, StateStore, Environment) plus Signalable and Queryable at Layer 2 (skg-effects-core), and two cross-cutting interfaces (per-boundary middleware, lifecycle events).

Reasoning: The four concerns are orthogonal and compose independently:

Operator – What happens in a single agent cycle (reasoning + acting).
Dispatch/Signal/Query – How multiple agents compose (topology + durability). Dispatcher invokes, Signalable delivers signals, Queryable reads workflow state.
State – How data persists (storage backend).
Environment – Where code runs (isolation + credentials).

These were derived from analyzing 23 architectural decisions that every agentic system must make. The four concerns cover all 23 decisions without overlap. Reducing to three concerns (by merging state into environment, or orchestration into operator) creates coupling where orthogonal concerns should be independent. Expanding to more concerns creates distinctions without meaningful boundaries.

Middleware is the cross-cutting Layer 0 surface because it intercepts stable protocol seams without turning runtime policy into protocol API. Budget/compaction coordination and observation/intervention mechanics live in runtime or orchestration code above Layer 0 unless they are later promoted into a real cross-boundary contract. Layer 0 only carries the middleware seams and message-level hints that travel with data, such as CompactionPolicy.

Why edition 2024

Decision: The workspace uses Rust edition 2024.

Reasoning: Edition 2024 is the latest stable edition and provides native support for RPITIT (return position impl trait in traits) and other modern language features. This allows traits like Provider to use zero-cost async abstractions without workarounds like the async_trait macro. The Rust ecosystem has fully adopted 2024, providing excellent compatibility with all core dependencies.

Why `#[non_exhaustive]` on all enums and structs

Decision: All public enums (ExitReason, TriggerType, etc.) and structs (OperatorInput, OperatorOutput, OperatorConfig, etc.) in Layer 0 are marked #[non_exhaustive].

Reasoning: Layer 0 is the stability contract. Adding a variant to an enum or a field to a struct should not be a breaking change. #[non_exhaustive] forces downstream code to handle unknown variants (with _ => arms) and prevents struct literal construction (forcing use of constructors or builder methods). This gives Layer 0 the freedom to evolve without breaking every implementation.

Why operators declare effects instead of executing them

Decision: OperatorOutput.effects contains Vec<Effect> – the operator declares side-effects but does not execute them.

Reasoning: The same operator code must work in radically different execution contexts. An operator running in-process has its effects executed immediately by the caller. An operator running inside a Temporal activity has its effects serialized and executed by the workflow engine. If the operator executed effects directly, it would be coupled to its execution context.

The effect declaration pattern makes operators pure functions over data: input in, output + effects out. The calling layer decides execution semantics.

Why the Provider trait is not object-safe

Decision: The Provider trait (in skg-turn) uses RPITIT and is not object-safe. It is never used behind dyn Provider.

Reasoning: Provider implementations are performance-critical – they make HTTP calls to LLM APIs. The zero-cost abstraction of RPITIT (no heap allocation for the future) is worth the restriction of not using dyn Provider. The object-safe boundary is one layer up: a concrete operator wrapper (generic over P: Provider) implements dyn Operator. The generic type parameter is erased at the protocol boundary.

This is the general pattern: internal implementation traits can be generic and non-object-safe for performance. Protocol traits must be object-safe for composition. The bridge between them is a concrete type that is generic internally but implements an object-safe trait externally.

Keyboard shortcuts

skelegent — Composable Agentic AI Architecture in Rust