vigoo's software development blog

Golem 1.5 features - Part 17: Semantic retry policies

Posted on April 24, 2026

Introduction

I am writing a series of short posts showcasing the new features of Golem 1.5, to be released at the end of April, 2026. The episodes of this series will be short and assume the reader knows what Golem is. Check my other Golem-related posts for more information!

Retry policies

Previous Golem versions had a simple, global retry policy describing a few retries with exponential backoff. This policy applied to everything: when a Golem application failed with an error (for example a Rust panic), Golem recreated the instance according to this policy a few times before marking the agent as permanently failed. In theory this could resolve transient errors automatically, but it gave the user little control (although the parameters of the global policy could be overridden), and the exact behavior depended heavily on how and when the application crashed.

Golem 1.5 improves this in two ways: inline retries and configurable retry policies.

Inline retries

In many cases, throwing away the agent instance and recreating its state from scratch is not necessary for a retry. Golem now transparently and immediately retries transient issues with HTTP requests and similar operations, providing a much faster and more reliable retry mechanism.

Another part of this change is a better classification of what can be retried and what cannot. We no longer retry things that are known to fail again deterministically.

Retry policies

The new retry policies are much more flexible and customizable than the single global configuration we had before. Just like secrets and resource quotas, retry policies are defined per environment. An environment can have an arbitrary number of named retry policies. Each has a predicate and a policy: the predicate is an expression that decides whether the policy applies to a given failure case. If it is chosen (evaluation order can be controlled with a priority field), the policy describes the retry strategy to apply.

Policies

The policy is a highly composable structure, with the following base nodes controlling delays:

| Policy | Description |
|---|---|
| periodic | Fixed delay between each attempt |
| exponential | baseDelay × factor^attempt (exponentially growing delays) |
| fibonacci | Delays follow the Fibonacci sequence starting from first and second |
| immediate | Retry immediately (zero delay) |
| never | Never retry; give up on first failure |
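To make the delay formulas concrete, here is a small self-contained sketch of the delay-producing base nodes as plain TypeScript functions. This is not the Golem SDK: the `DelayFn` type, the helper names and the assumption that attempts are counted from 0 are mine, purely for illustration. The never node is left out because it produces no delay at all.

```typescript
// Delay (in milliseconds) before retry number `attempt`.
// Assumption: attempts are counted from 0; Golem's own counting may differ.
type DelayFn = (attempt: number) => number;

// periodic: the same fixed delay before every attempt
const periodic = (delayMs: number): DelayFn => () => delayMs;

// exponential: baseDelay × factor^attempt
const exponential = (baseDelayMs: number, factor: number): DelayFn =>
  (attempt) => baseDelayMs * Math.pow(factor, attempt);

// fibonacci: first, second, first + second, ...
const fibonacci = (firstMs: number, secondMs: number): DelayFn =>
  (attempt) => {
    let a = firstMs;
    let b = secondMs;
    for (let i = 0; i < attempt; i++) {
      const next = a + b;
      a = b;
      b = next;
    }
    return a;
  };

// immediate: zero delay
const immediate: DelayFn = () => 0;

// Helper: delays for the first five attempts
const firstFive = (f: DelayFn): number[] => [0, 1, 2, 3, 4].map(f);

console.log(firstFive(periodic(500)));         // [500, 500, 500, 500, 500]
console.log(firstFive(exponential(200, 2.0))); // [200, 400, 800, 1600, 3200]
console.log(firstFive(fibonacci(100, 200)));   // [100, 200, 300, 500, 800]
```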

These can be combined with various combinators:

| Combinator | Description |
|---|---|
| countBox | Limits the total number of retry attempts |
| timeBox | Limits retries to a wall-clock duration |
| clamp | Clamps computed delay to a [minDelay, maxDelay] range |
| addDelay | Adds a constant offset on top of the computed delay |
| jitter | Adds random noise (±factor × delay) to avoid thundering herds |
| filteredOn | Apply the inner policy only when a predicate matches; otherwise give up |
| andThen | Run the first policy until it gives up, then switch to the second |
| union | Retry if either sub-policy wants to; pick the shorter delay |
| intersect | Retry only while both sub-policies want to; pick the longer delay |
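The way these combinators nest is easier to see in code. The following is a minimal sketch that models a policy as a function returning either a delay or `null` for "give up". That representation, and every name in it, is an assumption of mine and not how Golem implements policies. Only a few combinators are shown; jitter, addDelay, filteredOn and andThen are omitted to keep it short.

```typescript
// A policy maps (attempt number, elapsed time) to a delay in ms, or null to give up.
// This shape is an illustrative assumption, not Golem's internal model.
type Policy = (attempt: number, elapsedMs: number) => number | null;

const periodic = (delayMs: number): Policy => () => delayMs;
const exponential = (baseMs: number, factor: number): Policy =>
  (attempt) => baseMs * Math.pow(factor, attempt);

// countBox: give up once maxRetries attempts have been made
const countBox = (maxRetries: number, inner: Policy): Policy =>
  (attempt, elapsed) => (attempt < maxRetries ? inner(attempt, elapsed) : null);

// timeBox: give up once the wall-clock budget is spent
const timeBox = (limitMs: number, inner: Policy): Policy =>
  (attempt, elapsed) => (elapsed < limitMs ? inner(attempt, elapsed) : null);

// clamp: keep the computed delay within [minMs, maxMs]
const clamp = (minMs: number, maxMs: number, inner: Policy): Policy =>
  (attempt, elapsed) => {
    const d = inner(attempt, elapsed);
    return d === null ? null : Math.min(maxMs, Math.max(minMs, d));
  };

// union: retry if either side wants to; pick the shorter delay
const union = (a: Policy, b: Policy): Policy => (attempt, elapsed) => {
  const x = a(attempt, elapsed);
  const y = b(attempt, elapsed);
  if (x === null) return y;
  if (y === null) return x;
  return Math.min(x, y);
};

// intersect: retry only while both sides want to; pick the longer delay
const intersect = (a: Policy, b: Policy): Policy => (attempt, elapsed) => {
  const x = a(attempt, elapsed);
  const y = b(attempt, elapsed);
  return x === null || y === null ? null : Math.max(x, y);
};

// At most 5 retries, exponential delays kept within 100ms..5s
const p = countBox(5, clamp(100, 5000, exponential(200, 2.0)));
console.log([0, 1, 2, 3, 4, 5].map((n) => p(n, 0)));
// [200, 400, 800, 1600, 3200, null]
```

Because every combinator takes a policy and returns a policy, arbitrarily deep nesting falls out naturally, which is the same shape as the nested `inner` fields in the manifest format.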

Predicates

Predicates are boolean expressions evaluated against the error context properties. They can be composed with and, or, and not:

| Predicate | Description |
|---|---|
| propEq / propNeq | Equality / inequality comparison |
| propGt / propGte / propLt / propLte | Numeric comparisons |
| propExists | True if the named property exists in the context |
| propIn | True if the property value is in a given set |
| propMatches | Glob pattern matching on the text representation |
| propStartsWith / propContains | Prefix / substring matching |
| true / false | Constant predicates |
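As a rough mental model, a predicate is just a function from the error context properties to a boolean. The sketch below mimics a handful of the predicates from the table in plain TypeScript. The types, and the choice that a missing property never matches a comparison, are my assumptions rather than documented Golem behavior.

```typescript
// Property values are text or integers.
type Value = string | number;
type Props = { [name: string]: Value | undefined };
type Predicate = (props: Props) => boolean;

const propEq = (property: string, value: Value): Predicate =>
  (props) => props[property] === value;

const propExists = (property: string): Predicate =>
  (props) => props[property] !== undefined;

// Numeric comparisons. Assumption: a missing or non-numeric property never matches.
const propGte = (property: string, value: number): Predicate => (props) => {
  const v = props[property];
  return typeof v === "number" && v >= value;
};
const propLt = (property: string, value: number): Predicate => (props) => {
  const v = props[property];
  return typeof v === "number" && v < value;
};

const propIn = (property: string, values: Value[]): Predicate => (props) => {
  const v = props[property];
  return v !== undefined && values.indexOf(v) >= 0;
};

// Prefix match on the text representation of the value
const propStartsWith = (property: string, prefix: string): Predicate => (props) => {
  const v = props[property];
  return v !== undefined && String(v).indexOf(prefix) === 0;
};

// Boolean composition
const and = (...ps: Predicate[]): Predicate => (props) => ps.every((p) => p(props));
const or = (...ps: Predicate[]): Predicate => (props) => ps.some((p) => p(props));
const not = (p: Predicate): Predicate => (props) => !p(props);

// "status-code is in the 4xx range"
const isClientError = and(propGte("status-code", 400), propLt("status-code", 500));

console.log(isClientError({ verb: "GET", "status-code": 404 })); // true
console.log(isClientError({ verb: "GET", "status-code": 503 })); // false
console.log(isClientError({ verb: "invoke" }));                  // false
```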

Available properties

Properties are what predicates refer to when deciding whether a policy applies to a failure case. Golem defines the following:

| Property | Set by | Type | Description |
|---|---|---|---|
| verb | All | text | HTTP method, "invoke", "resolve", "trap", etc. |
| noun-uri | All | text | Target URI (e.g. https://…, worker://…, kv://…) |
| uri-scheme | All with URI | text | Scheme part of the URI |
| uri-host | All with URI | text | Hostname |
| uri-port | When port present | integer | Port number |
| uri-path | All with URI | text | Path portion |
| status-code | HTTP responses | integer | HTTP response status code |
| error-type | HTTP errors | text | Error classification string |
| function | RPC | text | Target function name |
| target-component-id | RPC | text | Component ID of the target worker |
| target-agent-type | RPC | text | Agent type name of the target |
| db-type | RDBMS | text | Database type (from URI scheme) |
| trap-type | WASM traps | text | "deterministic" or "transient" |

Defining policies

Retry policies can be defined in the application manifest under the retryPolicyDefaults section, keyed by environment name. These retry policies are default values set when the application is deployed, and they can be manipulated live while agents are already using them.

retryPolicyDefaults:
  my-environment:
    # Never retry client errors (4xx)
    no-retry-4xx:
      priority: 20
      predicate:
        and:
          - propGte: { property: status-code, value: 400 }
          - propLt: { property: status-code, value: 500 }
      policy: "never"

    # Aggressive retry for known-transient HTTP errors
    http-transient:
      priority: 10
      predicate:
        propIn:
          property: status-code
          values: [502, 503, 504]
      policy:
        countBox:
          maxRetries: 5
          inner:
            jitter:
              factor: 0.15
              inner:
                clamp:
                  minDelay: "100ms"
                  maxDelay: "5s"
                  inner:
                    exponential:
                      baseDelay: "200ms"
                      factor: 2.0

    # Catch-all: moderate retry for everything else
    catch-all:
      priority: 0
      predicate: true
      policy:
        countBox:
          maxRetries: 3
          inner:
            exponential:
              baseDelay: "100ms"
              factor: 3.0

When the agent encounters an error, policies are evaluated in descending priority order. In this example:

  1. First check no-retry-4xx (priority 20) — if it's a 4xx error, give up immediately
  2. Then check http-transient (priority 10) — if it's a 502/503/504, retry aggressively
  3. Fall through to catch-all (priority 0) — retry moderately for anything else
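The selection steps listed above can be sketched as a sort by descending priority followed by picking the first matching predicate. This is only an illustration of the evaluation order as described in this post. The `NamedPolicy` shape and the `selectPolicy` function are hypothetical and not part of any Golem API.

```typescript
type Props = { [name: string]: string | number | undefined };

// Hypothetical shape: just enough to demonstrate policy selection
interface NamedPolicy {
  name: string;
  priority: number;
  predicate: (props: Props) => boolean;
}

// Reads status-code if it is present and numeric
const statusCode = (props: Props): number | null => {
  const s = props["status-code"];
  return typeof s === "number" ? s : null;
};

// The three policies from the manifest example, deliberately listed out of order
const policies: NamedPolicy[] = [
  { name: "catch-all", priority: 0, predicate: () => true },
  {
    name: "http-transient",
    priority: 10,
    predicate: (p) => {
      const s = statusCode(p);
      return s !== null && [502, 503, 504].indexOf(s) >= 0;
    },
  },
  {
    name: "no-retry-4xx",
    priority: 20,
    predicate: (p) => {
      const s = statusCode(p);
      return s !== null && s >= 400 && s < 500;
    },
  },
];

// Evaluate in descending priority order and return the first match
function selectPolicy(all: NamedPolicy[], props: Props): string | null {
  const ordered = all.slice().sort((a, b) => b.priority - a.priority);
  for (let i = 0; i < ordered.length; i++) {
    if (ordered[i].predicate(props)) return ordered[i].name;
  }
  return null;
}

console.log(selectPolicy(policies, { verb: "GET", "status-code": 404 })); // no-retry-4xx
console.log(selectPolicy(policies, { verb: "GET", "status-code": 503 })); // http-transient
console.log(selectPolicy(policies, { verb: "invoke" }));                  // catch-all
```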

Default retry policy

When no user-defined retry policies are set, a default catch-all policy is activated which behaves the same way as the old global retry policy did.

In the new policy system, it would be defined as follows:

name: default
priority: 0
predicate: true
policy:
  countBox:
    maxRetries: 3
    inner:
      jitter:
        factor: 0.15
        inner:
          clamp:
            minDelay: "100ms"
            maxDelay: "1s"
            inner:
              exponential:
                baseDelay: "100ms"
                factor: 3.0
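To see what this default means in practice, the following sketch works out the nominal delays and the jitter range for each retry. It assumes attempts are counted from 0 and that jitter is applied to the already clamped delay (because jitter wraps clamp in the definition). Both are my reading of the structure, not confirmed runtime semantics.

```typescript
// Parameters copied from the default policy definition above
const baseDelayMs = 100;
const factor = 3.0;
const minDelayMs = 100;
const maxDelayMs = 1000;
const jitterFactor = 0.15;
const maxRetries = 3;

interface Row {
  attempt: number;
  nominalMs: number; // delay after exponential + clamp
  lowMs: number;     // lower bound after jitter
  highMs: number;    // upper bound after jitter
}

const schedule: Row[] = [];
for (let attempt = 0; attempt < maxRetries; attempt++) {
  const raw = baseDelayMs * Math.pow(factor, attempt);
  const nominalMs = Math.min(maxDelayMs, Math.max(minDelayMs, raw));
  schedule.push({
    attempt,
    nominalMs,
    lowMs: Math.round(nominalMs * (1 - jitterFactor)),
    highMs: Math.round(nominalMs * (1 + jitterFactor)),
  });
}

console.log(schedule);
// attempt 0: nominal 100ms, jitter range 85..115ms
// attempt 1: nominal 300ms, jitter range 255..345ms
// attempt 2: nominal 900ms, jitter range 765..1035ms
```

Under these assumptions, the last retry can land slightly above the 1s clamp, because the jitter is added after clamping.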

Live-editing policies

As mentioned above, the default policies created by golem deploy can be modified on the fly using the CLI (or the REST API). This can be useful to react to production issues by tweaking retry behaviors without having to redeploy anything.

The following examples show how the CLI can be used to manipulate these policies:

# Create a new policy
golem retry-policy create http-transient \
  --priority 10 \
  --predicate '{ propIn: { property: "status-code", values: [502, 503, 504] } }' \
  --policy '{ countBox: { maxRetries: 5, inner: { exponential: { baseDelay: "200ms", factor: 2.0 } } } }'

# List all policies in the current environment
golem retry-policy list

# Get a specific policy by name
golem retry-policy get http-transient

# Update an existing policy (e.g. raise its priority)
golem retry-policy update http-transient --priority 15

# Delete a policy
golem retry-policy delete http-transient

SDK support

The Golem SDKs provide the same runtime query and modification capabilities. We can query retry policies, modify them or create new ones, and apply temporary overrides that affect only the running agent.

The following examples use the Golem SDKs to define and use a custom retry policy for a given HTTP request. The same policy is shown in TypeScript, Rust, Scala, and MoonBit.

TypeScript:

import {
  Policy, Predicate, NamedPolicy, Props, Duration,
  withRetryPolicy,
} from '@golemcloud/golem-ts-sdk';

const policy = NamedPolicy.named(
  'http-transient',
  Policy.exponential(Duration.milliseconds(200), 2.0)
    .clamp(Duration.milliseconds(100), Duration.seconds(5))
    .withJitter(0.15)
    .onlyWhen(Predicate.oneOf(Props.statusCode, [502, 503, 504]))
    .maxRetries(5),
)
  .priority(10)
  .appliesWhen(Predicate.eq(Props.uriScheme, 'https'));

// Scoped usage — policy is restored when the block exits
withRetryPolicy(policy, () => {
  // HTTP calls in this block use the custom retry policy
  makeHttpRequest();
});

Rust:

use golem_rust::retry::*;
use std::time::Duration;

let policy = NamedPolicy::named(
    "http-transient",
    Policy::exponential(Duration::from_millis(200), 2.0)
        .clamp(Duration::from_millis(100), Duration::from_secs(5))
        .with_jitter(0.15)
        .only_when(Predicate::one_of(Props::STATUS_CODE, [502_u16, 503, 504]))
        .max_retries(5),
)
.priority(10)
.applies_when(Predicate::eq(Props::URI_SCHEME, "https"));

// Scoped usage — policy is restored when the block exits
with_named_policy(&policy, || {
    // HTTP calls in this block use the custom retry policy
    make_http_request();
})?;

Scala:

import golem.Guards._
import golem.host.Retry._

import scala.concurrent.duration._

val policy = named(
  "http-transient",
  Policy.exponential(200.millis, 2.0)
    .clamp(100.millis, 5.seconds)
    .withJitter(0.15)
    .onlyWhen(Props.statusCode.oneOf(502, 503, 504))
    .maxRetries(5)
).priority(10)
 .appliesWhen(Props.uriScheme.eq("https"))

// Scoped usage — policy is restored when the Future completes
withRetryPolicy(policy) {
  Future {
    // HTTP calls in this block use the custom retry policy
    makeHttpRequest()
  }
}

MoonBit:

let policy =
  NamedPolicy::named(
    "http-transient",
    Policy::exponential(Duration::millis(200), 2.0)
      .clamp(Duration::millis(100), Duration::seconds(5))
      .with_jitter(0.15)
      .only_when(
        Predicate::one_of(
          Props::status_code(),
          [Value::int(502), Value::int(503), Value::int(504)],
        ),
      )
      .max_retries(5),
  )
    .priority(10)
    .applies_when(Predicate::eq(Props::uri_scheme(), Value::text("https")))

// Scoped usage — policy is restored when the block exits
with_named_policy!(policy, fn() {
  // HTTP calls in this block use the custom retry policy
  make_http_request()
})

Extensibility

Any future retry-capable host functionality we add to Golem can be integrated into this retry policy system, and because policies can be queried at runtime, third-party, user-level retry functionality can also be built on top of it.
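As a rough idea of what such user-level retry functionality could look like, here is a generic retry loop driven by a delay function. The post does not show the API for querying policies at runtime, so the `nextDelay` parameter is a stand-in for whatever a queried policy would provide, and `retryWith` is a hypothetical helper, not an SDK function. The sleep function is injected so the sketch stays synchronous and easy to test.

```typescript
// Returns the delay in ms before the next attempt, or null to give up.
// Stand-in for a policy obtained from the runtime (assumption).
type DelayFn = (attempt: number) => number | null;

function retryWith<T>(
  operation: () => T,
  nextDelay: DelayFn,
  sleep: (ms: number) => void,
): T {
  let attempt = 0;
  while (true) {
    try {
      return operation();
    } catch (err) {
      const delay = nextDelay(attempt);
      if (delay === null) throw err; // policy gave up: propagate the last error
      sleep(delay);
      attempt += 1;
    }
  }
}

// Usage: an operation that fails twice, then succeeds
let calls = 0;
const slept: number[] = [];
const result = retryWith(
  () => {
    calls += 1;
    if (calls < 3) throw new Error("transient failure");
    return "ok";
  },
  (attempt) => (attempt < 5 ? 100 * Math.pow(2, attempt) : null),
  (ms) => { slept.push(ms); }, // record instead of actually sleeping
);

console.log(result, calls, slept); // ok 3 [100, 200]
```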