vigoo's software development blog

Golem 1.5 features - Part 16: Quotas

Posted on April 24, 2026

Introduction

I am writing a series of short posts showcasing the new features of Golem 1.5, to be released at the end of April, 2026. The episodes of this series will be short and assume the reader knows what Golem is. Check my other Golem-related posts for more information!

Quotas

A modern application usually depends on various third-party services. This is even more true today with AI agents: an agent makes requests to various external systems as well as to LLM providers and other AI infrastructure. Most of these have costs and limits. It matters how many requests you send to your chosen model and how large they are, and many APIs have built-in quotas and rate limits that affect their callers.

In Golem 1.5 we introduced a new, general-purpose feature that tracks resource quotas in a distributed Golem application. The idea is that we can define an arbitrary set of resources with limited availability, and require agents to reserve a part of this pool through quota tokens before using it. By granting tokens this way we can control agents running in parallel and make sure we don't overuse the limited resources.

It is possible to acquire some quota tokens and split them, passing a part of them to another Golem agent through agent-to-agent RPC calls. This is a very powerful tool for structured control over arbitrary external resource consumption.

Integration

In this release we introduce this only as a general tool, available through the Golem SDKs; it is not integrated with any libraries yet. In the future we expect to have dedicated support for quota tokens in various client libraries, but for now they have to be used as a manually implemented layer on top of the clients.

Setting it up

Let's see some concrete examples. Just like with secrets, quota resources are also defined per environment, and we can define them with their initial limits in the application manifest.

The following manifest snippet defines three different types of resources:

resourceDefaults:
  prod:
    - name: api-calls
      limit:
        type: Rate
        value: 100
        period: minute
        max: 1000
      enforcementAction: reject
      unit: request
      units: requests
    - name: storage
      limit:
        type: Capacity
        value: 1073741824       # 1 GB
      enforcementAction: reject
      unit: byte
      units: bytes
    - name: connections
      limit:
        type: Concurrency
        value: 50
      enforcementAction: throttle
      unit: connection
      units: connections

Here prod is just an example environment. For each environment we can define an arbitrary number of resources, each identified by a unique name. The most important parts of their configuration are limit and enforcementAction.
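To make the shape of these definitions more tangible, here is a small TypeScript sketch that mirrors the manifest entries as types and renders a one-line summary of each. The type names, the describeResource helper and the wording of the summaries are purely illustrative assumptions and not part of the Golem SDK or CLI. In particular, EnforcementAction only lists the two values that appear in the snippet, and unit/units are assumed to be singular and plural display names.

```typescript
// Illustrative types mirroring the manifest snippet; NOT a Golem SDK API.
type Limit =
  | { type: "Rate"; value: number; period: string; max: number }
  | { type: "Capacity"; value: number }
  | { type: "Concurrency"; value: number };

// Only the two values used in the snippet above.
type EnforcementAction = "reject" | "throttle";

interface ResourceDefinition {
  name: string;
  limit: Limit;
  enforcementAction: EnforcementAction;
  unit: string; // assumed: singular display name
  units: string; // assumed: plural display name
}

// The three resources of the example prod environment.
const prodResources: ResourceDefinition[] = [
  {
    name: "api-calls",
    limit: { type: "Rate", value: 100, period: "minute", max: 1000 },
    enforcementAction: "reject",
    unit: "request",
    units: "requests",
  },
  {
    name: "storage",
    limit: { type: "Capacity", value: 1073741824 }, // 1 GB
    enforcementAction: "reject",
    unit: "byte",
    units: "bytes",
  },
  {
    name: "connections",
    limit: { type: "Concurrency", value: 50 },
    enforcementAction: "throttle",
    unit: "connection",
    units: "connections",
  },
];

// Render a resource definition as a short human-readable summary.
function describeResource(r: ResourceDefinition): string {
  const limit = r.limit;
  const noun = limit.value === 1 ? r.unit : r.units;
  switch (limit.type) {
    case "Rate":
      return `${r.name}: ${limit.value} ${noun} per ${limit.period} (max ${limit.max}), ${r.enforcementAction}`;
    case "Capacity":
      return `${r.name}: ${limit.value} ${noun} in total, ${r.enforcementAction}`;
    case "Concurrency":
      return `${r.name}: up to ${limit.value} concurrent ${noun}, ${r.enforcementAction}`;
  }
}
```

Calling prodResources.map(describeResource) gives one summary line per resource, which is a quick way to check that a definition says what we intended.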

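We have three different limit types, each of which appears in the snippet above: Rate (used by api-calls), Capacity (used by storage) and Concurrency (used by connections).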

In addition to the limits, we can select what to do when the token request cannot be satisfied; the snippet above uses the reject and throttle actions.

Dynamic changes

The resource definitions in the application manifest are only defaults applied at deployment. With CLI commands (or REST API calls) we can modify them at any time, and the changes affect running agents immediately.

For example, we can raise the limit of the connections resource defined above to 100 if we need more concurrency:

$ golem resource update connections --limit '{"type":"concurrency","value":100}' --environment prod

Code example

Let's see how we can use these resources from code!

Initialization

The first step is to acquire a quota token interface for every resource our agent is going to need. This can be done in the agent's constructor, or the first time the token is needed, but should be done only once:

TypeScript:

import { acquireQuotaToken } from "golem-ts-sdk";

const token = acquireQuotaToken("api-calls", 1n);

Rust:

use golem_rust::quota::QuotaToken;

let token = QuotaToken::new("api-calls", 1);

Scala:

import golem.host.QuotaApi._

val token = QuotaToken("api-calls", BigInt(1))

MoonBit:

let token = @quota.QuotaToken::new("api-calls", 1UL)

The second parameter (1) is the expected amount requested per reservation. For a simple rate limiting use case, where we associate 1 API call with 1 token, this can be 1.

Simple rate limiting

For a simple rate limiting case, we can reserve one token for each API call (of course we could also weight different API calls differently, by associating different token counts to them):

TypeScript:

import { withReservation } from "golem-ts-sdk";

const result = await withReservation(token, 1n, async (reservation) => {
  const response = await callSimpleApi();
  return { used: BigInt(1), value: response };
});

Rust:

use golem_rust::quota::with_reservation;

let result = with_reservation(&token, 1, |_reservation| {
    let response = call_simple_api();
    (1, response)
});

Scala:

val result = withReservation(token, BigInt(1)) { reservation =>
  callSimpleApi().map { response =>
    (BigInt(1), response)
  }
}

MoonBit:

let result = @quota.with_reservation(token, 1UL, fn(reservation) {
  let response = callSimpleApi()
  (1, response)
})
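The weighting mentioned at the start of this section could be as simple as a lookup table that maps each kind of call to a token cost. The sketch below is a hypothetical illustration: the call kinds, their weights and the helper names are made up, and plain numbers are used instead of the SDK's integer type so that the snippet runs on its own.

```typescript
// Hypothetical call kinds and weights: how many quota tokens each call costs.
type CallKind = "search" | "export" | "bulkImport";

const callWeights: Record<CallKind, number> = {
  search: 1,
  export: 5,
  bulkImport: 20,
};

// Amount to reserve for a single call of the given kind.
function tokensFor(kind: CallKind): number {
  return callWeights[kind];
}

// Amount to reserve up front for a whole batch of calls.
function tokensForBatch(kinds: CallKind[]): number {
  return kinds.reduce((sum, kind) => sum + tokensFor(kind), 0);
}
```

The computed amount would then take the place of the fixed 1 in the reservation call, and the same number would be reported back as the amount used.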

Rate-limiting LLMs

The same mechanism can be used to rate-limit LLMs, for example, based not just on the number of requests but on the actual tokens consumed. Instead of reserving a single token, we reserve the maximum number of tokens we expect the request to consume (most LLM APIs let us enforce this maximum). Then we read from the response how many tokens the request actually used, and commit that amount. Returning the final number from the withReservation helper is one way to commit, but there is also an explicit commit call we can use.
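The reserve-then-commit flow described above can be modelled locally to see how the numbers move. The following TypeScript sketch does not use the Golem SDK: LocalPool, withLocalReservation, fakeLlm and summarize are hypothetical stand-ins, the pool lives in memory, and the "LLM" simply reports the prompt length as its token usage. It also assumes that a failed call consumes nothing, which may differ from what the real enforcement does.

```typescript
// In-memory model of a quota pool. NOT the Golem SDK, just a local stand-in.
class LocalPool {
  available: number;

  constructor(initial: number) {
    this.available = initial;
  }

  // Set aside `amount` units up front; fail if the pool cannot cover them.
  reserve(amount: number): { amount: number } {
    if (amount > this.available) throw new Error("quota exceeded");
    this.available -= amount;
    return { amount };
  }

  // Settle a reservation: `used` units stay consumed, the rest is handed back.
  commit(reservation: { amount: number }, used: number): void {
    const consumed = Math.min(used, reservation.amount);
    this.available += reservation.amount - consumed;
  }
}

// Same shape as the helper used in the post: reserve, run the body,
// then commit whatever the body reports as used.
async function withLocalReservation<T>(
  pool: LocalPool,
  maxAmount: number,
  body: () => Promise<{ used: number; value: T }>
): Promise<T> {
  const reservation = pool.reserve(maxAmount);
  let used = 0; // assumption: a failed body consumes nothing
  try {
    const result = await body();
    used = result.used;
    return result.value;
  } finally {
    pool.commit(reservation, used);
  }
}

// Hypothetical LLM stub: reports the prompt length as tokens used, capped at maxTokens.
async function fakeLlm(
  prompt: string,
  maxTokens: number
): Promise<{ text: string; tokensUsed: number }> {
  return { text: `echo: ${prompt}`, tokensUsed: Math.min(maxTokens, prompt.length) };
}

// Reserve the worst case (maxTokens), then commit the usage read from the response.
async function summarize(pool: LocalPool, prompt: string, maxTokens: number): Promise<string> {
  return withLocalReservation(pool, maxTokens, async () => {
    const response = await fakeLlm(prompt, maxTokens);
    return { used: response.tokensUsed, value: response.text };
  });
}
```

The point of reserving the maximum first is that parallel agents can never collectively exceed the pool while requests are in flight; committing the real usage afterwards returns the unused part so it is not wasted.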

Splitting and merging

Quota tokens can be split, merged and transferred between agents. The following example splits off 200 units from our agent's available tokens for a given resource, and sends them to another agent.

TypeScript:

const childToken: QuotaToken = token.split(200n);

const childAgent = await SummarizerAgent.newPhantom();
const summary = await childAgent.summarize(text, childToken);

Rust:

let child_token: QuotaToken = token.split(200);

let child_agent = SummarizerAgent::new_phantom().await;
let summary = child_agent.summarize(text, child_token).await;

Scala:

val childToken: QuotaToken = token.split(BigInt(200))

for {
  childAgent <- SummarizerAgent.newPhantom()
  summary <- childAgent.summarize(text, childToken)
} yield summary

MoonBit:

let child_token: QuotaToken = self.token.split(200UL)

let child_agent = SummarizerAgent::new_phantom()
child_agent.summarize(text, child_token)

In addition to this, we could return the split tokens after the remote call, and merge them back into the original agent's tokens:

TypeScript:

token.merge(returnedToken)

Rust:

token.merge(returned_token)

Scala:

token.merge(returnedToken)

MoonBit:

token.merge(returned_token)
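To illustrate the bookkeeping behind splitting and merging, here is a small local model in TypeScript. It is a sketch under assumptions rather than the SDK's QuotaToken: LocalToken is a hypothetical class that tracks a plain number, and it assumes that a split moves units from the parent to the child and that a merge moves whatever is left back.

```typescript
// Hypothetical local model of a splittable token; NOT the SDK's QuotaToken.
class LocalToken {
  resource: string;
  amount: number;

  constructor(resource: string, amount: number) {
    this.resource = resource;
    this.amount = amount;
  }

  // Carve `amount` units off this token into a new token for the same resource.
  split(amount: number): LocalToken {
    if (amount <= 0 || amount > this.amount) {
      throw new Error("invalid split amount");
    }
    this.amount -= amount;
    return new LocalToken(this.resource, amount);
  }

  // Absorb another token of the same resource, leaving it empty.
  merge(other: LocalToken): void {
    if (other.resource !== this.resource) {
      throw new Error("resource mismatch");
    }
    this.amount += other.amount;
    other.amount = 0;
  }
}

const parentToken = new LocalToken("api-calls", 1000);
const delegatedToken = parentToken.split(200); // parent keeps 800, child gets 200
delegatedToken.amount -= 150; // the child agent consumes 150 units of its share
parentToken.merge(delegatedToken); // the unused 50 units flow back to the parent
```

In this model the total never grows: whatever the child does not consume returns to the parent, which is what makes handing out sub-budgets to other agents safe.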