11. Agent cache
11.1. Overview
Each ArmoniK agent maintains a local, file-based cache that avoids redundant fetches from the object storage. When a task depends on a result that was recently consumed by another task on the same node, the agent reads the data directly from disk instead of going back to the remote object storage.
The cache is entirely internal to the agent process. Workers do not have direct access to it.
11.2. Two-level folder design
The agent uses two distinct filesystem paths for data management.
| Path (option) | Purpose | Accessible by worker |
|---|---|---|
| InternalCacheFolder | Persistent cache shared across tasks and agent processes | No |
| SharedCacheFolder | Per-task temporary staging folder exposed to the worker | Yes |
11.2.1. Internal cache (InternalCacheFolder)
InternalCacheFolder is the actual cache. It is a flat directory where each file is named after
the ResultId of the result it contains. Files accumulate across task executions and are only
removed when eviction is triggered (see section 11.5, Cache eviction).
Because this folder is just a regular directory on the node, multiple agent processes running on the same physical or virtual node can point to the same path. This makes the cache effectively shared at the node level: if agent A already fetched a result, agent B on the same node will find it in the cache without a round trip to the object storage.
11.3. Cache lifecycle
11.3.1. Pre-processing
Before executing a task, the agent resolves the ResultId of each dependency (payload and data
dependencies) and tries to serve them from the internal cache:
For each dependency, the agent looks for a file named <ResultId> in InternalCacheFolder.
If found, the file is copied into the per-task folder so the worker can access it.
If not found, the data is fetched from the object storage into the per-task folder.
Newly fetched files are then copied into the internal cache (via a temporary file to ensure atomicity) so that future tasks on the same node can reuse them.
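The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the agent's actual implementation: the function name stage_dependency and the fetch callback (standing in for the object storage client) are hypothetical, and the sketch assumes the temporary file lives in the same directory as the cache so that the final rename stays on one filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def stage_dependency(
    result_id: str,
    internal_cache: Path,
    task_folder: Path,
    fetch: Callable[[str, Path], None],  # hypothetical object storage download
) -> bool:
    """Stage one dependency in the per-task folder; return True on a cache hit."""
    cached = internal_cache / result_id
    staged = task_folder / result_id

    if cached.is_file():
        # Cache hit: copy the cached file so the worker can read it.
        shutil.copyfile(cached, staged)
        return True

    # Cache miss: fetch from the object storage into the per-task folder.
    fetch(result_id, staged)

    # Populate the cache through a temporary file, then rename it into place.
    # os.replace is atomic when source and target are on the same filesystem,
    # so other agents never observe a partially written cache entry.
    fd, tmp_name = tempfile.mkstemp(dir=internal_cache, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(staged, tmp_name)
        os.replace(tmp_name, cached)
    finally:
        # Only reached with the file still present if the copy or rename failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return False
```

Calling this function twice for the same ResultId with two different per-task folders triggers a single fetch: the second call is served from the internal cache.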
11.3.2. Post-processing
After the worker reports a successful output, the agent copies every result file produced by the
worker from the per-task folder into InternalCacheFolder. Downstream tasks that depend on those
results will therefore find them in the cache without hitting the object storage.
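A possible shape for this post-processing step is sketched below. The function name cache_outputs is hypothetical, and the sketch assumes the agent knows the ResultId of each expected output and that the worker wrote each output to a file named after its ResultId in the per-task folder. Writing through a temporary file before renaming is also an assumption here, borrowed from the pre-processing step.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def cache_outputs(
    result_ids: Iterable[str],
    task_folder: Path,
    internal_cache: Path,
) -> List[str]:
    """Copy produced result files into the cache; return the ids that were cached."""
    cached = []
    for result_id in result_ids:
        produced = task_folder / result_id
        if not produced.is_file():
            # The worker did not produce this result; nothing to cache.
            continue
        # Copy to a temporary name first, then rename into place so that
        # readers on the same node never see a half-written entry.
        tmp = internal_cache / f".tmp-{os.getpid()}-{result_id}"
        shutil.copyfile(produced, tmp)
        os.replace(tmp, internal_cache / result_id)
        cached.append(result_id)
    return cached
```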
11.4. Why workers cannot access the cache
Workers only see the per-task folder. This is intentional:
Isolation: a worker should only access the data it is authorised to see for its current task.
Simplicity: the cache files are keyed by ResultId, which is an internal ArmoniK identifier. Exposing the raw cache to workers would require them to understand ArmoniK internals.
Safety: the cache is shared across tasks and potentially across agents. Allowing a worker to write to it directly could corrupt entries used by other tasks.
11.5. Cache eviction
Eviction is driven by disk usage and happens at the end of every task (during agent disposal).
The agent reads the disk usage of the filesystem that hosts InternalCacheFolder.
If (usedSpace / totalSize) > CacheEvictionThreshold, eviction is triggered.
Files are sorted by their last-access or last-write time (whichever is more recent).
Files are deleted from the oldest to the most recently accessed until the usage falls below the threshold.
Caching is disabled when CacheEvictionThreshold is set to 0 (the default).
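The eviction procedure can be illustrated with the following Python sketch. It is an approximation under stated assumptions rather than the agent's real code: the names evict_cache, usage_ratio and the get_ratio parameter are hypothetical, and the loop is assumed to stop as soon as usage is no longer above the threshold. Disk usage is read with the standard shutil.disk_usage call.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def usage_ratio(folder: Path) -> float:
    """Return usedSpace / totalSize for the filesystem hosting the folder."""
    usage = shutil.disk_usage(folder)
    return usage.used / usage.total


def evict_cache(
    internal_cache: Path,
    threshold: float,
    get_ratio: Callable[[Path], float] = usage_ratio,  # injectable for testing
) -> List[str]:
    """Delete least recently used files until usage is not above the threshold."""
    removed: List[str] = []
    if get_ratio(internal_cache) <= threshold:
        return removed  # Eviction is not triggered.

    def last_used(path: Path) -> float:
        # Most recent of last-access and last-write time.
        stat = path.stat()
        return max(stat.st_atime, stat.st_mtime)

    candidates = sorted(
        (p for p in internal_cache.iterdir() if p.is_file()),
        key=last_used,
    )
    for path in candidates:  # Oldest first.
        path.unlink()
        removed.append(path.name)
        if get_ratio(internal_cache) <= threshold:
            break
    return removed
```

In this sketch, a threshold of 0 removes every file at the end of each task because disk usage is always above 0, so nothing is ever reused, which matches the statement that caching is disabled at that value.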
11.6. Configuration
The cache is configured through the following environment variables:
Pollster__SharedCacheFolder=/cache/shared
Pollster__InternalCacheFolder=/cache/internal
Pollster__CacheEvictionThreshold=0.8
For a full description of each variable, see the Pollster variables documentation.
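As an illustration, a deployment or test script could read and sanity-check these variables as sketched below. The CacheSettings class and load_cache_settings function are hypothetical helpers, not part of ArmoniK. The sketch assumes the threshold is a ratio between 0 and 1 (since it is compared with usedSpace / totalSize) and makes no assumption about default folder paths.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CacheSettings:
    shared_cache_folder: Optional[str]
    internal_cache_folder: Optional[str]
    cache_eviction_threshold: float

    @property
    def caching_enabled(self) -> bool:
        # A threshold of 0 (the default) disables caching.
        return self.cache_eviction_threshold > 0


def load_cache_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Read the Pollster cache variables from an environment-like mapping."""
    threshold = float(env.get("Pollster__CacheEvictionThreshold", "0"))
    if not 0.0 <= threshold <= 1.0:
        # Assumed valid range for a disk usage ratio.
        raise ValueError(f"threshold must be within [0, 1], got {threshold}")
    return CacheSettings(
        shared_cache_folder=env.get("Pollster__SharedCacheFolder"),
        internal_cache_folder=env.get("Pollster__InternalCacheFolder"),
        cache_eviction_threshold=threshold,
    )
```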