Resource: a language for coordination
This abstract description of resources works like a protocol or a currency understood by multiple parts of the Grain ecosystem. These parts use resources to regulate their own behavior or to interact with other components. Here are some major examples of how resources affect the system:
The executor / scheduler determines when and where a tasklet can be executed.
The executor and the worker become aware that a running tasklet is exceeding its time limit.
A new joining worker reports its capabilities to the manager.
A running tasklet learns about the resource it is allowed to use.
Each kind of resource is a class implementing the resource interface.
A resource class plays two roles: requested resource and allocated resource.
Depending on the nature of the resource, it can behave similarly or differently in
these two states (e.g. a request for Cores(3) might yield Cores([1,3,4]), whereas a
request for Memory(8) will always return Memory(8)).
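The difference between the two roles can be illustrated with a minimal sketch. ToyCores, ToyMemory and the two allocate_* helpers are hypothetical stand-ins written for this example; they are not Grain's classes and do not reflect how Grain allocates internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCores:
    """Hypothetical stand-in for a core resource (not Grain's Cores)."""
    N: int
    c: Optional[List[int]] = None  # concrete core ids, set only once allocated


@dataclass
class ToyMemory:
    """Hypothetical stand-in for a memory resource (not Grain's Memory)."""
    M: float


def allocate_cores(free: List[int], req: ToyCores) -> ToyCores:
    # The request names only a count; the allocation names concrete core ids.
    picked = sorted(free)[: req.N]
    if len(picked) < req.N:
        raise ValueError("not enough free cores")
    return ToyCores(N=len(picked), c=picked)


def allocate_memory(free_gb: float, req: ToyMemory) -> ToyMemory:
    # Memory is fungible, so the allocation looks exactly like the request.
    if req.M > free_gb:
        raise ValueError("not enough free memory")
    return ToyMemory(M=req.M)


cores = allocate_cores(free=[1, 3, 4, 6], req=ToyCores(3))
memory = allocate_memory(free_gb=16, req=ToyMemory(8))
print(cores.c, cores.N)
print(memory.M)
```

In this sketch the allocated cores carry information (the core ids) that the request did not, while the allocated memory is indistinguishable from the request.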
A user of Grain should know how to construct requested resources, while a tasklet implementer needs to know how to read information from the attributes of allocated resources.
In your Python program, create resource instances for tasklets you are going to submit:
# Combine
wg.submit(Memory(8), afn, ...)
# Delayed
(dfn @ Memory(8))(...)
A resource request can contain multiple kinds of resources. Use the & operator to combine
them (e.g. Node(N=1,M=8) & WTime(group='g1')). The allocated resource object will have
the attributes from each resource. Note that you cannot combine two resources of the same
kind (e.g. Memory(2) & Memory(8) is invalid).
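The combination rule can be sketched with a toy model. ToyResource, ToyMemory and ToyToken are hypothetical classes written for this example; the way Grain actually implements & and reports an invalid combination is not shown in this document, so the ValueError here is an assumption.

```python
class ToyResource:
    """Hypothetical base class; `&` merges resources keyed by their kind."""

    def __init__(self, **attrs):
        # One entry per resource kind, e.g. {"ToyMemory": {"M": 8}}.
        self.parts = {type(self).__name__: attrs}

    def __and__(self, other):
        clash = self.parts.keys() & other.parts.keys()
        if clash:
            raise ValueError(
                f"cannot combine two resources of the same kind: {sorted(clash)}"
            )
        combined = ToyResource.__new__(ToyResource)
        combined.parts = {**self.parts, **other.parts}
        return combined

    def __getattr__(self, name):
        # Only called when normal lookup fails: search every combined part.
        for attrs in self.__dict__.get("parts", {}).values():
            if name in attrs:
                return attrs[name]
        raise AttributeError(name)


class ToyMemory(ToyResource):
    def __init__(self, M):
        super().__init__(M=M)


class ToyToken(ToyResource):
    def __init__(self, token):
        super().__init__(token=token)


req = ToyMemory(8) & ToyToken("gpu")
print(req.M, req.token)  # attributes from both kinds are visible

try:
    ToyMemory(2) & ToyMemory(8)
except ValueError as err:
    print("rejected:", err)
```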
For a running tasklet, the allocated resource can be accessed through the context variable
grain.GVAR: grain.GVAR.res is the allocated resource object. Its attributes
contain the specific allocation information. See the reference below.
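As a rough illustration of reading an allocation, the helper below summarises the .c and .M attributes described in the reference. describe_allocation is a hypothetical helper, and the SimpleNamespace object is only a stand-in so the sketch runs without Grain; inside a real tasklet one would presumably pass grain.GVAR.res instead, which is not tested here.

```python
from types import SimpleNamespace


def describe_allocation(res) -> str:
    """Summarise an allocated resource using the attributes from the reference."""
    parts = []
    cores = getattr(res, "c", None)  # list of allocated core ids, if any
    if cores is not None:
        parts.append(f"cores={cores}")
    mem = getattr(res, "M", None)    # memory in GB, if any
    if mem is not None:
        parts.append(f"memory={mem}GB")
    return ", ".join(parts) or "no resource"


# Stand-in for grain.GVAR.res (assumption: the real object exposes the same
# attribute names as listed in the built-in resource reference).
fake_res = SimpleNamespace(c=[1, 3, 4], N=3, M=8)
summary = describe_allocation(fake_res)
print(summary)
```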
The Grain config specifies the resources held by a worker. Wall time, CPU cores, and virtual
memory are tied to the supercomputing job management system, so they are
specified with the config entries script.walltime, script.cores, and script.memory.
These three values are used both to request resources from the management system and to
set up the worker. You can inspect the implementation details with grain up --dry.
TODO: other resource in config
Grain provides several built-in resource classes, which should be enough to describe many computational tasks. In theory, a user could supply any custom class that implements the resource interface, but registering custom resource classes is not implemented yet.
Built-in resources
Names that start with a period are attributes of the allocated resource.
- class grain.resource.Cores(N)
CPU cores. As a requested resource, N should be an int specifying the number of cores
to request. For an allocated resource, .c is a List[int] of allocated cores and .N is
len(.c).
- class grain.resource.Memory(M)
Virtual memory. M, .M (int or float) both stand for memory in GB.
- grain.resource.Node(N, M)
A convenience function for Cores(N) & Memory(M).
- class grain.resource.WTime(*, T=None, softT=None, group=None, countdown=False)
Walltime limit for a tasklet or a worker. T: int (in seconds) or str in format
[d]-hh:mm:ss, representing a hard time limit. softT: int, soft time limit. If
unset, it defaults to the same value as the hard time limit.
countdown: defaults to false for requested and allocated resources, and to true for resource
providers (i.e. workers).
group: str, provides an automatic alternative for defining time limits: the soft and
hard time limits for each tagged group are estimated from the first few runs.
Current strategy: for each group, based on the first 8 tasklet runs, set T = mean + 5 stdev and
softT = mean + 2 stdev; runs that take <= 10s are not counted.
A tasklet can run on a worker when the tasklet’s soft time limit is satisfied by the worker.
When the run time of a tasklet exceeds the tasklet’s hard time limit, the run fails with the exception
trio.TooSlowError. This timeout is handled by Grain, so a tasklet does not need to set a
timeout itself. When a worker runs out of time, it quits.
- grain.resource.ZERO
“No resource is required to run the tasklet” / “The worker is holding no resource”. The current Grain implementation runs tasklets that require no resource on the local worker.
- class grain.resource.Token(token)
A tag for matching resource requesters to providers. token, .token: str, the
tag name. A tasklet carrying a token can only run on a worker with an exactly matching token.
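The group-based time limit estimate described under WTime can be sketched as follows. estimate_limits is a hypothetical function, not Grain's code. Two points are assumptions, because the description above leaves them open: runs of 10 s or less are skipped before the first 8 runs are selected, and the population standard deviation (statistics.pstdev) is used.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def estimate_limits(
    runtimes: List[float], n_runs: int = 8, min_run: float = 10.0
) -> Optional[Tuple[float, float]]:
    """Return (T, softT) in seconds, or None until enough runs are recorded."""
    # Assumption: short runs are dropped first, then the first n_runs are used.
    counted = [t for t in runtimes if t > min_run]
    if len(counted) < n_runs:
        return None
    sample = counted[:n_runs]
    mu = mean(sample)
    sigma = pstdev(sample)  # assumption: population standard deviation
    hard = mu + 5 * sigma   # T = mean + 5 stdev
    soft = mu + 2 * sigma   # softT = mean + 2 stdev
    return hard, soft


# Recorded run times in seconds; the 5 s and 8 s runs are not counted.
history = [5, 100, 110, 90, 105, 95, 100, 100, 100, 8]
limits = estimate_limits(history)
print(limits)
print(estimate_limits([100, 120]))  # too few runs so far
```

With this strategy the soft limit always sits at or below the hard limit, so a tasklet is scheduled against the tighter estimate and only fails once it exceeds the looser one.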
Resource interface
TODO: guide to implement a new resource class
- class grain.resource.Resource
- _request(res)
Return true if it is possible to allocate res, i.e. if alloc(res) would succeed.
- _alloc(res)
Return the allocated resources. Allocated resources should also be usable as requested resources (i.e.
B.request(A.alloc(some_res))), but the converse does not necessarily need to hold.
- _dealloc(res)
- _repr()
__repr__, for concatenating with other resources
- _stat()
Return (p,q) where p and q are int, and p/q represents the fraction of the resource that is currently available.
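To make the method list above more concrete, here is a toy class that follows the same method names. ToyCores is hypothetical: it does not subclass grain.resource.Resource, and its semantics (the provider owns a pool of core ids, and _alloc raises ValueError when the request cannot be met) are inferred from the one-line descriptions, so Grain's actual behavior may differ.

```python
from typing import List, Optional, Tuple


class ToyCores:
    """Hypothetical resource that mirrors the interface's method names."""

    def __init__(self, N: int, c: Optional[List[int]] = None):
        self.N = N                                # number of cores
        self.c = list(c) if c is not None else []  # concrete core ids
        self.total = len(self.c)                  # capacity when acting as provider

    def _request(self, res: "ToyCores") -> bool:
        # True if _alloc(res) would succeed right now.
        return res.N <= len(self.c)

    def _alloc(self, res: "ToyCores") -> "ToyCores":
        if not self._request(res):
            raise ValueError("request cannot be satisfied")
        taken, self.c = self.c[: res.N], self.c[res.N:]
        return ToyCores(res.N, c=taken)

    def _dealloc(self, res: "ToyCores") -> None:
        # Return previously allocated core ids to the pool.
        self.c = sorted(self.c + res.c)

    def _repr(self) -> str:
        return f"ToyCores({self.c or self.N})"

    def _stat(self) -> Tuple[int, int]:
        # (free, total): free/total is the available fraction.
        return len(self.c), self.total


worker = ToyCores(4, c=[0, 1, 2, 3])  # provider holding four cores
got = worker._alloc(ToyCores(3))       # request three of them
stat_during = worker._stat()
print(got._repr(), stat_during)

worker._dealloc(got)
print(worker._stat())
print(worker._request(got))  # an allocation can be reused as a request
```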