Invocation execution model

Authored June, 2022

This documents the behavior of the user-visible invocation mechanism IREE uses to schedule program execution. Internally IREE uses a very similar model for tracking its own workloads and carries that model down to the target APIs and devices, which themselves use much the same approach. The intent is to expose the device model in an abstracted way that captures the full execution intent and propagates it to the hardware that executes it. Though here we focus on the user-visible portion of execution there is really only one "IREE execution model" and the entire stack follows the same design. At its core this design is an instantiation of the classic out-of-order execution algorithms dating back to the 1960s.

Glossary

stateDiagram
    state UserApplication {
      direction BT
      state Context0 {
        ModuleA-->ModuleAState0
        ModuleB-->ModuleBState0
      }
      state Context1 {
        ModuleA-->ModuleAState1
        ModuleB-->ModuleBState1
        ModuleC-->ModuleCState1
      }
      state ModuleA {
        @func1
        @func2
      }
      state ModuleB {
        @func3
        @func4
      }
      state ModuleC {
        @func5
      }
    }

Program

An IREE program is a collection of modules instantiated in a context from which invocations can be made. Invocations are ordered on a user-controlled timeline that uses fences to define the execution order requirements, enabling out-of-order execution. A hosting application may have multiple programs, or multiple instances of the same program, loaded at a time and running invocations across multiple timelines.

Module

Modules define executable code and data that can be loaded, linked, and run à la ELF shared libraries. Modules may be implemented in C/C++, as bytecode or C sources generated by the IREE compiler, or via any other mechanism that can run code and implement the iree_vm_module_t interface. Modules on their own are read-only and can be reused across many contexts.

Traditional ML runtimes would use a model (graph, etc) as their module representation. In IREE everything is a module including runtime subsystems like the HAL and user-provided custom code. This ensures that anything IREE can do can be externalized and replaced by users without needing to modify the core IREE code.

Context

A collection of modules are linked and instantiated in a context. Each context operates independently and carries its own copies of mutable module state. Invocations execute within a context scope and hosting applications coordinate across contexts as required. Contexts are cheap to create (microseconds) and retain (~100B + program state) such that users can decide how to manage them based on their scenario.

Traditional ML runtimes would call these "sessions" but in IREE everything is a program. Whether the program is stateful or stateless and how the program is invoked is up to the program author.
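
For example, a hosting application could instantiate two contexts from the same module to serve two independent clients: the module itself is shared read-only while each context carries its own copy of the mutable state. A minimal sketch in the pseudocode style used throughout this document, with hypothetical Instance/Context helpers:

# Hypothetical pseudocode: the module is loaded once and shared read-only.
instance = Instance()
module_a = instance.load_module("module_a")
# Each context instantiates its own copy of module_a's mutable state.
context0 = Context(instance, modules=[module_a])
context1 = Context(instance, modules=[module_a])
context0.invoke(@set_value, 1)  # only affects context0's state
context1.invoke(@set_value, 2)  # context1 is unaffected by context0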

Invocation

An invocation represents a single call into a module exported function using the program state stored in a context. Users can decide whether to perform synchronous blocking invocations or asynchronous non-blocking invocations per-call; the behavior of the invocation is independent from the target function and a user program may contain a mix of both.

As an example, a user program may synchronously invoke a @query_output_shapes function to preallocate storage for an asynchronous @execute_in_place function to write into.
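
A sketch of that flow in the same pseudocode style used below (allocate_buffer is a hypothetical helper standing in for whatever allocation path the user chooses):

# Synchronously ask the program how large the output will be.
output_shape = invoke(@query_output_shapes, input)
# Preallocate storage using a hypothetical allocation helper.
output_buffer = allocate_buffer(output_shape)
# Asynchronously execute into the preallocated storage; ordering is expressed
# with fences instead of blocking the caller.
async_invoke(@execute_in_place, input, output_buffer, wait_fence, signal_fence)
signal_fence.wait()  # block only if/when the result is actually needed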

Timeline

A timeline represents the observable order of execution. Users define their own timelines and communicate them to IREE via fences. Timelines do not match up with the order of invocations unless the user dictates they must by way of fences. In the absence of fences all invocations execute in an arbitrary order and may run concurrently, just as threads in C do in the absence of barriers.

Each timeline can be thought of as an independent clock domain that may operate asynchronously at its own frequency with only fences acting to tie separate timelines together. This directly mirrors real hardware constraints like clock domain crossing as each execution scope (thread on core, driver calls to queues, kernel queues to device queues, device queues to compute unit queues, etc) is naturally operating at different rates and well-designed systems must tolerate that variability.
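
As an illustration, two user-defined timelines progress independently until a fence from one is used as a wait on the other (a sketch assuming a hypothetical Timeline type):

# Hypothetical pseudocode: timelines progress independently by default.
timeline_a = Timeline()
timeline_b = Timeline()
# Work ordered only on timeline_a; nothing on timeline_b depends on it yet.
async_invoke(@produce, None, timeline_a.at(t))
# Tie the timelines together: this invocation on timeline_b cannot begin
# executing until timeline_a has reached t.
async_invoke(@consume, timeline_a.at(t), timeline_b.at(u))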

Fence

A fence is a specific point of progress in one or more timelines acting as a barrier, fork, or join point. Fences only guard execution ordering and not any particular resources, though users can use them to guard resources by defining when in time those resources become available for use.

Waits on fences are wait-until operations specifying that the timeline must reach at least a specific point. This allows for flexible reordering and deferral of execution as executors can pull forward scheduled work based on policy (run similar work together, etc).
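
For example, a wait on a fence at point t succeeds whether the timeline is exactly at t or has already advanced past it (a sketch in the same pseudocode style):

fence = my_timeline.at(t)
# Wait-until semantics: returns once the timeline has reached *at least* t.
fence.wait()
# The executor was free to reorder or batch any work scheduled before t so
# long as all of it completed by the time the wait returned.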

Hardware Abstraction Layer (HAL)

The HAL is an optional feature of IREE that is used to provide a consistent interface across execution resources. It is used internally by IREE programs to define and submit work to devices and signal across them but may also be used by users to directly interface with hardware in a compatible way. Exposing the HAL API allows users to efficiently manage their data and custom execution without expensive marshaling. Most users will only interact with HAL buffers as they work with their data but more advanced integrations can directly insert IREE into existing device contexts to transparently share scheduling and resources or insert their own code into IREE to pipeline custom execution.
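
As an example of the buffer-centric usage most users encounter, input data can be wrapped in a HAL buffer view and passed to an invocation without extra marshaling. A hedged sketch with hypothetical helper names (the exact wrapping/import calls vary by language binding):

# Hypothetical pseudocode: wrap existing host memory as a HAL buffer view
# (zero-copy where the device and memory type allow it).
device = create_hal_device("local-task")
input_view = device.allocator.import_host_buffer(my_array)
# The buffer view is then passed to an invocation like any other argument.
async_invoke(@fn, input_view, wait_fence, signal_fence)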

Execution by Timelines

NOTE: this defines an execution scheme that IREE supports but a user may use one or more such schemes in a single program - just as a C application may mix single- and multi-threaded code within itself for different components.

The combination of invocations, timelines, and fences allows users to provide future knowledge to lower layers of the system by declaring their availability requirements and the lower layers are then able to execute the work out-of-order so long as the specified requirements are met. The primary goal when designing for such a system is to specify as few requirements as possible in order to provide the maximum amount of scheduling freedom to the implementation.

This makes timelines one of the most critical components of the interface. The purpose of invocations is to schedule work against one or more timelines and what happens within the invocations is an implementation detail of the program.

Sequential Execution

Here we say "a user invokes a function to schedule execution on a timeline" rather than the more traditional "a user invokes a function to execute work". This manifests in the IREE ABI as invocations taking fences that define specific points on timelines the user may observe:

# Fences are effectively just timeline + integer tuples and are cheap to hold.
wait_fence = my_timeline.at(t)
signal_fence = my_timeline.at(t+1)
# Schedule work against the timeline.
# All work prior to t must complete before execution can occur and after
# execution the timeline will advance to t+1.
async_invoke(@some_fn, wait_fence, signal_fence)
# The invocation may have returned immediately after the work was scheduled;
# until the fence is reached no actual execution may have occurred. To
# synchronize the user code with the timeline the user can block until the fence
# is reached.
signal_fence.wait()

To the user this would appear as:

sequenceDiagram
    User->>@some_func: invoke
    activate @some_func
    @some_func->>User: ;
    @some_func-->>@some_func: wait t
    @some_func-->>User: signal t+1
    deactivate @some_func

This means that from the user's perspective the actual operations performed by the invocation are not important: the only thing the user can observe in this situation is when the timeline reaches t+1, as they specified. Whether the invocation internally needs many steps to complete (for example, because the program maintains its own internal timelines) is an implementation detail. Actual execution may look like this:

sequenceDiagram
    User->>@some_func: invoke
    activate @some_func
    @some_func->>User:  ;
    @some_func->>@some_func: ;
    @some_func-->>Device A: ;
    Device A-->>Device A: wait t
    activate Device A
    @some_func->>@some_func: ;
    @some_func-->>Device B: ;
    activate Device B
    @some_func->>@some_func: ;
    Device A-->>@some_func: ;
    deactivate Device A
    @some_func->>@some_func: ;
    @some_func-->>Device B: ;
    activate Device B
    deactivate @some_func
    Device B-->>User: signal t+1
    deactivate Device B
    deactivate Device B

Even in this simple user-synchronous example the system is able to internally run several concurrent timelines with a minimal number of synchronization points and the lowest possible latency as the user is immediately notified without any intermediate layers needing to be woken, scheduled, executed, and passed on.

Pipelined Execution

The true power of timelines comes from the ability to pipeline execution. Users define DAGs with fences and can construct arbitrarily complex execution topologies whether from the same program or across multiple programs:

stateDiagram
    direction LR
    state fence0 <<fork>>
    [*] --> fence0
    fence0 --> @fn0
    state fence1 <<fork>>
    @fn0 --> fence1
    fence1 --> @fn1
    fence1 --> @fn2
    state fence2 <<join>>
    @fn1 --> fence2
    @fn2 --> fence2
    @fn3 --> fence2
    fence0 --> @fn4
    @fn4 --> fence2
    fence2 --> [*]

This is a simple extension to the synchronous example using the same primitives:

# Timeline is defined by the user.
fence_a = my_timeline.at(t)
fence_b = my_timeline.at(t+1)
fence_c = my_timeline.at(t+2)
# Invocations are launched using the fences and may not complete immediately.
async_invoke(@fn0, fence_a, fence_b)
async_invoke(@fn1, fence_b, fence_c)
async_invoke(@fn2, fence_b, fence_c)
async_invoke(@fn3, None, fence_c)
async_invoke(@fn4, fence_a, fence_c)
# Blocking here but no need to; could pass fence_c on to other invocations.
fence_c.wait()

The critical point is that the user never has to wait for any particular invocation to complete before scheduling more work against the timeline, even if those invocations could not themselves complete synchronously. The lower layers of the system are able to fully model the execution as early as possible without needing to communicate with (and, importantly, synchronize with) the user.
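
For example, rather than blocking on fence_c above, the same fence could be passed as the wait fence of a later invocation, possibly in a different program sharing the timeline (a sketch continuing the pseudocode above; @fn5 is hypothetical):

# Continue the pipeline instead of waiting: @fn5 will not execute until the
# timeline reaches t+2, wherever that work happens to run.
fence_d = my_timeline.at(t+3)
async_invoke(@fn5, fence_c, fence_d)
# Block (or keep chaining) only at the point a result is actually needed.
fence_d.wait()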

I/O

Users define the semantics of their programs themselves. For example, if the user knows the precise shape of an output buffer they can preallocate the buffer and pass it in. If they don't, they can factor out the shape calculation and invoke it synchronously in order to compute the shape, allocate the appropriately sized buffer, and pass that in. Or they could decide to only deal with asynchronous invocations and receive a program-allocated buffer view with the appropriate shape in their callback. IREE does not dictate the design of user programs and as such enables mixed stateful/stateless, asynchronous/synchronous, and arbitrary scheduling models (enqueue/drain, windowing, etc).

Inputs and outputs to invocations are provided by the user as primitive values (integers, floats, etc), supported builtin types (lists, byte buffers/strings), custom user types, and HAL types like buffers or buffer views (buffers + shape and type metadata). One or more wait fences can be used to order invocation access to one or more inputs by indicating that the resource is not available until a certain fence is reached. Similarly one or more signal fences can be used to order subsequent access to the resources by indicating the advancement of the timeline when they are available.

# wait_fence_a must be reached before buffer_a and buffer_b can be read.
# wait_fence_b must be reached before buffer_c can be read.
# buffer_a will be ready to read when signal_fence_a has been reached.
async_invoke(@fn,
             (wait_fence_a, buffer_a, buffer_b),
             42,  # no ordering required on value types
             (wait_fence_b, buffer_c),
             (signal_fence_a, buffer_a))

The above example demonstrates an in-place operation on buffer_a. It's also possible for invocations to return values:

result = invoke(@sum, 1, 2)  # = 3

When executed asynchronously, a callback (or any construct built on callbacks, like promises/futures) can receive the results:

def my_callback(result):
  print(result)  # 3
async_invoke(@sum, 1, 2, my_callback)

Stream-ordered Allocations

Invocations generally have only a few KB of overhead and pipelined command buffers take only a small amount more. Storage buffers, however, can easily take hundreds of MB per invocation for I/O and transient state. This compounds as program usage becomes more complex or multiple programs are involved. IREE supports traditional host-ordered allocations (à la malloc/free) for persistent buffers like large constants/read-only data or user-managed ringbuffers. Stream-ordered allocations are also supported to allow for pooled buffer reservations that can be allocated in a scheduled order alongside program execution.

For more detailed examples see the CUDA blog posts describing their implementation: part 1, part 2.

With stream-ordered allocations each allocation and deallocation operation is scheduled with wait and signal fences just as with invocations. This allows these allocation operations to execute remotely on device without host program involvement. For example, scheduling alloca0/dealloc0 and alloca1/dealloc1 interleaved with the function execution allows the transient memory required for executing @fn0 to remain uncommitted until immediately before it is executed, committed during execution, and then decommitted immediately after execution. The memory required for passing data from @fn0 to the subsequent @fn1 and @fn2 survives until they have completed executing and is then decommitted. By using the same scheduling primitives as execution the allocation topology can be as arbitrarily complex as the invocation topology:

stateDiagram
    direction LR
    state fence0a <<fork>>
    [*] --> fence0a
    state fence0b <<fork>>
    fence0a --> alloca0
    fence0a --> alloca1
    alloca0 --> fence0b
    alloca1 --> fence0b
    fence0b --> @fn0
    state fence1a <<fork>>
    @fn0 --> fence1a
    state fence1b <<fork>>
    fence1a --> dealloc0
    dealloc0 --> fence1b
    fence1b --> @fn1
    fence1b --> @fn2
    state fence2a <<join>>
    @fn1 --> fence2a
    @fn2 --> fence2a
    fence2a --> dealloc1
    state fence2b <<join>>
    dealloc1 --> fence2b
    fence2b --> [*]

When operating in this way allocations from the host-perspective are just reservations for a slice of pooled storage that will be committed at some point in the future. Likewise deallocations from the host-perspective release the prior reservation and schedule the paired decommit at some point in the future. Scheduling N sequential invocations thus requires only enough committed storage for a single invocation in addition to the I/O (unless that too is stream-ordered).
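
In the same pseudocode style, a stream-ordered allocation and deallocation pair could be scheduled around an invocation like this (async_alloca/async_dealloca are hypothetical helpers mirroring the invocation API):

fence_0a = my_timeline.at(t)
fence_0b = my_timeline.at(t+1)
fence_1a = my_timeline.at(t+2)
fence_1b = my_timeline.at(t+3)
# Reserve transient storage; it is only committed once the timeline reaches t.
transient = async_alloca(transient_size, fence_0a, fence_0b)
# The invocation waits for the allocation to be committed before executing.
async_invoke(@fn0, transient, fence_0b, fence_1a)
# Decommit immediately after @fn0 completes, all without host involvement.
async_dealloca(transient, fence_1a, fence_1b)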

This scheduling behavior allows for minimal peak memory consumption regardless of the number of programs or the invocation pipeline depth, and for sharing of committed storage across programs: the memory consumption of a program at rest is near zero when stateless and the sum of all state when stateful. Target devices that natively support stream-ordered allocations (like CUDA) can even share pools across processes.

Another feature provided in combination with the forward-progress guarantee of fences is that execution can continue even when memory is constrained, so long as the memory pool can service a single request. A device can serialize two independent invocations each requiring 400MB of transient memory when the system only has 512MB available, with no user-visible impact besides increased latency. This does require the user to ensure the work they schedule is possible to run, or to rely on the target system having paging to soften the strictness of the pool quotas.

Stream-ordered allocations performed by the user for invocation inputs can be declared as transferred to the program. This allows the program to eagerly deallocate or reuse the input storage while still preserving the internal scheduling requirements of the program.
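
A minimal sketch, assuming a hypothetical transfer() wrapper that marks an input as owned by the program:

# buffer_a is transferred to the program: once its own scheduling requirements
# allow, the program may deallocate or reuse the storage without further
# coordination with the user.
async_invoke(@fn, (wait_fence_a, transfer(buffer_a)), signal_fence_a)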

Internal State

A stateful program may contain internal timelines that it uses to order its own execution. Take for example this simple stateful program:

class TrivialKernel(Program):
  # Mutable global state carried across invocations within a context.
  _x0 = Program.export_global(x_type)
  def get(self):
    return self._x0
  def set(self, x=x_type):
    self._x0 = x
  def matmul(self, x=y_type):
    # Reads and writes _x0, creating a data dependency between invocations.
    self._x0 = self._matmul(x, self._x0)
  @Program.kernel
  def _matmul(x, x0):
    return jnp.matmul(x, x0)

Each invocation of matmul needs to be executed in-order with prior invocations as there is a data dependency established on self._x0. Attempts to get or set must also be sequenced correctly with the matmul invocations. A basic usage like this:

m = TrivialKernel()
m.set(input)
m.matmul(a)
m.matmul(b)
m.matmul(c)
output = m.get()
print(output)  # implicit wait

Would be executed as:

sequenceDiagram
    activate User
    User->>TrivialKernel: @set(input)
    activate TrivialKernel
    TrivialKernel-->>Device: ;
    deactivate TrivialKernel
    activate Device
    TrivialKernel->>User: ;
    User->>TrivialKernel: @matmul(a)
    activate TrivialKernel
    TrivialKernel-->>Device: ;
    deactivate TrivialKernel
    TrivialKernel->>User: ;
    User->>TrivialKernel: @matmul(b)
    activate TrivialKernel
    TrivialKernel-->>Device: ;
    deactivate TrivialKernel
    TrivialKernel->>User: ;
    User->>TrivialKernel: @matmul(c)
    activate TrivialKernel
    TrivialKernel-->>Device: ;
    deactivate TrivialKernel
    TrivialKernel->>User: ;
    User->>TrivialKernel: @get()
    activate TrivialKernel
    TrivialKernel-->>Device: ;
    deactivate TrivialKernel
    TrivialKernel->>User: ;
    Device-->>Device: ;
    deactivate User
    User->>User: (wait)
    Device-->>User: (signal)
    deactivate Device
    activate User
    User->>User: print(output)
    deactivate User

Note that although the user provided no timeline of their own, execution is still ordered correctly due to the internal timeline constructed by the program. If the user wanted to also pipeline execution with another program they could do so by providing their own fences.
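
A sketch of what that might look like, reusing the pseudocode conventions above (the fence-accepting invocation form for module methods is hypothetical):

# Hypothetical pseudocode: supply user fences so another program can be ordered
# against TrivialKernel's results without blocking the host.
fence_a = my_timeline.at(t)
fence_b = my_timeline.at(t+1)
async_invoke(m.matmul, c, None, fence_a)           # signals t once complete
async_invoke(@other_program_fn, fence_a, fence_b)  # ordered after the matmul
fence_b.wait()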