Advanced Usage

Tuning

This document gathers the strategies users can take advantage of when tuning and optimizing OMPC applications.

Threads

Currently, the number of threads from the head process must match the total number of execute event handlers across all workers to get the best performance from the runtime. Our suggestion is to set the environment variable LIBOMP_NUM_HIDDEN_HELPER_THREADS to OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS multiplied by the number of workers. This is a known limitation of our runtime and we are working on fixing it upstream (see here) and in one of our next releases.
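
A minimal sketch of how these values fit together, assuming 4 worker processes and 2 execute event handlers per worker (both values are purely illustrative):

# Illustrative values: 4 workers, 2 execute event handlers per worker process
NUM_WORKERS=4
export OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS=2

# Head process helper threads = execute event handlers * number of workers (8 here)
export LIBOMP_NUM_HIDDEN_HELPER_THREADS=$((OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS * NUM_WORKERS))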

Scheduler

In order to map target tasks to devices (i.e. worker nodes), the OmpCluster runtime uses the HEFT scheduling algorithm by default.

The Round-Robin scheduling algorithm is also available. This algorithm is fast to execute but generally produces bad schedules. Still, the users can experiment with it by setting the following environment variable:

export OMPCLUSTER_SCHEDULER="roundrobin"

Setting this variable makes the runtime use round-robin instead of HEFT the next time your application is executed.

Blocking Scheduler

By default, the OMPC scheduler allows multiple target tasks to be mapped to a worker at a time. This is useful to make use of all the cores of each worker, especially if the application does not perform parallel computations within the tasks.

However, it might not be the ideal behavior for all applications, especially when the target tasks already perform parallel computations (e.g. a parallel for within a target nowait). In that case, the blocking behavior can be enabled for any scheduling strategy (roundrobin, heft, etc.) by setting the following environment variable:

export OMPCLUSTER_BLOCKING_SCHEDULER=1

When set, the scheduler behaves as a blocking scheduler: it only allows a single target task to be mapped to a worker at a time. This is useful to avoid competition for the hardware resources of the workers.

Tuning HEFT

HEFT is a heuristic-based list-scheduling algorithm that makes decisions based on:

  • Computation cost: how many time units are required to execute a single task;

  • Communication cost: how many time units are required to transfer data between two tasks.

Unfortunately, the runtime does not have this information ahead of time and expects the user to provide it via environment variables. Take a look at the example below. The first variable, OMPCLUSTER_HEFT_COMP_COST, indicates how long it takes to execute a task. The second variable, OMPCLUSTER_HEFT_COMM_COEF, specifies the communication cost as a coefficient of the computation cost. For example, a coefficient of 2 indicates that communication costs twice as much as computation. The actual units used here (milliseconds, seconds, minutes) are not as important as the relationship between the computation and communication costs.

# Computation cost in time units (could be milliseconds, seconds, ...)
# E.g. computation of a task takes 30 time units
export OMPCLUSTER_HEFT_COMP_COST="30"

# Communication cost as a coefficient of the computation cost
# E.g. communication of a dependency takes 2x as much time as a computation.
export OMPCLUSTER_HEFT_COMM_COEF="2.0"

Tip: Adjusting the costs may require some experimenting and iteration on the user side in order to find a good balance. Dumping the task graph (see next section) can also help here.

Note: The OmpCluster runtime does not yet support tasks and dependencies with different costs. If your application fits this category, we suggest setting an average value that covers the different cases.

Dumping the Task Graph

If you wish to inspect the final scheduled graph, you can use the OMPCLUSTER_TASK_GRAPH_DUMP_PATH environment variable to specify the path where the graph should be dumped using the GraphViz dot language.

# Specify a path where the task graph should be dumped (required)
export OMPCLUSTER_TASK_GRAPH_DUMP_PATH="<path>/<file_prefix>"

# Show the edge weights in the graph (optional)
export OMPCLUSTER_HEFT_DUMP_EDGE_LABEL=1
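
Once dumped, the graph can be rendered with the standard GraphViz command-line tools. A minimal sketch, assuming GraphViz is installed and using a placeholder for the dumped file name:

# Render a dumped graph to a PNG image (file name is a placeholder)
dot -Tpng <path>/<dumped_graph>.dot -o task_graph.png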

Dumping internal HEFT data

Note: This option is useful for developers who are looking to troubleshoot the execution of the algorithm itself. Regular users can safely ignore this option.

The following environment variable instructs the HEFT scheduler to dump its internal state before the application exits. This is useful if you are looking to inspect the EST (earliest start time), EFT (earliest finish time), AST (actual start time), and AFT (actual finish time) tables.

export OMPCLUSTER_HEFT_LOG="/path/to/heft.log"

Environment variables

This section describes the environment variables that can be used to tune the runtime.

OpenMP Target Runtime

Non-exhaustive list of the settings for the LLVM OpenMP Target runtime library. The full list is available on the upstream website. Please note that some settings might differ since our version of LLVM is not synchronized with the latest version of LLVM.

LIBOMPTARGET_INFO

The variable controls whether or not the runtime provides additional information during the execution. The output provided is intended for use by application developers. It is recommended to build your application with debugging information enabled, as this will include filenames and variable declarations in the information messages. More information on how to use this environment variable is available here.
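
A minimal usage sketch; the upstream documentation describes the value as a bitmask of information categories, and -1 (all bits set) is assumed here to request every available category:

# Request all available information messages (-1 sets all bits of the bitmask)
export LIBOMPTARGET_INFO=-1
./your-application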

LIBOMPTARGET_DEBUG

The variable controls whether or not debugging information will be displayed. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. The debugging output provided is intended for use by libomptarget developers.

LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD

LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size for which the libomptarget memory manager will handle the allocation. Any allocation larger than this threshold will not use the memory manager and will be freed after the device kernel exits. Contrary to the other libomptarget plugins, the OmpCluster runtime uses a bump allocator with a default threshold value of 8MB. If LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD is set to 0, the memory manager is completely disabled.
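
For instance, to disable the memory manager entirely, or to raise the threshold above the default 8MB (the 16MB value is purely illustrative, and the variable is assumed to be expressed in bytes):

# Disable the memory manager completely
export LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=0

# Or raise the threshold to 16MB (illustrative value, assumed to be in bytes)
export LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=$((16 * 1024 * 1024))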

LIBOMP_NUM_HIDDEN_HELPER_THREADS

The variable configures the number of threads used by the OpenMP runtime to offload target tasks. Those threads are called hidden helper threads. By default, the number of hidden helper threads is 8.

Note: Currently, the number of hidden helper threads should match the total number of execute event handlers to get the best performance.

OMPC Runtime

General settings of the OmpCluster runtime

OMPCLUSTER_PROFILE

The variable defines the path and file suffix used to dump the trace files generated by the profiler. The output is a set of JSON files named path/filename_suffix + "_" + process_name + ".json".

There is no default value; if not set, profiling is not performed and the execution traces are simply not saved to any file.

OMPCLUSTER_PROFILE_LEVEL

It controls which profiling events will be added to the generated trace. Use -1 to record them all. The default value is 1, which should be sufficient for most users.
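
A minimal profiling setup might look like the following (the path and file suffix are illustrative):

# Dump execution traces as /tmp/traces/myapp_<process_name>.json (illustrative prefix)
export OMPCLUSTER_PROFILE="/tmp/traces/myapp"

# Record all profiling events instead of only the default level
export OMPCLUSTER_PROFILE_LEVEL=-1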

OMPCLUSTER_DEBUG

Sets the level of OmpCluster debug messages. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. Default value is 0.

OMPCLUSTER_MPI_FRAGMENT_SIZE

Maximum buffer size sent in a single MPI message. If a buffer is larger than this threshold, it is automatically split into separate messages. The default value is 100000000 bytes (100 MB).

OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS

It controls the number of threads spawned per process that are responsible for executing target regions. The default value is 1.

OMPCLUSTER_NUM_DATA_EVENT_HANDLERS

It controls the number of threads spawned per process that are responsible for data-related events. The default value is 1.
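
For example, to spawn more executor and data-handler threads per worker process (the values below are illustrative and should be tuned to your workload, keeping in mind the note on LIBOMP_NUM_HIDDEN_HELPER_THREADS above):

# Illustrative values: 4 executor threads and 2 data-handler threads per process
export OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS=4
export OMPCLUSTER_NUM_DATA_EVENT_HANDLERS=2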

OMPCLUSTER_EVENT_POLLING_RATE

Polling rate used by the event handlers, in microseconds. Small values reduce the waiting time between checks but increase CPU usage. The default value is 1 us.

OMPCLUSTER_BCAST_STRATEGY

Sets how OmpCluster transfers data that must be broadcast (i.e. sent to all nodes).

  • disabled: Data on synchronous target data map regions will be sent to each device when needed by the memory management system.

  • p2p: Data on synchronous target data map regions will be sequentially sent to every device through peer-to-peer communication.

  • mpibcast: Data on synchronous target data map regions will be sent to every device using MPI_Bcast.

  • dynamicbcast: Data on synchronous target data map regions will be sent to every device using the Dynamic Broadcast algorithm.

Default value is disabled.
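
For example, to broadcast mapped data to all workers with MPI_Bcast instead of the default on-demand behavior:

# Use MPI_Bcast for data that must be sent to all workers
export OMPCLUSTER_BCAST_STRATEGY="mpibcast"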

OMPCLUSTER_ENABLE_PACKING

If set to 1, communication events are packed; if set to 0, they are not. When enabled, the runtime packs communication metadata and buffers that have less than OMPCLUSTER_PACKING_THRESHOLD bytes. Default value is 0.

OMPCLUSTER_PACKING_THRESHOLD

The value passed to OMPCLUSTER_PACKING_THRESHOLD defines the maximum size of the buffers that will be packed in a single MPI message. If OMPCLUSTER_PACKING_THRESHOLD=0 and OMPCLUSTER_ENABLE_PACKING=1, then only the communication metadata is packed. If OMPCLUSTER_ENABLE_PACKING is set to 0, OMPCLUSTER_PACKING_THRESHOLD is not used. Default value is 0.
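
A sketch that enables packing of communication metadata together with small buffers; the 64KB threshold is an illustrative value:

# Enable packing of communication events
export OMPCLUSTER_ENABLE_PACKING=1

# Pack buffers smaller than 64KB (illustrative value, in bytes)
export OMPCLUSTER_PACKING_THRESHOLD=65536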

OMPC Scheduler

Settings of the OmpCluster runtime specific to the task scheduler

OMPCLUSTER_SCHEDULER

Selects the strategy used by the scheduler to assign tasks to specific MPI processes. Currently, there are three available options, described below.

  • roundrobin (dynamic round-robin): schedules tasks when executing them by continuously going over the list of processes.

  • graph_roundrobin (static round-robin): schedules tasks when creating them by continuously going over the list of processes.

  • heft (Heterogeneous Earliest Finish Time): a more advanced heuristic that also takes the communication time into account. See Wikipedia for more details.

Default value is heft.

OMPCLUSTER_BLOCKING_SCHEDULER

This variable is used to configure the blocking behavior of the OMPC scheduler. A value of 0 allows multiple target regions (target tasks) to be executed in parallel by each MPI process, while a value of 1 restricts each process to a single target task at a time. Default value is 0.

Please note that the dynamic round-robin strategy will not assign a task to a busy process (i.e. one already executing a task) if the blocking behavior is enabled.

OMPCLUSTER_TASK_GRAPH_DUMP_PATH

Path used to dump the task graph generated by the scheduler, indicating to which MPI rank each task was assigned. The output is a dot file.

There is no default value; if not set, the task graph is simply not saved to any file.

HEFT parameters

Parameters used by the HEFT scheduler. Only used if OMPCLUSTER_SCHEDULER=heft. Those parameters are especially useful since the runtime is not yet able to predict the communication and computation time of the tasks.

OMPCLUSTER_HEFT_COMM_COEF

Coefficient of the communication cost. Having a higher coefficient means that communication will have more weight in relation to computation. Default value is 1.

OMPCLUSTER_HEFT_COMP_COST

Default computation cost of tasks. Default value is 100.

OMPCLUSTER_HEFT_COMM_COST

Default communication cost of the dependencies (data transfers) between tasks. Default value is 1.

Fault Tolerance

Settings of the OmpCluster runtime specific to fault tolerance.

OMPCLUSTER_FT_DISABLE

OMPCLUSTER_CP_USEVELOC

OMPCLUSTER_CP_EXECCFG

OMPCLUSTER_CP_TESTCFG

OMPCLUSTER_HB_TIMESTEP

OMPCLUSTER_HB_TIMEOUT

OMPCLUSTER_HB_PERIOD

OMPCLUSTER_CP_MTBF

OMPCLUSTER_CP_WSPEED