Advanced Usage¶
Tuning¶
This document gathers the strategies users can apply when tuning and optimizing OMPC applications.
Threads¶
Currently, to get the best performance from the runtime, the number of threads in the head process must match the total number of execute event handlers across all workers. We suggest setting the environment variable LIBOMP_NUM_HIDDEN_HELPER_THREADS to OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS * num workers. This is a known limitation of our runtime; we are working on fixing it upstream (see here) and in one of our next releases.
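The suggestion above can be sketched as a small shell snippet. NUM_WORKERS is a hypothetical helper variable that is not read by the runtime, and the values 4 and 2 are placeholders to adapt to your own job.

```shell
# Hypothetical job layout: adjust both values to match your own run.
NUM_WORKERS=4                                  # worker processes (assumed helper variable, not read by the runtime)
export OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS=2    # execute event handlers per worker

# Head-process helper threads = execute event handlers per worker * number of workers.
export LIBOMP_NUM_HIDDEN_HELPER_THREADS=$((OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS * NUM_WORKERS))
echo "LIBOMP_NUM_HIDDEN_HELPER_THREADS=${LIBOMP_NUM_HIDDEN_HELPER_THREADS}"
```

Computing the value rather than hard-coding it keeps the two settings consistent when the number of workers changes between runs.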
Scheduler¶
To map target tasks to devices (i.e. worker nodes), the OmpCluster runtime uses the HEFT scheduling algorithm by default.
The Round-Robin scheduling algorithm is also available. This algorithm is fast to execute but generally produces poorer schedules. Still, users can experiment with it by setting the following environment variable:
export OMPCLUSTER_SCHEDULER="roundrobin"
With this variable set, the runtime uses round-robin instead of HEFT the next time your application is executed.
Blocking Scheduler¶
By default, the OMPC scheduler allows multiple target tasks to be mapped to a worker at the same time. This is useful for using all cores of each worker, especially when the tasks contain no parallel computation themselves.
However, this might not be the ideal behavior for all applications, especially when target tasks already perform parallel computations (e.g. a parallel for within a target nowait). In that case, the blocking behavior can be enabled for any scheduling strategy (roundrobin, heft, etc.) by setting the following environment variable:
export OMPCLUSTER_BLOCKING_SCHEDULER=1
When set, the scheduler behaves as a blocking scheduler: it only allows a single target task to be mapped to a worker at a time. This is useful to avoid competition for the hardware resources of the workers.
Tuning HEFT¶
HEFT is a heuristic-based list-scheduling algorithm that makes decisions based on:
Computation cost: how many time units are required to execute a single task;
Communication cost: how many time units are required to transfer data between two tasks.
Unfortunately, the runtime does not have this information ahead of time and expects the user to provide it via environment variables, as in the example below. The first variable, OMPCLUSTER_HEFT_COMP_COST, indicates how long it takes to execute a task. The second variable, OMPCLUSTER_HEFT_COMM_COEF, specifies the communication cost as a coefficient relative to the computation cost. For example, a coefficient of 2 indicates that communication costs twice as much as computation.
The actual units here (milliseconds, seconds, minutes) are not as important as the relationship between computation and communication cost.
# Computation cost in time units (could be milliseconds, seconds, ...)
# E.g. computation of a task takes 30 time units
export OMPCLUSTER_HEFT_COMP_COST="30"
# Communication cost as a coefficient of computation cost
# E.g. communication of a dependency takes 2x as much time as a computation.
export OMPCLUSTER_HEFT_COMM_COEF="2.0"
Tip: Adjusting the costs may require some experimentation and iteration on the user's side to find a good balance. Dumping the task graph (see next section) can also help here.
Note: The OmpCluster runtime does not yet support tasks and dependencies with different costs. If your application fits this category, we suggest setting an average value to cover both cases.
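One possible way to derive these settings from your own measurements is sketched below. The timing values and the average helper are hypothetical, and the coefficient variable uses the name listed in the HEFT parameters reference (OMPCLUSTER_HEFT_COMM_COEF). The sketch assumes the coefficient is the ratio of average communication time to average computation time, as described above.

```shell
# Hypothetical measurements, all in the same (arbitrary) time unit.
TASK_TIMES="20 30 40"        # durations of a few target tasks
TRANSFER_TIMES="50 60 70"    # durations of a few dependency transfers

# Print the average of a whitespace-separated list, rounded to an integer.
average() {
    echo "$1" | LC_ALL=C awk '{ for (i = 1; i <= NF; i++) sum += $i; printf "%.0f", sum / NF }'
}

AVG_COMP=$(average "$TASK_TIMES")
AVG_COMM=$(average "$TRANSFER_TIMES")

# A single average cost covers tasks of different sizes.
export OMPCLUSTER_HEFT_COMP_COST="$AVG_COMP"
# Coefficient = average communication time / average computation time.
export OMPCLUSTER_HEFT_COMM_COEF="$(LC_ALL=C awk -v c="$AVG_COMM" -v t="$AVG_COMP" 'BEGIN { printf "%.1f", c / t }')"
echo "COMP_COST=${OMPCLUSTER_HEFT_COMP_COST} COMM_COEF=${OMPCLUSTER_HEFT_COMM_COEF}"
```

The resulting values are only a starting point; the schedule should still be checked and the costs adjusted by experiment.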
Dumping the Task Graph¶
If you wish to inspect the final scheduled graph, you can use the OMPCLUSTER_TASK_GRAPH_DUMP_PATH environment variable to specify a path where the graph is dumped in the GraphViz dot language.
# Specify a path where the task graph should be dumped (required)
export OMPCLUSTER_TASK_GRAPH_DUMP_PATH="<path>/<file_prefix>"
# Show the edge weights in the graph (optional)
export OMPCLUSTER_HEFT_DUMP_EDGE_LABEL=1
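Once the application has run, the dumped graph can be rendered with the Graphviz dot tool. The sketch below is only an illustration: the prefix /tmp/ompc_graphs/my_app and the render_task_graphs helper are hypothetical, and it assumes that every dumped file starts with the configured prefix, since the exact file names are not specified here.

```shell
# Hypothetical dump prefix; replace it with your own <path>/<file_prefix>.
export OMPCLUSTER_TASK_GRAPH_DUMP_PATH="/tmp/ompc_graphs/my_app"

# Render every file that starts with the given prefix to SVG (requires Graphviz).
render_task_graphs() {
    prefix="$1"
    for f in "${prefix}"*; do
        case "$f" in *.svg) continue ;; esac   # skip images produced by a previous call
        [ -f "$f" ] || continue                # nothing was dumped under this prefix
        dot -Tsvg "$f" -o "${f}.svg"
    done
}

# Run the application first, then render whatever was dumped.
render_task_graphs "$OMPCLUSTER_TASK_GRAPH_DUMP_PATH"
```

SVG output is convenient for large graphs because it can be zoomed in a browser; other formats are available through the -T option of dot.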
Dumping internal HEFT data¶
Note: This option is useful for developers who are looking to troubleshoot the execution of the algorithm itself. Regular users can safely ignore this option.
The following environment variable instructs the HEFT scheduler to dump its internal state before the application exits. This is useful if you are looking to inspect the EST, EFT, AST, and AFT tables.
export OMPCLUSTER_HEFT_LOG="/path/to/heft.log"
Environment variables¶
This section describes the environment variables that can be used to tune the runtime.
OpenMP Target Runtime¶
Non-exhaustive list of the settings for the LLVM OpenMP Target runtime library. The full list is available on the upstream website. Please note that some settings might differ since our version of LLVM is not synchronized with the latest version of LLVM.
LIBOMPTARGET_INFO¶
The variable controls whether or not the runtime provides additional information during the execution. The output provided is intended for use by application developers. It is recommended to build your application with debugging information enabled; this enables filenames and variable declarations in the information messages. More information on how to use this environment variable is available here.
LIBOMPTARGET_DEBUG¶
The variable controls whether or not debugging information will be displayed. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. The debugging output provided is intended for use by libomptarget developers.
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD¶
LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD sets the threshold size for which the libomptarget memory manager will handle the allocation. Any allocations larger than this threshold will not use the memory manager and will be freed after the device kernel exits. Contrary to the other libomptarget plugins, the OmpCluster runtime uses a bump allocator with a default threshold value of 8MB. If LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD is set to 0, the memory manager will be completely disabled.
OMPC Runtime¶
General settings of the OmpCluster runtime
OMPCLUSTER_PROFILE¶
The variable defines the path and file suffix used to dump the trace files generated by the profiler. The output is a set of JSON files named path/filename_suffix + "_" + process_name + ".json".
There is no default value: if the variable is not set, profiling is not performed and no execution traces are saved.
OMPCLUSTER_PROFILE_LEVEL¶
Controls which profiling events are added to the generated trace. Use -1 to print them all. The default value is 1, which should be sufficient for most users.
OMPCLUSTER_DEBUG¶
Sets the level of OmpCluster debug messages. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. Default value is 0.
OMPCLUSTER_MPI_FRAGMENT_SIZE¶
Maximum buffer size sent in a single MPI message. If a buffer is larger than this threshold, it is automatically split into separate messages. The default value is 100000000 bytes (100 MB).
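As a worked example of the arithmetic, the snippet below estimates how many MPI messages a buffer needs for a given fragment size. The 250 MB buffer is hypothetical, and the estimate assumes the runtime splits a buffer into full-size fragments plus one smaller remainder, which is not stated explicitly here.

```shell
FRAGMENT_SIZE=100000000   # bytes, the documented default of OMPCLUSTER_MPI_FRAGMENT_SIZE
BUFFER_SIZE=250000000     # bytes, hypothetical buffer to transfer

# Ceiling division: two full fragments plus one partial fragment.
NUM_MESSAGES=$(( (BUFFER_SIZE + FRAGMENT_SIZE - 1) / FRAGMENT_SIZE ))
echo "messages needed: ${NUM_MESSAGES}"
```

With these values the buffer would need 3 messages under that assumption.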
OMPCLUSTER_NUM_EXEC_EVENT_HANDLERS¶
Controls the number of threads spawned per process to execute target regions. The default value is 1.
OMPCLUSTER_NUM_DATA_EVENT_HANDLERS¶
Controls the number of threads spawned per process to handle data-related events. The default value is 1.
OMPCLUSTER_EVENT_POLLING_RATE¶
Polling rate used by event handlers, in microseconds. Smaller values reduce the waiting time between checks but increase CPU usage. Default value is 1 µs.
OMPCLUSTER_BCAST_STRATEGY¶
Sets how OmpCluster transfers data that must be broadcast (i.e. sent to all nodes).
Mode | Description |
---|---|
disabled | Data on synchronous target data map regions will be sent to each device when needed by the memory management system. |
p2p | Data on synchronous target data map regions will be sequentially sent to every device through peer-to-peer communication. |
mpibcast | Data on synchronous target data map regions will be sent to every device using MPI_Bcast. |
dynamicbcast | Data on synchronous target data map regions will be sent to every device using the Dynamic Broadcast algorithm. |
Default value is disabled.
OMPCLUSTER_ENABLE_PACKING¶
A value of 1 enables communication events to be packed, while a value of 0 disables packing. When enabled, the runtime packs communication metadata and buffers smaller than OMPCLUSTER_PACKING_THRESHOLD bytes. Default value is 0.
OMPCLUSTER_PACKING_THRESHOLD¶
Defines the maximum size of the buffers that will be packed in a single MPI message. If OMPCLUSTER_PACKING_THRESHOLD=0 and OMPCLUSTER_ENABLE_PACKING=1, then only the communication metadata is packed. If OMPCLUSTER_ENABLE_PACKING is set to 0, OMPCLUSTER_PACKING_THRESHOLD is not used. Default value is 0.
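The two packing variables are meant to be used together. A minimal configuration sketch follows; the 4096-byte threshold is a hypothetical value chosen only for illustration, not a recommendation.

```shell
# Enable packing of communication events.
export OMPCLUSTER_ENABLE_PACKING=1
# Also pack buffers smaller than 4096 bytes (hypothetical value);
# with a threshold of 0, only the communication metadata would be packed.
export OMPCLUSTER_PACKING_THRESHOLD=4096
```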
OMPC Scheduler¶
Settings of the OmpCluster runtime specific to the task scheduler
OMPCLUSTER_SCHEDULER¶
Selects the strategy used by the scheduler to assign tasks to specific MPI processes. Currently, there are three available options, described below.
Scheduler | Description |
---|---|
roundrobin | Dynamic round-robin: schedules tasks when they are executed by continuously cycling over the list of processes. |
graph_roundrobin | Static round-robin: schedules tasks when they are created by continuously cycling over the list of processes. |
heft | Heterogeneous Earliest Finish Time (HEFT): a more advanced heuristic that takes the communication time into account. See Wikipedia for more details. |
Default value is heft.
OMPCLUSTER_BLOCKING_SCHEDULER¶
This variable is used to configure the blocking behavior of the OMPC scheduler. A value of 0 enables multiple target regions (target tasks) to be executed at the same time in parallel by each MPI process, and a value of 1 does not. Default value is 0.
Please note that the dynamic round-robin strategy will not assign a task to a busy process (one already executing a task) if the blocking behavior is enabled.
OMPCLUSTER_TASK_GRAPH_DUMP_PATH¶
Path used to dump the task graph generated by the scheduler, indicating to which MPI rank each task was assigned. The output is a dot file.
There is no default value: if the variable is not set, the task graph is not saved to any file.
HEFT parameters¶
Parameters used by the HEFT scheduler. Only used if OMPCLUSTER_SCHEDULER=heft.
These parameters are especially useful since the runtime is not yet able to predict the communication and computation time of the tasks.
OMPCLUSTER_HEFT_COMM_COEF¶
Coefficient of the communication cost. Having a higher coefficient means that communication will have more weight in relation to computation. Default value is 1.
OMPCLUSTER_HEFT_COMP_COST¶
Default computation cost of tasks. Default value is 100.
OMPCLUSTER_HEFT_COMM_COST¶
Default communication cost of tasks. Default value is 1.