Environment Variables

OneFlow has an extensive set of environment variables to tune for specific usage.

`ONEFLOW_COMM_NET_IB_HCA <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/comm_network/ibverbs/ibverbs_comm_network.cpp#L47>`_

When there are multiple IB NICs (which can be checked with ``ibstatus`` on the server), the system uses the first IB NIC for comm_net communication by default.

When this environment variable is set, the system checks all IB NICs and selects the one with the corresponding name. `#5626 <https://github.com/Oneflow-Inc/oneflow/pull/5626>`_

Values accepted
^^^^^^^^^^^^^^^
The default value is empty. Example values are ``mlx5_0:1`` and ``mlx5_1:1``. When the port is 0, it defaults to 1, which represents the first port.
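As a rough illustration of the ``name:port`` value format, the following Python sketch splits such a string. The helper ``parse_ib_hca`` is hypothetical and not part of OneFlow, and treating a missing port the same as port 0 is an assumption.

```python
import os


def parse_ib_hca(value: str):
    """Split 'name:port' into (name, port); hypothetical helper, not OneFlow code."""
    if not value:
        return None  # empty value: the first IB NIC is used by default
    name, _, port_text = value.partition(":")
    # Assumption: a missing port is handled like port 0.
    port = int(port_text) if port_text else 0
    if port == 0:
        port = 1  # port 0 falls back to 1, the first port
    return name, port


os.environ["ONEFLOW_COMM_NET_IB_HCA"] = "mlx5_1:1"
print(parse_ib_hca(os.environ["ONEFLOW_COMM_NET_IB_HCA"]))
```

The variable has to be present in the environment of the process that starts OneFlow for it to take effect.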

`ONEFLOW_COMM_NET_IB_GID_INDEX <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/comm_network/ibverbs/ibverbs_comm_network.cpp#L142>`_

Used for the `ibv_query_gid <https://www.ibm.com/docs/en/aix/7.2?topic=management-ibv-query-gid>`_ query, where 0 represents success. It is often used with ``ONEFLOW_COMM_NET_IB_HCA``. GID stands for Global ID; under a RoCE network, a QP must be built with this value instead of just the LID as in an IB network. `#5626 <https://github.com/Oneflow-Inc/oneflow/pull/5626>`_

Values accepted
^^^^^^^^^^^^^^^
The default value is 0, representing the port index value.

`ONEFLOW_COMM_NET_IB_QUEUE_DEPTH <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/comm_network/ibverbs/ibverbs_qp.cpp#L44>`_

Queue length of jobs in the IB network.

This value controls the size of the module instead of using IB's default size, similar to ``ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE``.

Values accepted
^^^^^^^^^^^^^^^
The default value is 1024, read as ``int64_t``. The system compares it with ``max_qp_wr`` (the maximum number of outstanding WRs on any work queue) and takes the smaller of the two.
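The comparison with ``max_qp_wr`` can be sketched in Python as follows. ``effective_queue_depth`` is a hypothetical helper for illustration only; in OneFlow the device limit comes from the IB device attributes, whereas here it is simply passed in as an argument.

```python
import os

DEFAULT_QUEUE_DEPTH = 1024  # documented default


def effective_queue_depth(max_qp_wr: int) -> int:
    """Return the smaller of the requested depth and the device limit."""
    requested = int(
        os.environ.get("ONEFLOW_COMM_NET_IB_QUEUE_DEPTH", DEFAULT_QUEUE_DEPTH)
    )
    return min(requested, max_qp_wr)
```

For example, requesting 4096 on a device that reports a limit of 2048 would give an effective depth of 2048 under this sketch.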

`ONEFLOW_COMM_NET_IB_MEM_BLOCK_SIZE <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/comm_network/ibverbs/ibverbs_qp.cpp#L68>`_

The size of the module read when communicating.

This value is used to calculate the number of modules, which are transmitted after encapsulation.

Values accepted
^^^^^^^^^^^^^^^
The default value is 8388608 (8 MiB).
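One plausible reading of how the block size determines the number of modules is a ceiling division of the payload size by the block size. The sketch below illustrates that arithmetic only; ``num_blocks`` is a hypothetical helper, and whether OneFlow computes the count exactly this way is an assumption.

```python
DEFAULT_MEM_BLOCK_SIZE = 8388608  # 8 MiB, the documented default


def num_blocks(total_bytes: int, block_size: int = DEFAULT_MEM_BLOCK_SIZE) -> int:
    """Ceiling division: how many blocks are needed to cover total_bytes."""
    return -(-total_bytes // block_size)
```

Under this assumption, a 20 MiB payload with the default block size would be split into three blocks.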

`ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/ep/cuda/cuda_device.cpp#L59>`_

Applies to the stream and marks blocking synchronization in CUDA. `Detailed information <https://www.cnblogs.com/1024incn/p/5891051.html>`_, `#5612 <https://github.com/Oneflow-Inc/oneflow/pull/5612>`_, `#5837 <https://github.com/Oneflow-Inc/oneflow/pull/5837>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.
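Several variables in this document share the same boolean convention: only ``1``, ``true``, ``yes``, ``on``, and ``y`` count as true. A minimal Python sketch of that rule follows. ``env_flag`` is a hypothetical helper, and the case-insensitive comparison and whitespace stripping are assumptions rather than documented behavior.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on", "y"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # variable not defined: use the default
    # Assumption: surrounding whitespace and letter case are ignored.
    return raw.strip().lower() in TRUE_VALUES
```

Any other value, such as ``0`` or ``no``, evaluates to false in this sketch.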

`ONEFLOW_LIBIBVERBS_PATH <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/platform/lib/ibv_wrapper.cpp#L24>`_

Loads the dynamic library with ``dlopen`` at runtime and looks up the symbols of the ibverbs functions without linking at compile time, for better compatibility. `#4852 <https://github.com/Oneflow-Inc/oneflow/pull/4852>`_

If loading fails, the output is ``libibverbs not available, ibv_fork_init skipped``. If it succeeds, ``import oneflow`` outputs something like ``loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1``.

Values accepted
^^^^^^^^^^^^^^^
The default value is empty, in which case ``libibverbs.so.1`` and ``libibverbs.so`` are loaded.
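The lookup can be approximated in Python with ``ctypes``, which wraps ``dlopen`` on Linux. This is only a sketch: ``load_ibverbs`` is a hypothetical helper, and the assumption that a user-supplied path replaces the two default names (rather than being tried in addition to them) is not confirmed by this document.

```python
import ctypes
import os


def load_ibverbs(loader=ctypes.CDLL):
    """Try candidate library names in order; return (name, handle) or (None, None)."""
    user_path = os.environ.get("ONEFLOW_LIBIBVERBS_PATH", "")
    # Assumption: a user-supplied path replaces the default candidates.
    candidates = [user_path] if user_path else ["libibverbs.so.1", "libibverbs.so"]
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            continue  # this candidate could not be loaded, try the next one
    return None, None
```

The ``loader`` parameter exists only so the search order can be exercised without the real library being installed.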

`ONEFLOW_DEBUG_MODE <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/common/env_var/debug_mode.h#L23>`_

Enables debug mode; ``ONEFLOW_DEBUG`` works as well.

If debug mode is on, more INFO-level logs are output, and various prototxt and dot files are written. Under eager global mode, the automatically inserted boxing information is printed to the log file.

Values accepted
^^^^^^^^^^^^^^^
The default value is empty, but any string is accepted.

`ONEFLOW_DRY_RUN <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/job/resource_desc.cpp#L65>`_

Only for test runs; it can generate log files such as dot files.

Exits once the test succeeds and does not attempt real training.

Values accepted
^^^^^^^^^^^^^^^
The default value is empty, but any string is accepted.

`ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/lazy/stream_context/cuda/cuda_stream_context.cpp#L66>`_

Only used for debugging, because performance is affected. It can detect which op in the network produces nan or inf.

It creates ``CpuCheckNumericsKernelObserver`` on CPU and ``CudaCheckNumericsKernelObserver`` on CUDA. `#6052 <https://github.com/Oneflow-Inc/oneflow/pull/6052>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_DEBUG_KERNEL_SYNC_CHECK <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/job/env_global_objects_scope.cpp#L193>`_

Only used when debugging because the performance would be affected.

It creates ``SyncCheckKernelObserver`` and synchronizes after each kernel.

It can be used to debug CUDA errors. `#6052 <https://github.com/Oneflow-Inc/oneflow/pull/6052>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_PROFILER_KERNEL_PROFILE_CUDA_MEMORY_BANDWIDTH <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/profiler/kernel.cpp#L34>`_

Used when generating profiler files with nsys.

The profiler is currently only valid in lazy mode.

It estimates the memory bandwidth reached by a kernel from the execution time of the GPU kernel and the size of its input and output memory, which helps find kernels that can potentially be optimized. `Details <https://github.com/Oneflow-Inc/oneflow/blob/02e29f9648f63a4d936cd818061e90064d027005/oneflow/core/profiler/kernel.cpp#L53>`_
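The bandwidth estimate amounts to dividing the bytes read and written by the kernel's execution time. The Python sketch below shows that arithmetic; ``estimated_bandwidth_gb_per_s`` is a hypothetical helper, and the choice of nanoseconds and decimal gigabytes as units is an assumption made for the example, not a description of the profiler's actual output format.

```python
def estimated_bandwidth_gb_per_s(
    input_bytes: int, output_bytes: int, kernel_time_ns: float
) -> float:
    """Estimate memory bandwidth from bytes moved and kernel execution time."""
    if kernel_time_ns <= 0:
        raise ValueError("kernel_time_ns must be positive")
    total_bytes = input_bytes + output_bytes
    # Bytes per nanosecond is numerically equal to 1e9 bytes per second (GB/s).
    return total_bytes / kernel_time_ns
```

A kernel whose estimate falls far below the device's peak memory bandwidth is a candidate for optimization.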

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default. To use it, the compiled package needs to have ``BUILD_PROFILER`` enabled.

`ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/profiler/kernel.cpp#L36>`_

The same as above. `Collect op name <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/profiler/kernel.cpp#L62>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default. To use it, the compiled package needs to have ``BUILD_PROFILER`` enabled.

`ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/job/env_global_objects_scope.cpp#L199>`_

Disables blob_access_checker when set. blob_access_checker exists for correctness assurance, and turning it off can reduce kernel overhead in some cases. `#5728 <https://github.com/Oneflow-Inc/oneflow/pull/5728>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/kernel/user_kernel.cpp#L692>`_

Takes effect under ``WITH_CUDA_GRAPHS``; the default value is ``false``. It uses more memory, so it will not run when there is only just enough memory.

Turning on CUDA_GRAPH uses more memory. CUDA Graphs support: `#5868 <https://github.com/Oneflow-Inc/oneflow/pull/5868>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/thread/thread.cpp#L30>`_

LightActor is a new type of Actor that only handles NormalForward and similar tasks where all regst_num is 1, or tasks with only one kernel. `#5868 <https://github.com/Oneflow-Inc/oneflow/pull/5868>`_

The following settings can be used together:

- ``export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1`` (uses more memory)
- ``export ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE=1``
- ``export ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER=1``
- ``export ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR=1``
- ``export ONEFLOW_STREAM_REUSE_CUDA_EVENT=1``
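When the process is launched from Python rather than a shell, the same combination of variables can be set through ``os.environ``. This sketch assumes the variables are read when OneFlow initializes, so they are set before ``import oneflow``; the import itself is left commented out so the snippet runs without OneFlow installed.

```python
import os

# The five variables that this section says can be used together.
LIGHT_ACTOR_ENV = {
    "ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH": "1",  # uses more memory
    "ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE": "1",
    "ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER": "1",
    "ONEFLOW_ACTOR_ENABLE_LIGHT_ACTOR": "1",
    "ONEFLOW_STREAM_REUSE_CUDA_EVENT": "1",
}

os.environ.update(LIGHT_ACTOR_ENV)

# Assumption: the variables must be set before OneFlow is imported.
# import oneflow as flow
```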

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/thread/thread.cpp#L29>`_

Enables the local message queue; ``oneflow.config.thread_enable_local_message_queue(True)`` is no longer used. `#5720 <https://github.com/Oneflow-Inc/oneflow/pull/5720>`_

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``false`` by default; it is ``true`` only when the value is ``1``, ``true``, ``yes``, ``on``, or ``y``.

`ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/persistence/persistent_in_stream.cpp#L30>`_

Represents the size of each read from disk. `#5162 <https://github.com/Oneflow-Inc/oneflow/pull/5162>`_

Values accepted
^^^^^^^^^^^^^^^
The default value is empty. If an invalid string or a negative number is entered, the default of 32 * 1024 bytes (32 KB) is used.
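The fallback rule for invalid or negative values can be sketched in Python as follows. ``in_stream_buffer_size`` is a hypothetical helper; how OneFlow treats a value of zero is not stated here, so the sketch simply passes it through.

```python
import os

DEFAULT_BUFFER_SIZE = 32 * 1024  # 32 KB, the documented fallback


def in_stream_buffer_size() -> int:
    """Read the buffer size, falling back to 32 KB on invalid or negative input."""
    raw = os.environ.get("ONEFLOW_PERSISTENT_IN_STREAM_BUFFER_SIZE_BYTES", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_BUFFER_SIZE  # empty or non-numeric string
    if value < 0:
        return DEFAULT_BUFFER_SIZE  # negative number
    # Assumption: zero is passed through unchanged.
    return value
```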

`ONEFLOW_DECODER_ENABLE_NVJPEG_HARDWARE_ACCELERATION <https://github.com/Oneflow-Inc/oneflow/blob/v1.0.0/oneflow/core/kernel/image_decoder_random_crop_resize_kernel.cpp#L290>`_

``NVJPEG_VER_MAJOR`` needs to be greater than 11. It enables nvJPEG hardware acceleration and warms up the jpeg decoder and the hw_jpeg decoder. `#5851 <https://github.com/Oneflow-Inc/oneflow/pull/5851>`_

Hardware JPEG decoder and NVIDIA nvJPEG library on NVIDIA A100 GPUs

Values accepted
^^^^^^^^^^^^^^^
Defined and set to ``true`` by default; when a value is given, it is ``true`` only for ``1``, ``true``, ``yes``, ``on``, or ``y``.