ADRs/0024 - Execution Tracing.md
Implemented
Proposed by: Adam Gibson (20 Mar 2023)
Discussed with: Paul Dubs
Finalized by: Adam Gibson (24 Mar 2023)
Reproducing a specific graph execution across the SameDiff and DL4J APIs can be challenging, as both use the same underlying libnd4j operations to execute code. Currently, users enable verbose or debug mode in the op executioner, observe which operations are executed, and manually compare the output of the two APIs. This approach is time-consuming and hard to repeat reliably.
In the context of this proposal, the term "vector" refers to a C++ std::vector that stores the metadata of each operation execution. It does not refer to a mathematical vector or to a tensor of the kind used in deep learning libraries. std::vector is a dynamic array container provided by the C++ Standard Library, used here to hold the sequence of operation executions.
To improve the process, we will save execution traces in a format from which a SameDiff graph can be generated that emulates the executed steps. Once tracing is enabled, each operation execution is appended to a vector. Only metadata is stored for each operation, such as input/output shapes and arguments, and entries are kept in execution order.
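The collection step described above can be sketched in plain Java. This is only an illustration under stated assumptions: `ExecutionTraceSketch`, `OpExecution`, and their methods are hypothetical names, not part of ND4J or libnd4j, and a Java `ArrayList` stands in for the C++ std::vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not part of ND4J or libnd4j.
final class ExecutionTraceSketch {

    // Metadata for one operation execution: shapes and arguments only, no array contents.
    static final class OpExecution {
        final String opName;
        final long[][] inputShapes;
        final long[][] outputShapes;
        final long[] intArgs;

        OpExecution(String opName, long[][] inputShapes, long[][] outputShapes, long[] intArgs) {
            this.opName = opName;
            this.inputShapes = inputShapes;
            this.outputShapes = outputShapes;
            this.intArgs = intArgs;
        }

        @Override
        public String toString() {
            return opName + " in=" + Arrays.deepToString(inputShapes)
                    + " out=" + Arrays.deepToString(outputShapes)
                    + " iArgs=" + Arrays.toString(intArgs);
        }
    }

    // Plays the role of the std::vector: entries are appended in execution order.
    private final List<OpExecution> executions = new ArrayList<>();
    private boolean enabled = false;

    void toggle(boolean on) {
        enabled = on;
    }

    // Called once per op execution; does nothing unless tracing is enabled.
    void record(OpExecution execution) {
        if (enabled) {
            executions.add(execution);
        }
    }

    // Returns a copy so callers cannot mutate the live trace.
    List<OpExecution> snapshot() {
        return new ArrayList<>(executions);
    }

    // Drops all collected metadata.
    void purge() {
        executions.clear();
    }

    public static void main(String[] args) {
        ExecutionTraceSketch trace = new ExecutionTraceSketch();
        trace.toggle(true);
        trace.record(new OpExecution("matmul",
                new long[][]{{2, 3}, {3, 4}}, new long[][]{{2, 4}}, new long[0]));
        trace.record(new OpExecution("softmax",
                new long[][]{{2, 4}}, new long[][]{{2, 4}}, new long[]{1}));
        trace.snapshot().forEach(System.out::println);
        trace.purge();
        trace.toggle(false);
    }
}
```

Because each entry holds only shapes and arguments, the trace stays small even for large arrays, and the insertion order alone is enough to replay the steps later.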
For instance, when executing a convolution operation, we can open a scope in C++ that marks the convolution as the current operation. This makes it possible to track the execution of the convolution together with the operations nested inside it, such as im2col.
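One way to picture the scoping idea is a stack of active operations, where every recorded entry carries the path of the scopes that enclose it. The sketch below is an assumption about how such tracking could look, written in Java rather than C++; `OpScopeSketch`, `Scope`, and `enter` are hypothetical names and do not reflect the actual libnd4j implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not part of ND4J or libnd4j.
final class OpScopeSketch {

    // AutoCloseable narrowed so that close() throws no checked exception.
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    // Operations currently executing, outermost first.
    private final Deque<String> activeScopes = new ArrayDeque<>();
    // One entry per op execution, recorded as "outer/inner".
    private final List<String> trace = new ArrayList<>();

    // Marks opName as the current operation and records it with its enclosing scopes.
    Scope enter(String opName) {
        activeScopes.addLast(opName);
        trace.add(String.join("/", activeScopes));
        // Closing the scope pops the operation again.
        return activeScopes::removeLast;
    }

    List<String> trace() {
        return new ArrayList<>(trace);
    }

    public static void main(String[] args) {
        OpScopeSketch tracer = new OpScopeSketch();
        try (Scope conv = tracer.enter("conv2d")) {
            try (Scope im2col = tracer.enter("im2col")) {
                // nested work would run here
            }
        }
        try (Scope relu = tracer.enter("relu")) {
            // top-level op executed after the convolution
        }
        System.out.println(tracer.trace()); // prints [conv2d, conv2d/im2col, relu]
    }
}
```

Using try-with-resources here guarantees that a scope is left even if the nested operation throws, so a failed op cannot leave later entries attributed to the wrong parent.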
Graph tracing can be enabled with the following call:
Nd4j.toggleTrace(true);
From the vector of executions, we can reconstruct a SameDiff graph. To collect the trace into a graph and save it, use:
SameDiff sd = SameDiff.collectTrace();
sd.save(new File("mygraph.fb"));
Afterward, purge the trace to prevent memory leaks:
Nd4j.purgeTrace();
Once the trace has been purged, you can disable tracing with:
Nd4j.toggleTrace(false);