libnd4j/dev-docs/AddingNewOps.md
There are multiple Op designs supported in libnd4j, and in this guide we'll explain how to build your very own operation.
These operations are split into multiple subtypes, based on element access and result type.
Despite the differences between these operations, they all use the XZ/XYZ three-operand design, where X and Y are inputs and Z is the output. Data access in these operations is usually trivial and loop-based. For example, the most trivial loop for a scalar transform looks like this:
```cpp
for (sd::LongType i = start; i < end; i++) {
    result[i] = OpType::op(x[i], scalar, extraParams);
}
```
The operation used in this loop is template-driven and compiled statically. There are other loop implementations, depending on the op group or the strides within the NDArrays, but the idea is always the same: each element of the NDArray is accessed within a loop.
Now, let's take a look at a typical XYZ op implementation. Here's how the Add operation looks:
```cpp
template <typename T>
class Add {
public:
    SD_OP_DEF static T op(T d1, T d2) {
        return d1 + d2;
    }

    // this signature will be used in Scalar loops
    SD_OP_DEF static T op(T d1, T d2, T *params) {
        return d1 + d2;
    }

    // this signature will be used in reductions
    SD_OP_DEF static T op(T d1) {
        return d1;
    }

    // op for MetaOps
    SD_OP_DEF static T op(T d1, T *params) {
        return d1 + params[0];
    }
};
```
This particular operation is used in different XYZ op groups, but you can see the idea: an element-wise operation, invoked on each element of the given NDArray.
So, if you want to add a new XYZ operation to libnd4j, you just add the operation implementation to the file includes/ops/ops.h, and assign it to a specific ops group in the file includes/loops/legacy_ops.h together with a number unique within that ops group, e.g.: (21, simdOps::Add).
After libnd4j is recompiled, this op becomes available through the legacy execution mechanism, the NDArray wrappers, and the LegacyOp wrappers (the latter map legacy operations onto the CustomOps design used by Graph).
Custom operations are a newer concept, added recently, and mostly suit SameDiff/Graph needs. For CustomOps we defined a universal signature, with a variable number of input/output NDArrays and a variable number of floating-point and integer arguments. However, there are some minor differences between the various CustomOp declarations.
Let's take a look at an example CustomOp:
```cpp
CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1) {
    auto input = INPUT_VARIABLE(0);

    REQUIRE_TRUE(!block.getIArguments()->empty(), 0, "At least 1 dimension should be specified for Tear");

    std::vector<int> dims(*block.getIArguments());

    for (auto &v: dims)
        REQUIRE_TRUE(v >= 0 && v < input->rankOf(), 0, "Tear dimensions should be non-negative values, and lower than input rank. Got %i instead", v);

    auto tads = input->allTensorsAlongDimension(dims);
    for (int e = 0; e < tads->size(); e++) {
        auto outE = OUTPUT_VARIABLE(e);
        outE->assign(tads->at(e));

        this->storeResult(block, e, *outE);
    }

    delete tads;

    return sd::Status::OK;
}
```
```cpp
DECLARE_SHAPE_FN(tear) {
    auto inShape = inputShape->at(0);

    std::vector<int> dims(*block.getIArguments());
    if (dims.size() > 1)
        std::sort(dims.begin(), dims.end());

    shape::TAD tad(inShape, dims.data(), (int) dims.size());
    tad.createTadOnlyShapeInfo();
    sd::LongType numTads = shape::tadLength(inShape, dims.data(), (int) dims.size());

    auto result = SHAPELIST();
    for (int e = 0; e < numTads; e++) {
        result->push_back(tad.tadOnlyShapeInfo);
    }

    return result;
}
```
In the example above, we declare the tear CustomOp implementation and the shape function for this op.
So, at the moment of op execution, we assume that output array(s) are either provided by the end user or generated by the shape function.
You can also see a number of macros being used; we'll cover those later as well. Beyond that, the op execution logic is fairly simple and linear:
Each new op implements the protected member function `DeclarableOp<T>::validateAndExecute(Block<T>& block)`, and this method is eventually called either from GraphExecutioner or via a direct call, like `DeclarableOp<T>::execute(Block<T>& block)`.
An important part of an op declaration is the input/output description for the op, e.g. as shown above: `CUSTOM_OP_IMPL(tear, 1, -1, false, 0, -1)`.
This declaration means:
- op name: tear
- 1 input NDArray is expected
- a variable number of output NDArrays will be produced (-1)
- the op cannot be executed in-place (false)
- 0 floating-point arguments are expected
- a variable number of integer arguments is expected (-1)

Here's another example: `DECLARE_CUSTOM_OP(permute, 1, 1, true, 0, -2);`
This declaration means:
- op name: permute
- 1 input NDArray is expected
- 1 output NDArray will be produced
- the op can be executed in-place (true)
- 0 floating-point arguments are expected
- integer arguments are optional (-2)

In ops you can easily use C++11 features, including lambdas. In some cases the easiest way to build your custom op (or some part of it) is via NDArray::applyLambda or NDArray::applyPairwiseLambda:
```cpp
auto lambda = LAMBDA_TT(_x, _y) {
    return (_x + _y) * 2;
};

x.applyPairwiseLambda(&y, lambda);
```
In this simple example, each element of NDArray x gets its value set to x[e] = (x[e] + y[e]) * 2.
For tests, libnd4j uses the Google Test suite. All tests are located in the tests_cpu/layers_tests folder. Here's a simple way to run them from the command line:
```bash
cd tests_cpu
cmake -G "Unix Makefiles"
make -j 4
./layers_tests/runtests
```
You can also use your IDE (e.g. JetBrains CLion) to run tests via the GUI.
PLEASE NOTE: if you're considering submitting your new op to the libnd4j repository via a pull request, please add tests for it. Ops without tests won't be approved.
GPU/MPI/whatever to be added soon.
We have a number of utility macros suitable for custom ops, such as the INPUT_VARIABLE, OUTPUT_VARIABLE, and REQUIRE_TRUE macros used above.

Template methods must be explicitly instantiated for the different data types used in the libraries. Furthermore, to speed up parallel compilation, those template instantiations should go into separate source files. Another reason is that some compilers choke when a single translation unit contains too many template instantiations.
To ease this cumbersome process we have a CMake helper and helper macros.
Example: suppose we have the following function:
```cpp
template <typename X, typename Z>
void argMin_(const NDArray& input, NDArray& output, const std::vector<sd::LongType>& dimensions);
```
To explicitly instantiate it, we write:
```cpp
BUILD_DOUBLE_TEMPLATE(template void argMin_, (const NDArray& input, NDArray& output, const std::vector<sd::LongType>& dimensions),
                      SD_COMMON_TYPES, SD_INDEXING_TYPES);
```
Here, BUILD_DOUBLE_TEMPLATE generates one explicit instantiation of argMin_ for each combination of types from the SD_COMMON_TYPES and SD_INDEXING_TYPES lists.
But to speed up the compilation process, and to help compilers, we can split this further into different source files. First, we rename the original template source to use the hpp extension.
Second, we add a file with the suffix cpp.in (or cu.in for CUDA) that includes that hpp header, and place it in the appropriate compilation units folder. In our case that is the ./libnd4j/include/ops/declarable/helpers/cpu/compilation_units folder, with the name argmax.cpp.in.
Next, we decide which type list we want to split across different source files. In our case we want to split SD_COMMON_TYPES (the other options are SD_INTEGER_TYPES, SD_FLOAT_TYPES, and SD_PAIRWISE_TYPES). We hint CMake about this case with the following (adding the _GEN suffix):
```cpp
#cmakedefine SD_COMMON_TYPES_GEN
```
Then we just add _@FL_TYPE_INDEX@ as a suffix to the type-list name; CMake will split those types for us and generate cpp files inside the ${CMAKE_BINARY_DIR}/compilation_units folder:

```cpp
LIBND4J_TYPE_@FL_TYPE_INDEX@
```
Here is how the complete cpp.in file will look:
```cpp
#cmakedefine SD_COMMON_TYPES_GEN

// this header is where our template functions reside
#include <ops/declarable/helpers/cpu/indexReductions.hpp>

// guard against undefined cases
#if defined(SD_COMMON_TYPES_GEN) && defined(SD_COMMON_TYPES_@FL_TYPE_INDEX@)
namespace sd {
namespace ops {
namespace helpers {
    BUILD_DOUBLE_TEMPLATE(template void argMax_, (const NDArray& input, NDArray& output, const std::vector<sd::LongType>& dimensions),
                          SD_COMMON_TYPES_@FL_TYPE_INDEX@, SD_INDEXING_TYPES);
}
}
}
#endif
```