ADRs/0003-NdArray_Strides_ArmCompute.md
Status: Implemented
Proposed by: Abdelrauf (23/09/2020)
Discussed with:
During the integration of our library with arm_compute, I found that our NdArray strides are not flexible (i.e. they cannot be set properly without special, manual handling).
Let's say our NdArray shape is [3,4,2] and the last index moves fastest (i.e. C order). Then our strides will be [8, 2, 1].
As far as I know, our last index stride can be different (it is called ews), but otherwise the strides should follow this strict cyclic dependency rule:
strides[index-1] = strides[index] * shapes[index];
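The rule above can be sketched as follows. `cOrderStrides` is a hypothetical helper for illustration, not an existing libnd4j function:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compute C-order (row-major) strides in elements, assuming the strict
// cyclic rule strides[i-1] = strides[i] * shapes[i] with ews == 1.
std::vector<size_t> cOrderStrides(const std::vector<size_t> &shape) {
    std::vector<size_t> strides(shape.size(), 1);
    // the last index moves fastest, so strides[rank-1] == 1
    for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
        strides[i] = strides[i + 1] * shape[i + 1];
    return strides;
}
```

For the shape [3,4,2] this yields exactly the strides [8, 2, 1] from the example above.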
In arm_compute, besides strides there is also a Padding {top, right, bottom, left} that can be used to increase strides and change the offset, as well as the total size. This is mostly done for performance reasons. From the above we can see that it simply hosts the NdArray shape inside the buffer of a bigger NdArray shape. In arm_compute those paddings are applied to the last two dimensions (for NCHW that is H and W). We can define it like this:
newH = pad.top + H + pad.bottom;
newW = pad.left + W + pad.right;
so the strides will be calculated for the shape {N, C, newH, newW}, and the offset of the first element will be:
offset = pad.left * strideOfNewW + pad.top * strideOfNewH
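The calculation above can be sketched in code. The names (`Pad2D`, `paddedStrides`) are illustrative only, not arm_compute's API; strides here are in elements:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Pad2D { size_t top, right, bottom, left; };

struct PaddedNCHW {
    std::array<size_t, 4> strides; // strides for {N, C, newH, newW}
    size_t offset;                 // element offset of the first element
};

PaddedNCHW paddedStrides(size_t N, size_t C, size_t H, size_t W, Pad2D p) {
    size_t newH = p.top + H + p.bottom;
    size_t newW = p.left + W + p.right;
    PaddedNCHW r;
    r.strides[3] = 1;                 // W moves fastest (C order)
    r.strides[2] = newW;              // strideOfNewH
    r.strides[1] = newH * newW;
    r.strides[0] = C * newH * newW;
    // offset = pad.left * strideOfNewW + pad.top * strideOfNewH
    r.offset = p.left * r.strides[3] + p.top * r.strides[2];
    return r;
}
```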
Introduce helper functions checking the case below:
strides[index-1] >= strides[index] * shapes[index];
Add a generic method for the padded buffer (we can simulate arm_compute 2d padding and more):
int paddings[rank] = {...}; // total padding
int paddingOffsets[rank] = {...}; //offset indices of the first element
This could be used to pad NdArray shapes and calculate strides based on them while keeping the original shape; paddingOffsets could be used to determine the position of the first element. Though this interface is more generic, its drawback is that in arm_compute it is possible to pad a 1d shape into 2d while keeping the rank, whereas here we would have to supply a 2d shape with one of its dimensions being 1.
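A minimal sketch of this generic interface, assuming C order and per-dimension `paddings`/`paddingOffsets` arrays as declared above (names are illustrative, not the final libnd4j interface):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PaddedBuffer {
    std::vector<size_t> strides; // strides of the padded (bigger) shape
    size_t offset;               // element offset of the first element
};

PaddedBuffer paddedBuffer(const std::vector<size_t> &shape,
                          const std::vector<size_t> &paddings,
                          const std::vector<size_t> &paddingOffsets) {
    size_t rank = shape.size();
    std::vector<size_t> padded(rank);
    for (size_t i = 0; i < rank; ++i)
        padded[i] = shape[i] + paddings[i]; // bigger hosting shape
    PaddedBuffer r{std::vector<size_t>(rank, 1), 0};
    for (int i = static_cast<int>(rank) - 2; i >= 0; --i)
        r.strides[i] = r.strides[i + 1] * padded[i + 1]; // C order
    // the offset of the first element follows from the padding offsets
    for (size_t i = 0; i < rank; ++i)
        r.offset += paddingOffsets[i] * r.strides[i];
    return r;
}
```

Simulating arm_compute's 2d padding then amounts to setting only the last two entries of `paddings` and `paddingOffsets`.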
A little investigation showed that the current NdArray actually has constructors that accept strides. Here is the constructor that could be used: ShapeDescriptor.h. Here are the additions to ShapeDescriptor: the method that uses ShapeDescriptor validation, and ShapeDescriptor paddedBuffer.
Furthermore, to indicate that the shape of the NdArray uses a padded buffer, we will flag it with ARRAY_HAS_PADDED_BUFFER, so it will be possible to know whether an NdArray is padded.
Furthermore, it is still possible to recover the paddings from the allocation size of the padded NdArray. But it is not an easy task to get the padding offsets from the offset and the recovered full shape; that is why it requires storing them. Fortunately, for manual padding of arm_compute tensors we just need to know the total size and the offset of the first element, so we do not need to change the internals that much.
As our padded buffer follows the strict ews() rule instead of the loose one, the paddings can be obtained from this rule:
strides[index-1] == strides[index] * shapes[index];
pseudo code for C order:
for (int j = rank - 1; j >= 1; j--) {
    shapesAfterPadding[j] = strides[j - 1] / strides[j];
}
shapesAfterPadding[0] = buffer.AllocSize / strides[0];
//Paddings for index in 0..rank-1
paddings[index] = shapesAfterPadding[index] - shape[index]
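The recovery pseudo code above can be made runnable as follows (a sketch; `recoverPaddings` is a hypothetical helper, and `allocSize` is in elements):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recover per-dimension paddings from C-order strides that obey the
// strict rule strides[j-1] == strides[j] * shapesAfterPadding[j].
std::vector<size_t> recoverPaddings(const std::vector<size_t> &shape,
                                    const std::vector<size_t> &strides,
                                    size_t allocSize) {
    size_t rank = shape.size();
    std::vector<size_t> shapesAfterPadding(rank);
    for (size_t j = rank - 1; j >= 1; --j)
        shapesAfterPadding[j] = strides[j - 1] / strides[j];
    // the outermost padded extent comes from the allocation size
    shapesAfterPadding[0] = allocSize / strides[0];
    std::vector<size_t> paddings(rank);
    for (size_t i = 0; i < rank; ++i)
        paddings[i] = shapesAfterPadding[i] - shape[i];
    return paddings;
}
```

For the float NdArray {2,2,5,5} with padded element strides {72,36,6,1} and 576 bytes (144 elements) allocated, this recovers total paddings {0,0,1,1}, matching the worked example later in this document.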
The main driver for the above proposal is to avoid unnecessary performance and memory-allocation overhead. We should also keep in mind:
This can diminish the necessity for the proposed changes if such versions of the desired functions are implemented.
Arm_compute tensors are mostly 3d or 4d, with a maximum of 6 dimensions. So let's show a C-order NdArray({2,2,5,5}) of float type:
shapeInfo: [4, 2,2,5,5, 50,25,5,1, 8192,1,99]
and its arm_compute tensor equivalent:
NdArray{n,z,y,x} -> TensorShape{x,y,z,n}
total length in bytes: 400
shapes: 5,5,2,2,1,1,
strides in bytes: 4,20,100,200,0,0,
strides as elements: (1,5,25,50)
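The order reversal shown above can be sketched in code. libnd4j shapeInfo is C order (slowest dimension first), while arm_compute's TensorShape lists x (the fastest dimension) first and stores strides in bytes; `toArmOrder` is a hypothetical helper for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ArmView {
    std::vector<size_t> shape;        // {x, y, z, n, ...}
    std::vector<size_t> stridesBytes; // per dimension, in bytes
};

// Reverse the dimension order and convert element strides to byte strides.
ArmView toArmOrder(const std::vector<size_t> &shape,
                   const std::vector<size_t> &strides, size_t elemSize) {
    ArmView v;
    for (size_t i = shape.size(); i-- > 0;) {
        v.shape.push_back(shape[i]);
        v.stridesBytes.push_back(strides[i] * elemSize);
    }
    return v;
}
```

For the float NdArray {2,2,5,5} with strides {50,25,5,1} this reproduces the shapes 5,5,2,2 and byte strides 4,20,100,200 listed above.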
Paddings in arm_compute Tensors: Padding{left, right, top, bottom}
As both OpenCL and NEON use vector load and store instructions to access the data in buffers, in order to avoid having special cases to handle the borders, all the images and tensors used in this library must be padded.
There are different ways padding can be calculated:
In an arm_compute Tensor, the 2d part {Width, Height} can be padded, and that is why padding affects the strides. Let's show it with a picture:
          \          top          /
           \ ___________________ /
     left  | <------ Width ----> |  right
           |  ^                  |
           |  |                  |
           | Height              |
           |  |                  |
           |  v                  |
            ---------------------
          /         bottom        \
         /                         \
Here is the stride calculation pseudo code for a Tensor {x,y,z}:
stride_x = element_size(); //float will be 4
stride_y = (padding.left + _tensor_shape[0] + padding.right) * stride_x;
stride_z = (padding.top + _tensor_shape[1] + padding.bottom) * stride_y;
required_offset_first_element = padding.left * stride_x + padding.top * stride_y;
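A direct, runnable transcription of that pseudo code (a sketch; `ArmPad`/`armStrides` are local stand-ins, not arm_compute's types, and all values are in bytes):

```cpp
#include <cassert>
#include <cstddef>

struct ArmPad { size_t left, right, top, bottom; };
struct ArmStrides { size_t x, y, z, offsetFirstElement; };

ArmStrides armStrides(size_t W, size_t H, ArmPad pad, size_t elemSize) {
    ArmStrides s;
    s.x = elemSize;                              // float will be 4
    s.y = (pad.left + W + pad.right) * s.x;      // padded row, in bytes
    s.z = (pad.top + H + pad.bottom) * s.y;      // padded plane, in bytes
    s.offsetFirstElement = pad.left * s.x + pad.top * s.y;
    return s;
}
```

Plugging in the example below (W = H = 5, padding left 0, right 1, top 0, bottom 1, float elements) gives stride_y = 24 and stride_z = 144 bytes, matching the listed strides 4,24,144,288 and the total 576 = 144 * 2 * 2.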
For example, if the arm_compute tensor had padding left 0, right 1, top 0, bottom 1:
total: 576
shapes: 5,5,2,2,1,1,
strides in bytes: 4,24,144,288,0,0,
Here is a simple wrapper for arm_compute functions with input and output tensors: armcomputeUtils.h#L95-L165
From the above we can conclude that we have two options:
Here is auto padding:
// Some kernels compute 32 elements at the time, worst case scenario they
// will read 32 values after the last element
extra_pad_x = _tensor_shape.num_dimensions() < 1 ? 0 : 32;
pad_x = _tensor_shape.num_dimensions() < 1 ? 0 : 4;
pad_y = _tensor_shape.num_dimensions() < 2 ? 0 : 4;
PaddingSize(pad_y, pad_x + extra_pad_x, pad_y, pad_x);
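The auto-padding rule above can be sketched as a standalone function. `PaddingSize` here is a local stand-in mirroring arm_compute's {top, right, bottom, left} argument order, not the library type itself:

```cpp
#include <cassert>
#include <cstddef>

struct PaddingSize { size_t top, right, bottom, left; };

PaddingSize autoPadding(size_t numDimensions) {
    // Some kernels compute 32 elements at the time; worst case scenario
    // they will read 32 values after the last element.
    size_t extra_pad_x = numDimensions < 1 ? 0 : 32;
    size_t pad_x       = numDimensions < 1 ? 0 : 4;
    size_t pad_y       = numDimensions < 2 ? 0 : 4;
    return PaddingSize{pad_y, pad_x + extra_pad_x, pad_y, pad_x};
}
```

So a 3d tensor gets Padding{top 4, right 36, bottom 4, left 4}, while a 1d tensor gets no vertical padding but still the full horizontal padding.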