ADR 0033: Shape Buffer Trie Implementation

Status

Implemented

Proposed by: Adam Gibson (19-12-2024) Discussed with: Paul Dubs

Context

The libnd4j library requires efficient storage and lookup of shape information for neural network operations. Shape information is usually calculated multilple times per operation and can be expensive to maintain. One goal is to reduce the overhead by getting rid of ShapeDescriptors which were unnecessary extra allocations rather than just using only the shape buffers. This was all stored in a ShapeDescriptor-based cache with an unordered map,

The primary challenges include:

Frequent shape buffer allocations and deallocations during neural network operations
Need for fast shape lookup during computation
Memory management of redundant shape information
Thread safety requirements for parallel execution
Overhead from ShapeDescriptor creation for cache lookups
Memory overhead from unordered map storage

Decision

We implement a shape buffer trie data structure (DirectShapeTrie) to manage and cache shape information, replacing the previous unordered map implementation. The trie structure is chosen for the following characteristics:

Key Components

A trie node structure containing:
- Shape buffer pointer
- Child node pointers
- Reference counting mechanism
Striped thread safety using an array of mutexes
Direct memory management of shape buffers
Sequential shape information exploitation similar to word tries

Implementation Details

The trie stores shape buffers based on their content as sequential paths
Each unique shape path in the trie represents a unique shape configuration
Reference counting is used to manage memory lifecycle
Thread safety is ensured through striped mutex locks
Direct memory allocation is used instead of standard containers
Removed ShapeDescriptor creation requirement for lookups
Shape values are stored sequentially in the trie, similar to characters in a word trie

Visual Example

Root
├── 2 (rank)
│   ├── 3,4 (shape values)
│   │   └── [ptr: shape_buffer_1]
│   └── 5,6 (shape values)
│       └── [ptr: shape_buffer_2]
└── 3 (rank)
    ├── 2,3,4 (shape values)
    │   │   └── [ptr: shape_buffer_3]
    └── 4,5,6 (shape values)
        └── [ptr: shape_buffer_4]

In this example:

Each level represents a component of the shape
First level: rank of the array
Subsequent levels: actual shape values
Leaf nodes contain pointers to the actual shape buffers
Multiple shapes can share common prefixes, saving memory

Thread Safety Implementation

The shape buffer cache implements striped locking using an array of mutexes:

cpp

mutable std::array<MUTEX_TYPE, NUM_STRIPES> _mutexes;

This design provides:

Reduced contention through multiple lock stripes
Better concurrency than a single global mutex
Lower memory overhead than per-node locking
Const-correctness through mutable mutex array

The striping mechanism:

Distributes shapes across multiple mutexes based on their characteristics
Allows concurrent operations on shapes in different stripes
Balances between fine-grained locking and implementation complexity

Consequences

Advantages

Memory Efficiency:
- Eliminates redundant shape buffer storage
- Automatic cleanup of unused shapes through reference counting
- Shared shape buffers across operations
- Removal of ShapeDescriptor allocation overhead
- Better memory locality due to trie structure
Performance:
- O(n) lookup time where n is the shape length
- Efficient shape comparison through pointer equality
- Reduced memory allocation overhead
- No ShapeDescriptor creation cost for lookups
- Sequential access patterns for shape values
Thread Safety:
- Safe concurrent access through striped mutex protection
- Atomic reference counting operations
- Protected shape buffer lifecycle management
Concurrency:
- Striped locking enables parallel access to different shape regions
- Better scaling under high concurrency than single mutex
- Maintains simplicity compared to per-node locking

Disadvantages

Implementation Complexity:
- Manual memory management requires careful implementation
- Reference counting edge cases need careful handling
- Thread synchronization adds complexity
- More complex trie traversal logic
Memory Overhead:
- Trie structure itself introduces memory overhead
- Additional pointers for trie navigation
Performance Trade-offs:
- Stripe selection adds minor overhead
- Multiple shapes might still hash to same stripe
- Reference counting operations add CPU overhead
- Fixed number of stripes limits maximum parallelism

Technical Details

Memory Management

cpp

sd::LongType* createBuffer(int length);
void deleteBuffer(sd::LongType* buffer);

API Design

cpp

sd::LongType* lookupBuffer(const sd::LongType* shape, int length);
void registerBuffer(const sd::LongType* shape, int length);
void decrementRef(const sd::LongType* buffer);

Alternatives Considered

Hash Table Implementation (Previous Approach):
- Pros: Simpler implementation, O(1) average lookup
- Cons: More memory usage, ShapeDescriptor overhead, potential hash collisions
- Remov
Alternative Locking Strategies:
- Single Global Mutex:
  - Pros: Simplest implementation
  - Cons: High contention under load
- Per-Node Locking:
  - Pros: Maximum concurrency
  - Cons: High memory overhead, complex synchronization
- Lock-Free Design:
  - Pros: No lock contention
  - Cons: Extremely complex implementation, harder to verify correctness