docs/AdjacencyList.md
The AdjacencyList in Parcel is a complex implementation of
an otherwise straightforward data structure.
The core ideas behind the AdjacencyList are:
Conceptually, this is achieved by associating each edge that is added to or from a node with a previous edge that was added to or from that node.
Where Parcel's AdjacencyList gets complex is in its data layer, which is
designed for:
The simplest way of thinking about AdjacencyList is as a list of lists.
The primary list contains (conceptually) node structs (ordered by numeric id),
with each field on the struct pointing to doubly linked lists of (again, conceptual)
edge structs that also connect to or from the same node (ordered by insertion).
This looks something like:
// NOTE: Not a real type, just an illustration!
type AdjacencyList<{
id: NodeId,
incoming?: EdgeList,
incomingReverse?: EdgeList,
outgoing?: EdgeList,
outgoingReverse?: EdgeList,
}>
type EdgeList<{
from: NodeId,
to: NodeId,
}>
graph TD;
0 --> |a| 1
0 --> |b| 2
1 --> |c| 2
graph LR
subgraph 0[Node 0]
direction LR
0o([outgoing]) --- 0oa[[a]] <--> 0ob[[b]] --- 0or([outgoingReverse])
end
subgraph 1[Node 1]
direction LR
1i([incoming]) --- 1ia[[a]] --- 1ir([incomingReverse])
1o([outgoing]) --- 1oc[[c]] --- 1or([outgoingReverse])
end
subgraph 2[Node 2]
direction LR
2i([incoming]) --- 2ib[[b]] <--> 2ic[[c]] --- 2ir([incomingReverse])
end
This makes traversal of every edge of the graph from any node a straightforward process of following the links.
</td></tr> </table>Under the hood, things are less straightforward.
Since Parcel is multi-threaded, its implementation of AdjacencyList
features another core idea:
AdjacencyList should be accessible
from multiple threads with no overheadHere, overhead refers to the serialization costs that add up in multi-threading scenarios.
To accommodate this, AdjacencyList features a custom data structure called
the SharedTypeMap:
Uint32Array
backed by a SharedArrayBufferTYPE fieldThis means that data being added to SharedTypeMap gets a pass through
a hash function to produce stable key for storing and retrieving that data.
Two nodes in a Parcel graph may be related to in each other in more than one way. In order to represent multiple relationships between the same two nodes, each edge that connects the nodes has a type.
Thus, when storing data about connections between adjacent nodes,
AdjacencyList uses the type as part of the identifying information.
As mentioned above, to deal with collisions, SharedTypeMap employs a version
of a coalesced hashing strategy. In this version:
The SharedTypeMap is partitioned into 3 subarrays:
graph LR
header ~~~ hashtable ~~~ items
subgraph header
direction LR
0[0: CAPACITY] ~~~ 1[1: COUNT]
end
subgraph hashtable[hash table]
direction LR
2[2: hash 0] ~~~ 3[3: hash 1] ~~~ 4[4: hash 2]
end
subgraph items[addressable space]
direction LR
item1 ~~~ item2 ~~~ item3 ~~~ item4
subgraph item1[item 1]
direction LR
5([5: NEXT]) ~~~ 6[6: TYPE]
end
subgraph item2[item 2]
direction LR
7([7: NEXT]) ~~~ 8[8: TYPE]
end
subgraph item3[item 3]
direction LR
9([9: NEXT]) ~~~ 10[10: TYPE]
end
subgraph item4[item 4]
direction LR
11([11: NEXT]) ~~~ 12[12: TYPE]
end
end
CAPACITY: the total number of items that can fit in the array.
This is always equal to the length of the hash table.COUNT: the number of items that are currently in
the addressable space partition.NEXT: the address of the next item with the same hash as this item;
this is a link in the hash bucket.TYPE: the item's typeNote that the fields described above are inherited by NodeTypeMap
and EdgeTypeMap. They both extend this base set with fields of their own.
When there is a hash collision, the previous value is augmented with a link to the address of the colliding value.
These linked lists of values that have hash collisions are known as buckets.
Though SharedTypeMap stores data in a hash table, it presents
an API that is more like a linked list. Instead of adding an item
to the SharedTypeMap through something like an add method,
SharedTypeMap has a method called link:
link(hash: THash, item: TAddress, type: TItemType): void;
This method pairs with getNextAddress:
getNextAddress(): TAddress
This is because SharedTypeMap is really a base class implementation;
its subclasses (See NodeTypeMap and EdgeTypeMap)
extend this API with more map-like add methods.
As link implies, each index in the hash table can be thought of
as the head of a linked list.
When an item is added to a SharedTypeMap (or rather, to one of its subclasses),
the getNextAddress method is invoked to find the next location in the
addressable space that is unoccupied. The next address is always computed
in constant time by adding the current item count (multiplied by the item size)
to the offset of the first addressable space. This has the effect of storing items
in the addressable space in insertion order.
For example, here we see that hash 0 has 1 collision:
// map.link(hash, address, type)
map.link(0, map.getNextAddress(), 1)
map.link(1, map.getNextAddress(), 2)
map.link(2, map.getNextAddress(), 3)
map.link(0, map.getNextAddress(), 1)
graph LR
2[2: hash 0] -- head --> 5([5: NEXT]) -- next --> 11([11: NEXT])
3[3: hash 1] -- head --> 7([7: NEXT])
4[4: hash 2] -- head --> 9([9: NEXT])
Or, as raw Uint32Array data:
| label | CAPACITY | COUNT | # | # | # | NEXT | TYPE | NEXT | TYPE | NEXT | TYPE | NEXT | TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| value | 4 | 4 | 5 | 7 | 9 | 11 | 1 | 0 | 2 | 0 | 3 | 0 | 1 |
Notice that, thanks to getNextAddress(), the items are stored
in insertion order, not by type or hash!
Also notice that the hashes we used to link are mapped to indices in the hash table
(the # above), and that the values stored at those indices are the addresses
returned by getNextAddress.
You may have also noticed that our last link has a collision with our first link
(they both use hash 0). In that case, the address for our last link is stored
in the NEXT field of the item at the address that is currently stored for hash 0:
graph LR
hash0[hash 0] -- HEADER_SIZE + 0 --> index2[2]
hash1[hash 1] -- HEADER_SIZE + 1 --> index3[3]
hash2[hash 2] -- HEADER_SIZE + 2 --> index4[4]
index2[2] -- head --> address5[5] -- next --> address11[11]
index3[3] -- head --> address7[7]
index4[4] -- head --> address9[9]
As mentioned above, SharedTypeMap is a base implementation; it isn't used
directly. AdjacencyList uses two subclasses of SharedTypeMap:
NodeTypeMap, which is referred to internally as nodesEdgeTypeMap, which is referred to internally as edgesThe business of AdjacencyList is therefore interacting with these two maps when:
nodes with records in edgesnodes from records in edgesnodes
to records in edges (and back)The EdgeTypeMap extends SharedTypeMap with API for:
Edges are identified by a hash of their to, from, and type values.
For any given set of these values, the hash function deterministically produces
a number that that fits within the current capacity of the EdgeTypeMap.
That hash number is then the index in the hash table where the head of the linked list of edges with this hash is stored.
See Hash buckets: another list of lists for more.
The EdgeTypeMap records can be thought of as the links
in the linked lists of edges connected to or from a given node.
As described in Node records, There is a linked list per edge type, and also per direction:
graph TD;
0 --> |1| 1
0 --> |2| 1
1 --> |1| 2
0 --> |2| 2
1 --> |2| 2
graph LR
subgraph e4[edge 4]
direction LR
e4n1[1] -- 2 --> e4n2[2]
end
subgraph e3[edge 3]
direction LR
e3n1[1] -- 1 --> e3n2[2]
end
subgraph e2[edge 2]
direction LR
e2n0[0] -- 2 --> e2n2[2]
end
subgraph e1[edge 1]
direction LR
e1n0[0] -- 2 --> e1n1[1]
end
subgraph e0[edge 0]
direction LR
e0n0[0] -- 1 --> e0n1[1]
end
e1 -- next out --> e2 -- prev out --> e1
e2 -- next in --> e4 -- prev in --> e2
Because edge records are only created once per unique pair of node ids
and edge type, and deleted edges do not get reclaimed (without a resize
and rehash of every edge, see Resizing the EdgeTypeMap),
the maximum number of edge records that will be stored is d + n * n * e,
where n is the number of nodes and e is the number of unique edge types,
and d is the number of deleted edges.
Edge records occupy 32 bytes of contiguous space,
labelled as the following uint32 (4 byte) fields:
NEXT: The next edge record with the same hashTYPE: The edge typeFROM: The id of the originating nodeTO: The id of the terminating nodeNEXT_IN: The address of the next incoming edge of the same type to the same terminating nodePREV_IN: The address of the previous incoming edge of the same type to the same terminating nodeNEXT_OUT: The address of the next outgoing edge of the same type from the same originating nodePREV_OUT: The address of the previous outgoing edge of the same type from the same originating nodeThe capacity of EdgeTypeMap is always at least the total number of edges
in the graph.
Note that between any two nodes, there my only be one edge per unique edge type.
Recalling that the Uint32Array is partitioned into 3 subarrays:
| header | hash table | addressable space |
|---|
Given a capacity of c:
Uint32Array are the header: HEADER_SIZE = 3cITEM_SIZE * cITEM_SIZE = 8The size of the Uint32Array is therefore 3 + 9c.
The EdgeTypeMap is resized on demand in addEdge.
There are two scenarios in which the EdgeTypeMap is resized:
if there is not enough capacity in the addressable space to accommodate an additional edge
if there are enough deletes that the capacity can be reduced and still fit the number of edges plus an additional edge.
In the first case, the capacity must increase. This is accomplished by
computing a next capacity from the current capacity. The next capacity
follows a linear progression, starting from the maxGrowFactor (defaulting to 8),
and scaled inversely linearly the minGrowFactor (defaulting to 2) from capacity
0 to peakCapacity. Roughly the following formula:
maxGrowFactor + (minGrowFactor - maxGrowFactor) * (capacity / peakCapacity);
Once the capacity exceeds peakCapacity, the capacity grows by the minGrowFactor.
In the second case, the capacity can decrease. This is accomplished by halving the capacity.
In both cases, after computing the next capacity, a new EdgeTypeMap
is created with that new capacity along with a new NodeTypeMap that retains
its current capacity. Every edge in the current EdgeTypeMap is then
rehashed and relinked into the new maps.
The NodeTypeMap extends SharedTypeMap with API for:
The NodeTypeMap records can be thought of as the heads of
linked lists of edges connected to or from a given node.
There is a linked list per edge type, and also per direction:
graph TD;
0 --> |1| 1
0 --> |2| 1
1 --> |1| 2
0 --> |2| 2
graph LR
subgraph n0[node 0]
et1(edge type 1)
et2(edge type 2)
end
subgraph edges
subgraph e0[edge 0]
direction LR
e0n0[0] -- 1 --> e0n1[1]
end
subgraph e1[edge 1]
direction LR
e1n0[0] -- 2 --> e1n1[1]
end
subgraph e2[edge 2]
direction LR
e2n0[0] -- 2 --> e2n2[2]
end
end
et1 -- first out --> e0
et1 -- last out --> e0
et2 -- first out --> e1
et2 -- last out --> e1
e1 -- next out --> e2
e2 -- prev out --> e1
graph LR
subgraph n1[node 1]
et1(edge type 1)
et2(edge type 2)
end
subgraph edges
subgraph e0[edge 0]
direction LR
e0n0[0] -- 1 --> e0n1[1]
end
subgraph e1[edge 1]
direction LR
e1n0[0] -- 2 --> e1n1[1]
end
subgraph e3[edge 3]
direction LR
e3n1[1] -- 1 --> e3n2[2]
end
end
et1 -- first in --> e0
et1 -- last in --> e0
et1 -- first out --> e3
et1 -- last out --> e3
et2 -- first in --> e1
et2 -- last in --> e1
graph LR
subgraph n2[node 2]
et1(edge type 1)
et2(edge type 2)
end
subgraph edges
subgraph e2[edge 2]
direction LR
e2n0[0] -- 2 --> e2n2[2]
end
subgraph e3[edge 3]
direction LR
e3n1[1] -- 1 --> e3n2[2]
end
end
et1 -- first in --> e3
et1 -- last in --> e3
et2 -- first in --> e2
et2 -- last in --> e2
Because node records are only created once per unique node id and edge type,
the maximum number of node records that will be stored is n * e,
where n is the number of nodes (the count of node ids issued)
and e is the number of unique edge types in the graph.
Node records occupy 24 bytes of contiguous space,
labelled as the following uint32 (4 byte) fields:
NEXT: The next node record with the same node id, but a different TYPETYPE: The edge type for edges linked to or from the node via this recordFIRST_IN: The address in EdgeTypeMap of the first edge of this type to this nodeFIRST_OUT: The address in EdgeTypeMap of the first edge of this type from this nodeLAST_IN: The address in EdgeTypeMap of the last edge of this type to this nodeLAST_OUT: The address in EdgeTypeMap of the last edge of this type from this nodeAs implied above, adding a node to AdjacencyList does not actually increment
the count of nodes in the NodeTypeMap, it only increments the nodeId.
This means that the number of nodes in the graph is not the same as
the number of node records in the AdjacencyList.
However, the capacity of NodeTypeMap is always at least the larger of the total
number of node ids issued (nextId) or the number of node records in the map (count).
The reasons that the capacity must consider both nextId and count are:
NodeTypeMap, so the hash table must have
at least enough capacity to fit the highest node id.count
and may require additional addressable space (e.g., if most nodes are already connected).Recalling that the Uint32Array is partitioned into 3 subarrays:
| header | hash table | addressable space |
|---|
Given a capacity of c:
Uint32Array are the header: HEADER_SIZE = 3
NEXT andTYPE have the same meaning as in SharedTypeMapNEXT_ID tracks the number of unique node ids issued by addNodecITEM_SIZE * cITEM_SIZE = 6
The size of the Uint32Array is therefore 3 + 7c.
There are two scenarios in which the NodeTypeMap is resized:
in addNode, if there is not enough capacity in the hash table
to accommodate the number of nodes implied by the next node id.
in addEdge, if the edge being added is the first of its type for
either node, and there is not enough capacity in the addressable space
to accommodate the additional count of node records required.
In either case, the resizing is accomplished by creating a new NodeTypeMap
with double the capacity of the current one, and then copying the data
in the hash table and addressable space directly into the new map.
To account for the increase in the size of the hash table (remember that the size of the hash table is always equal to the capacity), all pointers into the addressable space are adjusted by the delta between the old map capacity and the new capacity, e.g., if the old capacity was 4, and now it's 8, all of the internal addresses would be shifted by 4.
This works because the node ids are sequential and immutable, so no rehashing
is necessary. The internal addresses can all be safely incremented by
the capacity delta, while the addresses to edges, which point to the EdgeTypMap,
don't need to change at all (as its size hasn't changed).
graph TD;
0 --> |1| 1
0 --> |2| 1
1 --> |1| 2
0 --> |2| 2
AdjacencyList looks like:
graph LR
subgraph nodes
subgraph n0[node 0]
na17[(type 1)]
na29[(type 2)]
na17 -- next --> na29
end
subgraph n1[node 1]
na11[(type 1)]
na23[(type 2)]
na11 -- next --> na23
end
subgraph n2[node 2]
na35[(type 2)]
na41[(type 1)]
na35 -- next --> na41
end
end
subgraph edges
subgraph e0[hash 0]
ea7[(type 1)]
end
subgraph e1[hash 1]
ea15[(type 2)]
end
subgraph e2[hash 2]
ea31[(type 1)]
end
subgraph e3[hash 3]
ea23[(type 2)]
end
ea15 -- next out --> ea23
ea23 -- prev out --> ea15
end
ea7 -- from --> n0
ea7 -- to --> n1
ea15 -- from --> n0
ea15 -- to --> n1
ea23 -- from --> n0
ea23 -- to --> n2
ea31 -- from --> n1
ea31 -- to --> n2
na11 -- first in --> ea7
na11 -- first out --> ea31
na11 -- last in --> ea7
na11 -- last out --> ea31
na17 -- first out --> ea7
na17 -- last out --> ea7
na23 -- first in --> ea15
na23 -- last in --> ea15
na29 -- first out --> ea15
na29 -- last out --> ea23
na35 -- first in --> ea23
na35 -- last in --> ea23
na41 -- first out --> ea31
na41 -- last out --> ea31
Or, as raw Uint32Array data:
nodes:
header hash table addressable space
0 1 2 3 4 5 6 7 8 9 10 11 [ 23 1 7 31 7 31 ]
8 6 3 17 11 35 0 0 0 0 0 17 [ 29 1 0 7 0 7 ]
23 [ 0 2 15 0 15 0 ]
29 [ 0 2 0 15 0 23 ]
35 [ 41 2 23 0 23 0 ]
41 [ 0 1 31 0 31 0 ]
47 [ 0 0 0 0 0 0 ]
53 [ 0 0 0 0 0 0 ]
edges:
header hash table addressable space
0 1 2 3 4 5 6 7 [ 0 1 0 1 0 0 0 0 ]
4 4 0 15 7 31 23 15 [ 0 2 0 1 0 0 23 0 ]
23 [ 0 2 0 2 0 0 0 15 ]
31 [ 0 1 1 2 0 0 0 0 ]