website/docs/dev/internals/indexedlog.md
IndexedLog is the main on-disk storage format used by various components in Sapling.
Historically, revlog was the main storage format, but it has a few limitations:
Git's format (loose and pack files) does not have the
topological order limitation, and lookup by hash is ideally O(log N), unless
there are too many pack files. But it requires periodical repack to maintain
performance, and does not support the multi-index use-case.
SQLite is a powerful library that satisfies the multi-index use-cases, and can maintain time complexity without "repack". However, the main problem is that historically we use the "append-only" strategy to achieve lock-free reads, but SQLite requires locking for both read and write.
IndexedLog is built to achieve the following desired properties:
repack to maintain performance.In addition, we think the property below is nice to have:
sed -i on the
repository data, we want to understand exactly what parts of the data are
corrupted, and have a way to recover.An IndexedLog consists of a Log and multiple surrounding Indexes. The Log is the single source of truth for the data. The indexes are derived purely from the Log and index functions. Corrupted indexes can be deleted and rebuilt as long as the Log is fine.
A Log stores entries in insertion order. An entry consists of a slice of
bytes. Log is interface-wise similar to LinkedList<Vec<u8>>, but only
accepts push_back, not push_front.
A Log supports the following operations:
A Log does not support:
Unlike a relational or document database, Log itself does not define the meaning of the bytes in an entry. The meaning is up to the application to decide.
A Log can have multiple indexes. An index has a name, and a pure function that
takes the byte slice of an entry, and outputs a list of IndexOutputs.
Each IndexOutput is an enum that instructs IndexedLog to do one of the
following:
Unlike databases, the index function is in native Rust and not a separate language like SQL or JSON. Note that it is impossible to serialize a compiled Rust function to disk, so applications need to provide the exact same index functions when loading an IndexedLog from disk.
If you change an index function, you need to also change the index name to prevent the IndexedLog from picking up the wrong index data.
The Index can be used independently from Log. Its interface is similar
to BTreeMap<Vec<u8>, LinkedList<u64>>, but uses the filesystem instead of
memory for the main storage.
An Index supports:
Internally, the key portion of the index is a
radix tree where a node has 16
children (4 bits). This is used to support hex prefix lookup. The value portion
is a singly-linked list which supports push_front, but not push_back.
The on-disk format uses persistent data structure to achieve lock-free reading. The main index is append-only. The pointer to the root tree node is a small piece of data tracked separately using atomic-replace.
When used together with a Log, the u64 part of LinkedList<u64> is used as
file offsets. The offsets are not exposed in public APIs of Log to avoid
misuse. The Log allows the on-disk indexes to lag for some entries because
updating the index for 1 entry takes O(log N) space, inefficient for frequent
small writes. The lagging portion of index will be built on demand in memory.
When an IndexedLog (or a standalone Index) gets loaded from disk, it is like taking a snapshot. Changes on disk afterwards won't affect the already loaded IndexedLog (as long as all writes to the files go through IndexedLog APIs).
Writes are buffered in memory, lock-free. They are invisible to other processes or other already loaded IndexedLogs.
A sync operation is needed to write data to disk, or load changed data from
disk. The sync will take a filesystem lock to prevent other writers, pick up
the latest filesystem state if anything has changed on disk, write the updated
log and indexes to disk, then release the lock.
If 2 processes (or 2 IndexedLogs in one process) are sync()-ing to the same
IndexedLog on disk concurrently, both their pending changes will be written.
The order of the written data is unspecified, depends on which one obtains the
filesystem lock first.
Both Log and Index use xxhash for data integrity. Log writes a XXH32 or XXH64 checksum per entry depending on the size of the entry. Index internally maintains checksum entries per 1MB data. All data reading triggers integrity checks. Errors will be reported to the application.
IndexedLog supports a "repair" operation which truncates the Log to entries that pass the integrity check and then rebuilds the corrupted or outdated indexes.
RotateLog applies the log rotation idea to IndexedLog. RotateLog maintains a list of Logs. When a Log exceeds certain size limit, RotateLog creates a new Log and optionally delete the oldest ones.
RotateLog is intended to be used for client-side caching, where the client wants space usage to be bounded, and the data can be re-fetched from the server.