GVFS/GVFS.Virtualization/Projection/Readme.md
This document is intended to give developers a better understanding of the GitIndexProjection class and its associated classes, along with the design and architectural decisions behind them. In simplest terms, the purpose of the GitIndexProjection class is to parse the .git/index file and build an in-memory tree of the directories and files, which is used when a file system request comes from the virtual file system driver. GVFS.Mount.exe keeps an instance of this class in memory for the lifetime of the process. This helps VFSForGit respond quickly to file system operations such as enumeration or on-demand hydration. VFSForGit uses the skip-worktree bit to know what to include in the projection data and which files git will be keeping up to date. Currently VFSForGit only supports version 4 of the index. Details on the index format and version 4 can be found in Git's index-format.txt documentation.
This code was designed for incredibly large repositories (over 3 million files and 500K folders). Multiple internal classes are used to help with the prioritized objectives of:
Some things used to achieve these are:
unsafe code and fixed pointers for speed.

These are some of the processes that use the GitIndexProjection.
Enumeration is tracked on a per call basis with a Guid and an ActiveEnumeration so that multiple enumerations can run and be restarted without affecting each other.
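The per-call tracking described above can be pictured with a small sketch. It is written in Python purely for illustration (the real code is C#), and the `ActiveEnumeration` members shown here (`restart`, `next_entry`) and the helper functions are hypothetical stand-ins, not the actual API.

```python
import uuid


class ActiveEnumeration:
    """Hypothetical stand-in: tracks the position of one enumeration call."""

    def __init__(self, entries):
        self._entries = list(entries)
        self._position = 0

    def restart(self):
        # Restarting only rewinds this enumeration, never any other one.
        self._position = 0

    def next_entry(self):
        if self._position >= len(self._entries):
            return None
        entry = self._entries[self._position]
        self._position += 1
        return entry


# One ActiveEnumeration per enumeration id, so concurrent calls stay isolated.
active_enumerations = {}


def start_enumeration(enumeration_id, entries):
    active_enumerations[enumeration_id] = ActiveEnumeration(entries)


def end_enumeration(enumeration_id):
    active_enumerations.pop(enumeration_id, None)
```

Keying the state by id is what lets two enumerations of the same folder advance, or restart, independently of each other.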
- `IRequiredCallbacks.StartDirectoryEnumerationCallback`
- `ProjectedFileInfo` objects
- `IRequiredCallbacks.GetPlaceholderInfoCallback`
- `IRequiredCallbacks.GetFileDataCallback`
- `IWriteBuffer` returned by a call to the virtualization instance's `CreateWriteBuffer` method.
- (`NamedPipeMessages.AcquireLock.AcquireRequest`).
- (`NamedPipeMessages.ModifiedPaths.ListRequest`).
- (`NamedPipeMessages.DownloadObject.DownloadRequest`).
- (`NamedPipeMessages.PostIndexChanged.NotificationRequest`). This will wait for the hook to return before continuing. This is important because the hook is when the projection is updated, and the update needs to be complete before git continues or it may see the wrong projection.
- (`NamedPipeMessages.ReleaseLock.Request`).
FileTypeAndMode

Class used only on file systems that support file modes. The mode is stored in the git index and is needed when the file is created on disk.
PoolAllocationMultipliers

Class used to hold the multipliers that are applied to the various pools in the code. These numbers come from running with various sized repos and determining what was best for keeping the pools at reasonable sizes.
ObjectPool

Class that is a generic pool of some type of object that will dynamically grow or can be shrunk to free objects when too many get allocated. All objects for the pool are created at the time the pool is expanded. The LazyUTF8String.BytePool is a specialized pool to allow the use of a pointer into the allocated byte[].
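As a rough illustration of the pooling idea, here is a minimal Python sketch (the real class is C#). The method names and the 10% growth rule are assumptions made for the example; only the behavior described above (objects are created when the pool expands, and the pool can be shrunk) comes from this document.

```python
class ObjectPool:
    """Illustrative pool: objects are created up front and handed out in order."""

    def __init__(self, factory, initial_size):
        self._factory = factory
        # All objects are created when the pool is sized, not when requested.
        self._items = [factory() for _ in range(initial_size)]
        self._next_free = 0

    @property
    def size(self):
        return len(self._items)

    def get_new(self):
        if self._next_free == len(self._items):
            self._expand()
        item = self._items[self._next_free]
        self._next_free += 1
        return item

    def _expand(self):
        # Assumed growth rule for this sketch: grow by 10%, at least one object.
        grow_by = max(1, len(self._items) // 10)
        self._items.extend(self._factory() for _ in range(grow_by))

    def free_all(self):
        # Existing objects are kept and handed out again on the next pass.
        self._next_free = 0

    def shrink(self):
        # Drop the objects that are not currently handed out.
        del self._items[self._next_free:]
```

Reusing the same objects across projection rebuilds is what keeps the number of allocations low for very large repositories.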
FolderEntryData

Abstract base class for data about an item that is in a folder. Contains the name and a flag for whether the entry is a folder. FolderData and FileData are the derived classes for this class.
FolderData

Class containing the data about a folder in the projection. Includes the child entries as a SortedFolderEntries object, a flag to indicate that the children's sizes have been populated, and a flag to indicate whether the folder should be included in the projection (this applies when using sparse mode).
FileData

Class containing the data about a file in the projection. Includes the size and the SHA1. The SHA1 is stored as two ulong values and one uint for performance and memory usage.
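To make the SHA1 layout concrete: a SHA1 is 20 bytes, which fits exactly into two 64-bit integers and one 32-bit integer. The Python sketch below shows the idea; the function names and the big-endian byte order are assumptions for the example, not a statement about how FileData packs the bytes.

```python
import struct


def pack_sha1(hex_sha):
    """Split a 40-character hex SHA1 into (uint64, uint64, uint32)."""
    raw = bytes.fromhex(hex_sha)
    if len(raw) != 20:
        raise ValueError("a SHA1 is exactly 20 bytes")
    # ">QQI" = big-endian, two unsigned 64-bit values and one unsigned 32-bit value.
    return struct.unpack(">QQI", raw)


def unpack_sha1(parts):
    """Rebuild the hex SHA1 from the three integers."""
    return struct.pack(">QQI", *parts).hex()
```

Three fixed-size integers avoid allocating a separate 20-byte array or a 40-character string per file, which matters when there are millions of files.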
LazyUTF8String

Class used to keep track of the string from the index that is in the BytePool. It converts from the BytePool to a string only when needed, either by calling the GetString method or by calling Compare when one of the strings is not all ASCII.
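The lazy conversion can be sketched as follows. This is a simplified Python illustration (the real class is C# and works on a pointer into the BytePool); the ordinal, case-sensitive comparison and the cached decoded value are assumptions of the sketch.

```python
class LazyUTF8String:
    """Illustrative: keeps the raw UTF-8 bytes and decodes only on demand."""

    def __init__(self, raw):
        self._raw = raw
        self._is_ascii = all(b < 0x80 for b in raw)
        self._decoded = None  # filled in the first time a string is needed

    def get_string(self):
        if self._decoded is None:
            self._decoded = self._raw.decode("utf-8")
        return self._decoded

    def compare(self, other):
        if self._is_ascii and other._is_ascii:
            # Fast path: compare the bytes directly, no string objects created.
            a, b = self._raw, other._raw
        else:
            # Slow path: at least one side is not ASCII, so decode both.
            a, b = self.get_string(), other.get_string()
        return (a > b) - (a < b)
```

Comparing two ASCII names never creates a string, so sorting mostly-ASCII paths stays cheap.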
SortedFolderEntries

Class used to keep the list of entries for a folder (either FolderData or FileData objects) in sorted order. This class keeps the static pool of both FolderData and FileData objects for reuse.
A couple of things to note:
This uses the Compare method of the LazyUTF8String class as a performance optimization. Most of the time the paths in the index are ASCII, so the code can do a byte-by-byte comparison instead of converting to string objects and then comparing, which costs both time and memory.
When getting the index of the name in the sorted entries it will return the bitwise complement of the index where the item should be inserted. This was done to avoid making one call to determine if the name exists and a second call to get the index for insertion.
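The bitwise-complement convention can be illustrated with a short Python sketch. The function names are hypothetical, and a plain list of strings stands in for the sorted entries.

```python
def index_of(sorted_names, name):
    """Binary search: the index if found, else the bitwise complement (~)
    of the index where the name should be inserted."""
    low, high = 0, len(sorted_names) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_names[mid] == name:
            return mid
        if sorted_names[mid] < name:
            low = mid + 1
        else:
            high = mid - 1
    return ~low  # always negative, so it cannot be mistaken for a real index


def add_if_missing(sorted_names, name):
    index = index_of(sorted_names, name)
    if index < 0:
        # One search gave both answers: "not found" and "insert here".
        sorted_names.insert(~index, name)
        return True
    return False
```

A negative result means the name is absent, and applying `~` again recovers the insertion point, so no second search is needed.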
SparseFolderData

Class used to keep the sparse folder information. It contains a flag for whether the folder should be recursed into for projection, the depth of the folder, and the children in a Dictionary<string, SparseFolderData> keyed by name.
When sparse mode is enabled, this data is used to determine which folders should be included in the projection. A root instance (rootSparseFolder) is kept in the GitIndexProjection. It is not recursive, so only files in the root folder are projected when there aren't any other sparse folders. When sparse folders are added via the SparseVerb, the children of the root instance are inserted or removed accordingly.
For example, when `gvfs sparse --set foo/bar/example;other` runs, there will be 2 sparse folders: foo/bar/example and other.
`rootSparseFolder` in the `GitIndexProjection` would have:
Children:
|- foo (IsRecursive = false, Depth = 0)
|  Children:
|  |- bar (IsRecursive = false, Depth = 1)
|     Children:
|     |- example (IsRecursive = true, Depth = 2)
|
|- other (IsRecursive = true, Depth = 0)
This will cause the root folder to have files and folders for foo and other. foo will only have the bar folder and all its files, but no other folders will be projected. The foo/bar folder will only have the example folder and all its files, but no other folders will be projected. The foo/bar/example and other folders will have all child files and folders projected recursively.
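The example above can be reproduced with a small Python sketch. The field names mirror the description of SparseFolderData, but `add_sparse_folder` and `is_folder_projected` are hypothetical helpers, and the lookup rule is one reading of the behavior described here rather than the actual implementation.

```python
class SparseFolderData:
    def __init__(self, is_recursive=False, depth=0):
        self.is_recursive = is_recursive
        self.depth = depth
        self.children = {}  # name -> SparseFolderData


def add_sparse_folder(root, path):
    """Insert each path segment; only the last segment is marked recursive."""
    current = root
    for depth, name in enumerate(path.split("/")):
        child = current.children.get(name)
        if child is None:
            child = SparseFolderData(depth=depth)
            current.children[name] = child
        current = child
    current.is_recursive = True


def is_folder_projected(root, path):
    """A folder is projected if it lies on the way to a sparse folder
    or anywhere below a recursive one."""
    current = root
    for name in path.split("/"):
        if current.is_recursive:
            return True
        current = current.children.get(name)
        if current is None:
            return False
    return True


root_sparse_folder = SparseFolderData()
for sparse_path in "foo/bar/example;other".split(";"):
    add_sparse_folder(root_sparse_folder, sparse_path)
```

With this tree, `foo`, `foo/bar`, and everything under `foo/bar/example` or `other` are projected, while a sibling such as `foo/baz` is not.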
GitIndexEntry

Class used to store the data from the index about a single entry. There is only one instance of this class used during index parsing and it is reused for each index entry. The reason for this is that version 4 of the git index has the path prefix compressed and the previous path is needed to create the path for the current entry. The code in this class is heavily optimized to make parsing the index and the paths as fast as possible.
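To show why the previous path is needed, here is a minimal Python sketch of prefix decompression. In index version 4 each entry records how many bytes to remove from the end of the previous path, followed by the bytes to append. The sketch assumes those two values have already been decoded into `(strip_count, suffix)` pairs; the function name and input shape are hypothetical.

```python
def expand_paths(compressed_entries):
    """Yield full paths from (strip_count, suffix) pairs, reusing one buffer."""
    path_buffer = bytearray()  # holds the previous path between entries
    for strip_count, suffix in compressed_entries:
        if strip_count > len(path_buffer):
            raise ValueError("cannot strip more bytes than the previous path has")
        # Remove strip_count bytes from the end of the previous path...
        del path_buffer[len(path_buffer) - strip_count:]
        # ...and append the new suffix to get the current path.
        path_buffer += suffix
        yield path_buffer.decode("utf-8")


entries = [(0, b"src/a.txt"), (5, b"b.txt"), (9, b"test/c.txt")]
paths = list(expand_paths(entries))
```

Because every path is built from the one before it, a single reusable buffer is enough, which matches the single reused entry instance described above.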
GitIndexParser

Class that is responsible for parsing the git index based on version 4. Please see index-format.txt for detailed information on this format. This is used both to validate the index and to build the projection. It currently ignores all index extensions and is only for getting the paths and building the tree using the FolderData and FileData classes. The index is read in chunks of 512K, which gave the best performance.
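As a rough sketch of the first steps of parsing, the Python below checks the 12-byte index header (the `DIRC` signature, a 4-byte version, and a 4-byte entry count, all big-endian as described in index-format.txt) and reads a stream in fixed-size chunks. The function names are hypothetical, and 512K is taken here to mean 512 * 1024 bytes, which is an assumption.

```python
import io
import struct

SUPPORTED_VERSION = 4
CHUNK_SIZE = 512 * 1024  # assumed meaning of "512K"


def read_index_header(header_bytes):
    """Validate the header and return the number of entries in the index."""
    signature, version, entry_count = struct.unpack(">4sII", header_bytes[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    if version != SUPPORTED_VERSION:
        raise ValueError("unsupported index version: %d" % version)
    return entry_count


def read_in_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield the stream contents one chunk at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk
```

Reading in large chunks reduces the number of I/O calls; the right chunk size is something to measure for a given workload.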
ProjectedFileInfo

Class used to hold the data that is used by FileSystemVirtualizer when enumerating or creating placeholders.
GitIndexProjection

Class used to hold the projection data and keep it up to date. This code uses multiple threads and can be called from multiple threads. It uses ReaderWriterLockSlim to synchronize access to the projection and ResetEvents for waiting and notification of events. There are caches for a variety of objects that are used.
Found in the Initialize method and does the following:
There is a thread started when the class is initialized that waits to be woken up to parse the index. Events are used to indicate when the parsing is complete to make sure that the projection is in a good state before using it.
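The wake-up and completion signalling can be sketched in Python with `threading.Event`. This is only an illustration of the pattern: the class and method names are hypothetical, and the actual C# code uses ResetEvents and its own locking.

```python
import threading


class ProjectionUpdater:
    """Illustrative background parser driven by two events."""

    def __init__(self, parse_index):
        self._parse_index = parse_index
        self._lock = threading.Lock()
        self._wake_up = threading.Event()
        self._parsing_complete = threading.Event()
        self._parsing_complete.set()  # nothing pending at start
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def invalidate(self):
        """Mark the projection stale and wake the parsing thread."""
        with self._lock:
            self._parsing_complete.clear()
            self._wake_up.set()

    def wait_for_projection(self, timeout=None):
        """Block until the projection is in a good state again."""
        return self._parsing_complete.wait(timeout)

    def stop(self):
        self._stopping = True
        self._wake_up.set()
        self._thread.join()

    def _run(self):
        while True:
            self._wake_up.wait()
            with self._lock:
                self._wake_up.clear()
            if self._stopping:
                return
            self._parse_index()
            with self._lock:
                # Only signal completion if no new invalidation arrived meanwhile.
                if not self._wake_up.is_set():
                    self._parsing_complete.set()
```

Callers that need an up-to-date projection wait on the completion event, so they never read a projection that is halfway through being rebuilt.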
When woken the parsing thread will:
This project is used specifically to test the memory and performance of parsing the index and building the projection. There are three tests that can be run: ValidateIndex, RebuildProjection, and ValidateModifiedPaths. The IProfilerOnlyIndexProjection interface is used to expose the methods for use in this project only. Options can be used to limit which tests run. Each test runs 11 times, skipping the first run and taking the average of the last 10. Memory is tracked and displayed as well to make sure it stays consistent.
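The timing approach (11 runs, discard the first, average the remaining 10) can be sketched in Python as below. The function name and parameters are hypothetical; the injectable `clock` exists only so the sketch can be checked deterministically.

```python
import time


def average_run_time(test, total_runs=11, warmup_runs=1, clock=time.perf_counter):
    """Run `test` total_runs times and average all but the warm-up runs."""
    durations = []
    for _ in range(total_runs):
        start = clock()
        test()
        durations.append(clock() - start)
    # The first run pays one-time costs (cold caches, JIT), so it is skipped.
    measured = durations[warmup_runs:]
    return sum(measured) / len(measured)
```

Dropping the first run keeps one-time startup costs from skewing the average.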