website/docs/dev/internals/copytracing.md
Copy tracing is a technique used to efficiently account for file copies and
renames when comparing histories. It is used for diff commands and merge-related
operations, such as rebase, graft, and merge. It simplifies resolving merge conflicts,
especially when large refactors, like directory renames, occur in frequently updated
repositories. Below is an example of a rebase involving file renames:
$ sl
@ d78558192 1 second ago alyssa
│ update b
│
o 2b089a0d8 15 seconds ago alyssa
│ mv a -> b
│
│ o b0d1b083d 36 seconds ago alyssa
├─╯ update a
│
o 5b0d97d5a 46 seconds ago alyssa
add a
Without copy tracing, sl previously had to ask about renamed or copied files
during rebases:
$ sl rebase -s b0d1b083d -d d78558192
rebasing b0d1b083d791 "update a"
other [source (being rebased)] changed a which local [dest (rebasing onto)] is missing
hint: if this is due to a renamed file, you can manually input the renamed path
use (c)hanged version, leave (d)eleted, or leave (u)nresolved, or input (r)enamed path?
With copy tracing, the merge just works, despite there being copied or renamed files:
$ sl rebase -s b0d1b083d -d d78558192
rebasing b0d1b083d791 "update a"
merging b and a to b
b0d1b083d791 -> 92444cbb366b "update a"
Historically, Sapling has used two copy-tracing solutions. However, as our monorepos grow (tens of millions of files, tens of millions of commits), the previous solutions have become too slow for production use:
O(M * N * H), where N and H are huge.
O(M * H), where H remains large and there is a large
constant factor for reducing N to K.Before we explore bisect-based copy tracing, let's first examine how Git's rename detection works. Git's rename detection is similar to the heuristics copy tracing mentioned earlier, but it includes additional heuristics and strategies to enhance performance, such as "Remembering previous work", "Exact renames".
O(M * S), where M is the same as above,
S is the complexity of file content similarity computation.O(M * N * S).Bisect-based copy tracing is built to achieve the following desired properties:
O(M * log H) time complexity, it bisects the file history
rather than scanning commits sequentially.The problem that copy tracing solves is: given two commits, C1 and C2, and a path P1 in C1, we need to find the renamed path P2 in C2.
This problem required a new algorithmic design to scale efficiently. The basic idea is to break the problem into two steps:
C3 that deletes P1 in the C1 to C2 range.C3, find what path P1 was renamed to. If that path exists in C2,
then we’re done. Otherwise recursively trace renames in the C3 to C2 range.The efficient bisect is based on the Segmented Changelog we developed for lazy commit graph downloading and improving DAG operations, please check this blog post to learn more about Segmented Changelog.
Since Sapling can efficiently trace rename commits by bisecting the history, and then find the renames in a rename commit, we don't need heuristics to reduce the large number N on destination side. This approach allows Sapling to detect renames that would otherwise be missed by heuristics-based methods.
We made the rename detection inside a commit abstracted. Whether it’s Sapling’s tracked rename, or Git’s implicit content similar rename, or a combination of them, they fit in the same abstraction and can be flexibly configured.
Typical content similarity libraries often degrade to O(N^2) in the worst case,
where N is the line count (O(N^2) is the worst case for the Myers diff algorithm).
Our approach, xdiff::edit_cost, imposes a max cost limit, reducing the
complexity to O(N).
When renames cannot be found, for example, file a.txt was renamed to a.md
and then deleted on the destination branch, the new copy tracing can identify
both the commit that renamed the file and also the commit that eventually
deleted it. This allows us to provide additional context to help resolve the
conflict:
$ sl rebase -s 108b59d42 -d a1fcdc96b
...
other [source (being rebased)] changed a.txt which local [dest (rebasing onto)] is missing
hint: the missing file was probably deleted by commit 7f48dc97d540 with name 'a.md' in the branch rebasing onto
use (c)hanged version, leave (d)eleted, or leave (u)nresolved, or input (r)enamed path?