etc/corpus/README.md
gitoxide uses a corpus of popular git repositories to validate its own algorithms against.
This file contains all information needed to build a corpus of varying sizes and to run gitoxide against it.
The corpus is a set of git repositories to run algorithms against. The set is a filtered list…
which is then cloned to disk into a directory structure that mirrors its URL.
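The URL-mirroring layout can be sketched as a small path-mapping function. This is only an illustration based on the `<corpus>/github.com/<owner>/<name>.git` paths used in the manual clone commands of this file; the function name `corpus_path` and the `/data/corpus` base path are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def corpus_path(corpus_root: str, repo_url: str) -> PurePosixPath:
    """Map a repository URL to its bare-clone location below `corpus_root`."""
    parsed = urlparse(repo_url)
    # Drop surrounding slashes and an existing ".git" suffix so it is not doubled.
    repo = parsed.path.strip("/").removesuffix(".git")
    # Result has the shape <corpus>/<host>/<owner>/<name>.git
    return PurePosixPath(corpus_root) / parsed.netloc / f"{repo}.git"


print(corpus_path("/data/corpus", "https://github.com/fz139/vigruzki"))
# prints /data/corpus/github.com/fz139/vigruzki.git
```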
The ~1.8GB file can be downloaded from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars (account required). Once downloaded, it has to be converted to JSONL for consumption.
# convert the downloaded JSON to JSONL, written into the same directory
json-to-jsonl.sh repo_metadata.json
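What `json-to-jsonl.sh` does internally is not shown here. As a rough illustration of the conversion, the sketch below assumes the downloaded file is a single top-level JSON array (an assumption, not confirmed by this file) and writes one record per line. The function name `json_array_to_jsonl` is hypothetical.

```python
import io
import json
from typing import IO


def json_array_to_jsonl(src: IO[str], dst: IO[str]) -> int:
    """Write each element of a top-level JSON array as one line; return the count."""
    # Assumption: the input is one JSON array. json.load reads it fully into memory.
    records = json.load(src)
    for record in records:
        dst.write(json.dumps(record) + "\n")
    return len(records)


# Tiny in-memory demo with made-up records.
source = io.StringIO('[{"stars": 7}, {"stars": 5}]')
target = io.StringIO()
print(json_array_to_jsonl(source, target))
print(target.getvalue(), end="")
```

Since the whole array is loaded at once, a file of ~1.8GB would need a correspondingly large amount of memory with this approach; a streaming parser would be the alternative.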
To skip all repositories larger than 5GB and fill up a given disk-space limit, one can use the Jupyter Notebook at https://www.kaggle.com/code/pelmers/explore-github-repository-metadata (account required).
Add the following snippet to the notebook and adjust limit to your needs.
# diskUsageKb is given in KB, so both thresholds are expressed in KB
five_gb_in_kb = 5 * 1024 * 1024
# 3500GB, i.e. roughly 3.5TB
limit = 3500 * 1024 * 1024
# Order by 'stars' column and filter by 'diskUsageKb'
df = df[df['diskUsageKb'] < five_gb_in_kb]
df_sorted = df.sort_values(by='stars', ascending=False)
# Calculate how many entries would fit into 3.5TB
disk_usage_cumsum = df_sorted['diskUsageKb'].cumsum()
(disk_usage_cumsum <= limit).sum()
In the example above, one would manage to fit 68568 repositories into 3.5TB.
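For readers without the notebook at hand, the same selection logic can be sketched in plain Python. The field names `stars` and `diskUsageKb` are taken from the snippet above; the function name `count_fitting` and the sample records are made up for illustration.

```python
from itertools import accumulate


def count_fitting(repos, max_repo_kb, limit_kb):
    """Count how many of the most-starred repos fit into `limit_kb` in total."""
    # Skip repositories that are too large on their own.
    small_enough = [r for r in repos if r["diskUsageKb"] < max_repo_kb]
    # Most popular first, mirroring sort_values(by='stars', ascending=False).
    by_stars = sorted(small_enough, key=lambda r: r["stars"], reverse=True)
    # Running total of disk usage, the equivalent of cumsum().
    running_total = accumulate(r["diskUsageKb"] for r in by_stars)
    return sum(1 for total in running_total if total <= limit_kb)


# Made-up sample data, only to illustrate the computation.
sample = [
    {"stars": 50, "diskUsageKb": 400},
    {"stars": 90, "diskUsageKb": 300},
    {"stars": 10, "diskUsageKb": 9000},  # filtered out: too large
    {"stars": 70, "diskUsageKb": 500},
]
print(count_fitting(sample, max_repo_kb=1000, limit_kb=1000))  # prints 2
```

The returned count is the value to pass to `head -n` when selecting repositories from a JSONL file that is sorted the same way.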
Run head -n 999 repo_metadata.sample.jsonl | ./clone-repos.sh <corpus> to clone into the given <corpus> location, or use any other invocation with
your own repo_metadata.jsonl and the computed number of repos to include, as in head -n <your-count> <your.jsonl>.
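The implementation of `clone-repos.sh` is not shown in this file. As a dry-run illustration of the idea, the sketch below takes the first `count` JSONL lines and builds one bare-clone command per record, mirroring the manual `git clone --bare` invocations in this document. The field name `nameWithOwner` and the function name `clone_commands` are assumptions; adjust them to the actual JSONL schema.

```python
import json
from itertools import islice


def clone_commands(jsonl_lines, corpus_root, count, name_key="nameWithOwner"):
    """Yield one `git clone --bare` argv per record for the first `count` lines."""
    for line in islice(jsonl_lines, count):
        # Assumed field holding "<owner>/<name>"; not confirmed by this README.
        name = json.loads(line)[name_key]
        url = f"https://github.com/{name}"
        yield ["git", "clone", "--bare", url, f"{corpus_root}/github.com/{name}.git"]


# Made-up input lines; only the first two are used, like `head -n 2`.
lines = [
    '{"nameWithOwner": "fz139/vigruzki"}',
    '{"nameWithOwner": "archlinux/svntogit-community"}',
    '{"nameWithOwner": "ignored/because-of-count"}',
]
for argv in clone_commands(lines, "/data/corpus", count=2):
    print(" ".join(argv))
```

Printing the commands instead of executing them makes it easy to review the target paths before any disk space is used.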
Invoke git clone --bare https://github.com/NagatoDEV/PlayStation-Home-Master-Archive <corpus>/github.com/NagatoDEV/PlayStation-Home-Master-Archive.git (after replacing <corpus> with your base path)
to obtain one sample of a huge repository with a lot of assets and other binary data, whose tree spans more than 440k files.
That way, we also get to see what happens when we have to handle huge binary files in massive trees.
Another massive tree, along with more than 1.3m commits, comes with this invocation:
git clone --bare https://github.com/archlinux/svntogit-community <corpus>/github.com/archlinux/svntogit-community.git
This repo has 100MB+ files with a lot of append-only changes to them, giving it a very imbalanced delta-tree that triggers worst-case behaviour which needed special mitigations:
git clone --bare https://github.com/fz139/vigruzki <corpus>/github.com/fz139/vigruzki.git
All repos should be topped off with…
cd <corpus>
for d in github.com/archlinux/svntogit-community.git github.com/NagatoDEV/PlayStation-Home-Master-Archive.git github.com/fz139/vigruzki.git; do
  # create an index from the tree of HEAD
  git -C "$d" read-tree @
  # write a commit-graph covering all reachable commits
  git -C "$d" commit-graph write --no-progress --reachable
done
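The same finishing steps can be sketched in Python, which may be convenient when the list of repositories is computed rather than typed by hand. The git sub-commands and flags are those of the loop above; the function names `finalize_commands` and `finalize` and the `dry_run` switch are hypothetical. Unlike the shell loop, this sketch stops at the first failing command when it actually runs git.

```python
import subprocess
from pathlib import Path


def finalize_commands(repo_dir):
    """Return the two git invocations from the shell loop for one repository."""
    repo = str(repo_dir)
    return [
        ["git", "-C", repo, "read-tree", "@"],
        ["git", "-C", repo, "commit-graph", "write", "--no-progress", "--reachable"],
    ]


def finalize(corpus_root, relative_repos, dry_run=True):
    """Print (dry run) or execute the finishing commands for each repository."""
    for rel in relative_repos:
        for argv in finalize_commands(Path(corpus_root) / rel):
            if dry_run:
                print(" ".join(argv))
            else:
                # check=True raises CalledProcessError on a non-zero exit code.
                subprocess.run(argv, check=True)


# Dry run only: prints the commands without touching any repository.
finalize("corpus", ["github.com/fz139/vigruzki.git"])
```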
gix commands by hand

Sometimes it's interesting to try a new command against all available repositories to see if it fails:
ein t find <corpus> | xargs -P10 -I {} bash -c 'echo {}; gix -r {} <command>'
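When the goal is to collect the failing repositories rather than read through interleaved output, a small Python driver can play the role of `xargs -P10`. This is a sketch under assumptions: the repository paths are expected to come from a list (for instance the lines printed by `ein t find <corpus>`), and the names `run_on_all` and `demo_argv` are hypothetical. The demo uses the Python interpreter as a stand-in so it runs without gix installed; a real run would return `["gix", "-r", repo, ...]` from the argv function instead.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_on_all(repos, argv_for, workers=10):
    """Run argv_for(repo) for every repo in parallel; return the repos that failed."""

    def run_one(repo):
        # Capture output so parallel runs do not interleave on the terminal.
        result = subprocess.run(argv_for(repo), capture_output=True, text=True)
        return repo, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_one, repos))
    # A non-zero exit code counts as failure.
    return [repo for repo, code in results if code != 0]


def demo_argv(repo):
    """Stand-in command: exits with 1 for paths containing 'bad', else 0."""
    exit_code = 1 if "bad" in repo else 0
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


print(run_on_all(["a.git", "bad.git", "c.git"], demo_argv))  # prints ['bad.git']
```

Threads are sufficient here because each worker only waits for a child process; the actual work happens in the spawned commands.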
gix corpus

The corpus sub-command runs specifically implemented commands against the corpus in parallel and stores the results in a local SQLite database for
later comparison.
TBD