HowTo rewrite history

How to remove binary large objects (BLOBs) from git history while keeping them available afterwards.

Disclaimer: These are not step-by-step instructions covering every possible case, but rather an overview of the tools you can use to solve this problem. Consult the git documentation and related documentation if your case deviates from this example.

Public/Shared Repos

This is the more general option: the approach listed for private repos is not generally suitable if the data belonging to your repository has to stay publicly available.

  1. Find out which binary files you have committed in your repo
    • Command-line tools like ncdu help in finding large files in your working tree.
    • For the purpose of this example, let's say you found data/my_training_set.bin.
  2. Backup the BLOB outside the repo
    • zip ../data.zip data/my_training_set.bin
  3. Check the BLOB's history
    • tig data/my_training_set.bin
  4. Remove the file from your history
    • 4.1 In a small repository
      • git filter-branch --tree-filter 'rm -f data/my_training_set.bin'
      • With --tree-filter, filter-branch checks out every commit in order to run the command, so it performs poorly on histories with many commits.
    • 4.2 In a large repository
      • Find the commit ID where data/my_training_set.bin was introduced, e.g. with tig data/my_training_set.bin
      • git rebase -i <commit-ID>~1
      • e.g. git rebase -i 272e7241548d564c3b13f15865cc5fb3c8058e82~1
      • In the editor that opens, change pick to edit on the line with your commit ID, then save and quit. The rebase stops at that commit.
      • git rm data/my_training_set.bin
      • git commit --amend
      • git rebase --continue
      • Repeat 4.2 if you committed changes to the BLOB later in your history.
  5. Distribute your new history
    • git push --force
    • In case of an error while pushing, make sure your repository settings in GitLab allow you to force push.
  6. Distribute your data.zip
    The university offers a limited set of options for distributing large files publicly:
    • faubox
    • wwwcip.cs.fau.de/~<your_idm_username>, which points to ~/.www/ in your CIP home directory, if you have access to the CIP pools. You can also add symlinks to /proj/ciptmp/<your_idm_username>/ for more space, but beware that ciptmp is not backed up.
    • If neither of these is suitable for you, ask your affiliated chair to provide a platform for distributing binary files.
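
The steps above can be tried out safely on a throwaway repository before touching a real one. The following is a minimal sketch, not a definitive procedure: it builds a three-commit demo repo in a temporary directory, lists the largest blob in the history, backs the file up, removes it with the filter-branch command from step 4.1, and force pushes the result. The demo identity, the local bare repository standing in for GitLab, and the use of cp instead of zip for the backup are assumptions made for the example.

```shell
#!/bin/sh
# Throwaway demo: build a tiny repo, then strip one binary from its history.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning

workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare remote.git            # local stand-in for the GitLab remote (assumption)
git init -q demo
cd demo
git config user.name "Demo User"         # hypothetical identity for the demo commits
git config user.email "demo@example.invalid"
git remote add origin ../remote.git

# Three commits; the second one introduces the binary.
echo "hello" > README
git add README
git commit -q -m "Add README"
mkdir data
head -c 1048576 /dev/zero > data/my_training_set.bin
git add data
git commit -q -m "Add training set"
echo "more" >> README
git commit -q -am "Extend README"
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# Step 1: find the largest blob anywhere in the history, not just on disk.
largest=$(git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk '$1 == "blob" { print $2, $3 }' \
  | sort -rn | head -n 1)
echo "largest blob: $largest"

# Step 2: back up the file outside the repo (cp here; the text uses zip).
cp data/my_training_set.bin ../my_training_set.bin.bak

# Step 4.1: rewrite every commit reachable from HEAD without the file.
git filter-branch --tree-filter 'rm -f data/my_training_set.bin' HEAD >/dev/null 2>&1

# Step 5: publish the rewritten history.
git push -q --force origin "$branch"

# Collect a few facts to check the result.
remaining=$(git log --oneline -- data/my_training_set.bin | wc -l | tr -d ' ')
commits=$(git rev-list --count HEAD)
backup_size=$(wc -c < ../my_training_set.bin.bak | tr -d ' ')
local_head=$(git rev-parse HEAD)
remote_head=$(git -C ../remote.git rev-parse "$branch")
echo "commits touching the file: $remaining, total commits: $commits"
```

Note that filter-branch keeps the commit that introduced the file as an empty commit unless you pass --prune-empty, and that it stores the old history under refs/original/ in the local repository. On a shared branch, git push --force-with-lease is a more cautious alternative to --force, since it refuses to overwrite commits you have not fetched yet.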

Private Repos

In addition to the steps for public repos, git annex provides useful functionality for managing binary data in a git repository. See annex documentation. The webdav remote can be used to integrate faubox as your annex storage. See annex webdav documentation and faubox webdav documentation. However, git annex would need a GitLab-like platform with annex support to be suitable for public repositories. FAU currently offers no such platform.
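
As a rough sketch of what the annex setup with a webdav remote could look like: the commands below are wrapped in a function so that nothing runs until you call it inside your repository. The function name, the remote name faubox, the repository description, and the URL argument are assumptions for illustration; take the actual WebDAV URL from the faubox webdav documentation. git annex must be installed, and it reads the WebDAV credentials from the environment variables WEBDAV_USERNAME and WEBDAV_PASSWORD.

```shell
# Sketch only: defines a helper, does not execute anything by itself.
# Usage (hypothetical): setup_annex_webdav "<WebDAV URL from the faubox documentation>"
setup_annex_webdav() {
    webdav_url=$1   # placeholder argument, not a real FAU endpoint

    # Turn the current git repository into an annex repository.
    git annex init "my working copy"

    # Let annex track the binary instead of committing its content to git.
    git annex add data/my_training_set.bin
    git commit -m "Track training set with git annex"

    # Register the WebDAV storage as a special remote.
    # WEBDAV_USERNAME and WEBDAV_PASSWORD must be exported beforehand.
    # encryption=none stores the content unencrypted on the remote.
    git annex initremote faubox type=webdav url="$webdav_url" encryption=none

    # Upload the file content to the remote.
    git annex copy data/my_training_set.bin --to faubox
}
```

After this, the git history only contains a small pointer to the file, while the content itself lives in the annex and on the remote.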

Further Thought