How to remove unwanted files from your Git repository

,
Author information: Dan Braghis , Principal Engineer and Wagtail Consultant , Post information: , 3 min read ,
Related post categories: Digital products ,

Git is the version control system of choice at Torchbox. It is powerful, fast, efficient and can be used on both small and very large projects.

A rule of thumb for any repository is to keep uploaded media, large files and sensitive information out. Saying this, accidents do happen. Due to an accidental commit, our website repository size recently increased by 8x to a staggering 700MB. These are the steps we took to clean it up.

When it happened, Nick helpfully suggested the following radical approach:

Nick's nuclear git repo clean suggestion
Nick's radical suggestion

These are the actual steps we took to clean it up:

Revert and rebase

First, we tried the simplest approach: revert and rebase the branch.

This would have been an ideal solution, were it not for Git’s guarantee of 'what goes in, will come out'. In simple terms, each commit id depends on all of the commits that came before it. To change a commit, you need to go through all of its history - from oldest to newest. Git rebase will rewrite the history, but to delete a file for good you'll need to remove all copies from the entire tree.

Git filter-branch

To apply filters, such as removing a file when rewriting the commit history, you must use Git filter-branch. It's the default go-to method for repository cleanups.

Git filter-branch runs a filter that removes the unnecessary files. You can then manually remove all original references, expire all the records in the Git replay log and run the garbage collector to the tainted data.

While it is a powerful tool, filter-branch takes a very long time on repositories with many commits, as it steps through every commit in your repository. It can also be fiddly.

BFG Repo cleaner

Enter BFG. The BFG is an alternative to the Git filter-branch method. It is written in Scala and can be anywhere from 10 to 50 times faster (it's simpler to use too). You can remove files, directories, anything larger than a certain size and even run text replacement in files.

The typical workflow is:

This creates a mirror of the culprit repository, runs BFG to strip out anything larger than 100MB, expires old records and runs the Git garbage collector to strip the unwanted dirty data.

In our case, we ran:

The above removed the media.old directory and files, but protected anything that was in the master branch which we knew to be clean.

As a result, the repository went from 696MB back down to 85MB. Not a bad result at all.

Post-cleanup

Once the repository was clean again, we had to ensure everyone else’s copy was fixed too.

To do this, we deleted the old repository and cloned it again.

It is important to keep in mind that if any team members have work in progress, it is best to rebase any branches they created off the old repository. A single push or a merge can pollute the history and undo the cleanup.

What can you do to prevent this from happening

Avoid the catch-all commands git add . and git commit -a on the command line. Instead, use git add the_filename.ext, git rm the_filename.ext to individually stage files. If you are uncomfortable with the command line, use a visual tool such as Tower, GitHub Desktop or GitKraken and check exactly what will be committed.

Useful resources

Finally, here is a list of useful articles and tools:

,
Author information: Dan Braghis , Principal Engineer and Wagtail Consultant , Post information: , 3 min read ,
Related post categories: Digital products ,