Using Git for Incremental Backups
Secure, incremental backups of folders of text-based data

21 December 2017

A few months ago, I decided that I was done with letting Google keep my email archive. Don’t get me wrong, I still think that Gmail is a wonderful tool, but I wanted to have more control over my own data. This meant setting up scripts to check Gmail for new emails on a regular basis and download them.

I’ll skip over the details of the downloading and get to what I ended up with: a directory full of thousands of plain text files. I chose to store my emails using the Maildir standard, which means each email is in its own file. I use Gnome Evolution to read my emails, and I was immediately surprised and pleased by how snappy an email client can be if all of the emails are already on your hard drive.

With all of these emails stored locally, though, I needed to think about how to protect them from being lost. I’ve written about backups before, and since I didn’t want to lose my email archive I needed to set up regular backups. On the other hand, my emails sometimes contain personal information, and I’d rather not just put them out into the world as is.

After some reading, I came up with a solution and a simple shell script to implement it. First, I would put my emails into a Git repository. If you don’t know what a Git repository is, I’ve written an introduction to version control. I then use a lesser known feature of Git called bundles to make incremental backups of my emails. I use GPG to encrypt the bundles, and then I upload them to a cloud hosting site.

Let’s break down each of those pieces.

Git Bundles

The first thing that my daily backup script does is add any changes I’ve made to a Git repository. The commit it creates records the changes made to the files since the last commit. If I back up the commits, I have a backup of the data.

This is actually a slightly simpler problem to tackle, since my Git repo will only ever be appending new commits. Even deleting an email, which means recording that a file has been removed, is still just the addition of a commit in the Git repository. To further simplify matters, I’m not using branching at all here, so I only have one linear path of Git commits in the repository to worry about.

Git bundles are files that Git can create to package up a set of commits. For example, if you enter this command:

git bundle create master.bundle master

then it will create a single file, called master.bundle, which contains all of the commits on the master branch. If you’re creating a backup of the master branch, then you just need to copy that bundle somewhere safe.

But I promised you incremental backups. That means I need to create a bundle that only has the changes since the last backup. This is a very similar process.

git bundle create start-to-master.bundle start..master

This will create a bundle that has everything in master, except anything that is also in start. If start.bundle was the last backup, then I can recover my repository if I have both start.bundle and start-to-master.bundle.
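
To get the data back out, Git treats a bundle much like a read-only remote. Here is a rough sketch of the recovery, assuming start.bundle was itself created from master back when master pointed at start (the directory name recovered is just for illustration):

# Clone the full bundle to get everything up to "start"...
git clone -b master start.bundle recovered
cd recovered
# ...check that the incremental bundle applies cleanly on top...
git bundle verify ../start-to-master.bundle
# ...and pull it in to bring master up to date.
git pull ../start-to-master.bundle master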

Encryption with GPG

GPG, the GNU Privacy Guard, is a program for encrypting files. If I get into explaining all of the important parts of GPG, then we won’t ever get back to the backup script. The short version is that if I take the Git bundle and push it through GPG, I get out a version that I will be able to decrypt and read, but that nobody else can read without the key.

Importantly, this means that after I upload my backups to the Internet, if somebody somehow gets access to them, they still can’t read the contents.
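
As a small sketch of what that looks like on the command line (the recipient address is a placeholder for whichever key you encrypt to):

# Encrypt the bundle for your own key; this writes master.bundle.gpg.
gpg --recipient you@example.com --encrypt master.bundle
# Later, turn it back into a usable bundle with your private key.
gpg --output master.bundle --decrypt master.bundle.gpg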

If you’re interested in learning more about GPG, check out the GPG website.

Rsync to CloudFiles

When I initially set up this script, it sent the backups to Backblaze’s B2 storage. It worked well, once I had installed a separate command line program to talk to their API. I later discovered an option that works with the excellent set of tools that come preinstalled with most Linux distributions: SSH keys and Rsync!

CloudFiles is a file hosting service provided by the South African company CloudAfrica. For me, the killer feature of CloudFiles is that I can give them the SSH public key for my computer and use Rsync to upload or download files. I would recommend CloudFiles to anyone who wants a drive somewhere in the cloud that they can just Rsync their data to.
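
An upload is then a single Rsync call over SSH. The local path, user and host below are placeholders, not CloudAfrica’s real details:

# Push every encrypted bundle in the backup folder to the remote drive.
rsync -avz ~/mail/backup/*.bundle.gpg user@cloudfiles.example:mail-backups/
# Or pull everything back down again.
rsync -avz "user@cloudfiles.example:mail-backups/" ./restored-bundles/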

Bashing it all Together

Here is the script that I wrote to pull this all together. It’s triggered on a regular schedule, more or less daily.


#!/bin/bash
# This will make the script as a whole exit when it first encounters
# an error.
set -e

echo "Starting email backup"

# This is the directory that I'm backing up. (The path here is a
# placeholder; change it to wherever your maildir lives.)
cd "$HOME/mail"

# It will create the bundles in this subdirectory. Make sure you've
# added this subdirectory to your .gitignore file.
BACKUP_DIR="$PWD/backup"

# If there aren't any backups yet, create the folder. This will be the
# first backup, and also the first time after you restore from a
# backup.
if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup directory does not exist. Creating it"
    mkdir -p "$BACKUP_DIR"
fi

# Create an automated commit of the latest changes, but only if
# there's something to commit.
git add -A
if [ "`git diff --cached --name-only`" ]
    git commit -m "Automated commit from backup util"

# I'm calling the commits here checkpoints, because I can't help in my
# head thinking of it as saving my progress in a game. It's really
# just the id of the last commit.
CURRENT_CHECKPOINT=`git rev-parse --verify HEAD`
# Notice that the filename contains a timestamp, such that if I want
# these in chronological order I can sort by filename.
BUNDLE_NAME="mail."`date +%Y%m%d%H%M%S`"."$CURRENT_CHECKPOINT".bundle"
# The bundle is created inside the backup folder.
BUNDLE="$BACKUP_DIR/$BUNDLE_NAME"

# This is the most recently created bundle. I'm keeping all of the
# bundles in the backup folder so that I can do this.
LAST_BUNDLE=`find $BACKUP_DIR -name '*.bundle' | sort -r | head -n 1`
if [ "$LAST_BUNDLE" == "" ]
    # There was no previous bundle, so this bundle just gets all of
    # the commits.
    echo "first backup bundle"
    git bundle create "$BUNDLE" HEAD
    # There's a previous bundle, so we can exclude the commits from
    # that bundle.
    echo "basing backup on previous bundle, $LAST_BUNDLE"
    LAST_CHECKPOINT=`git bundle list-heads $LAST_BUNDLE | cut -d' ' -f1`
    echo "last commit was $LAST_CHECKPOINT"
        git bundle create "$BUNDLE" $LAST_CHECKPOINT..HEAD
        echo "nothing new to backup"

# If there wasn't anything new in the bundle, we wouldn't have created
# it. So only if there's something new, then we need to upload.
if [ -f "$BUNDLE" ]; then
    # I encrypt the bundle using my GPG public key.
    echo "Encrypting bundle"
    gpg -r justin -e "$BUNDLE"

    # Rsync the gpg file over to CloudFiles! Because of the fickle
    # nature of network connections, this is the point which is most
    # likely to fail. If the network is down, Rsync will fail to
    # upload it, and so Rsync won't delete it. The next time the
    # script runs, when it gets here, it will upload the bundle that
    # didn't make it last time.
    # (The destination is a placeholder; substitute your own
    # CloudFiles user, host and remote path.)
    rsync -avz --remove-source-files --ignore-existing \
          $BACKUP_DIR/*.bundle.gpg \
          user@cloudfiles.example:mail-backups/
fi

echo "Finished backup utility"


No backup strategy is complete without some way to recover the backed up data. Here is my script to do that.


#!/bin/bash
set -e

# To recover the backups, first we need to get the backups
mkdir /tmp/backup

# The --protect-args part here makes sure the * is actually sent to
# the CloudFiles server, and our local bash doesn't try to expand
# it. Our local bash can't figure out all the files on CloudFiles.
# (The remote path is a placeholder; substitute your own CloudFiles
# user, host and remote path.)
rsync -avz --progress --protect-args \
      "user@cloudfiles.example:mail-backups/*.bundle.gpg" \
      /tmp/backup/

# This will prompt you for your private key password. Most Linux
# distributions come with gpg-agent, so you'll only need to enter the
# password once.
gpg --decrypt-files /tmp/backup/*.bundle.gpg

# This is where we're restoring to. Change as necessary.
mkdir recovery
cd recovery

# This is the git trick. It's just doing a fetch from each Git bundle,
# in order (remember the filenames have a datestamp in them), and
# calling the end of the last one the master branch.
git init
find /tmp/backup/ -name '*.bundle' | sort | xargs -n1 -I'{}' git fetch {}
git checkout -B master FETCH_HEAD

Using Systemd to Run Automatically

Any backup script works best if you set up a schedule to run it automatically in the background and forget about it. On my computer, this is managed by Systemd. It works equally well no matter how you trigger it, so if you prefer something like Cron that’s also great.

It’s simply a file called backup-email.service, which tells Systemd how to run my script. If the backup fails, then the config also tells Systemd how to send me an email about it.

[Unit]
Description=Backup my maildir to cloudfiles
# Send me an email if the backup fails. (The helper unit named here is
# an assumption; substitute whichever failure-notification unit you use.)
OnFailure=status-email-justin@%n.service

[Service]
Type=oneshot
# Placeholder path; point this at wherever the backup script lives.
ExecStart=/home/justin/bin/backup-email.sh
And a file called backup-email.timer, to tell Systemd when to run my script.

[Unit]
Description=Backup email to cloudfiles

[Timer]
# Run roughly daily; Persistent=true runs a missed backup once the
# machine is back on.
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
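
Assuming both files are installed as user units (for example under ~/.config/systemd/user/), wiring the timer up might look something like this:

systemctl --user daemon-reload
systemctl --user enable --now backup-email.timer
# Check when the next run is scheduled.
systemctl --user list-timers backup-email.timer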

The way that I’m using Git in this script assumes that there aren’t ever any branches. Since I let the script do everything with Git here, and don’t use it as a Git repository otherwise, this is a fair assumption. If you’re planning on backing up an active Git repository, you can do that with bundles, but you’ll need to consider which branches you’re interested in.
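
For instance, a bundle that captures every branch and tag, rather than just master, is only a flag away; the incremental version would then need a prerequisite range for each branch you care about:

git bundle create everything.bundle --all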

Git is very good at finding the changes in text files. If the files that you’re working with are not text files, like images, audio, video, or other binary formats, you may find that your Git repository starts to get large over time. For these sorts of files, there may be better options than using Git. Thankfully, Git does work exceptionally well for things like my emails.

The cryptography in this post is a relatively simplistic approach to preventing people from reading the data in the backups. It would still be theoretically possible for an attacker to do nasty things like replacing the backups with something completely different without you knowing. If you’re feeling extra paranoid, you could look into also using a GPG private key to sign the bundles before uploading them.
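
With GPG that could be as simple as signing and encrypting in one pass, reusing the recipient and the $BUNDLE variable from the script above:

# Sign with my private key and encrypt for my public key in one step.
gpg -r justin --sign --encrypt "$BUNDLE"
# Decrypting also verifies the signature.
gpg --output "$BUNDLE" --decrypt "$BUNDLE.gpg"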

That’s all there is to it

I hope this script will help you in thinking about how to set up your own backups to meet your needs. Don’t forget, while you’re at it, to have fun.