Error Reporting from your Systemd Automation
Having stuff magically happen in the background is fantastic, but what about when it goes wrong?

07 November 2017

I’ve been using Linux for a bit more than 5 years now. One of the things that I’ve come to love about it is how easy it is to automate repetitive tasks. For example, in a previous post, I wrote about a system that I set up for syncing files between my computers. I’d decided that I wanted to take more direct control over my files than cloud syncing software was letting me, and I was already using Git extensively to version control projects. This led me to the idea that I could use Git to sync my notes between computers.

The Git script that I wrote has happily been running for more than a year now. The abridged version of how it works is that:

  1. I have a shell script that will automatically commit, pull and push any changes.
  2. I configured Systemd to run that script every 15 minutes.

I use a similar idea for scripts that do backups, download my email, and sync my Emacs org-mode calendar to a Google calendar. There’s a script to do the actual work, and some Systemd configuration files that tell my system how and when to run it.

Something that I didn’t cover in that previous post was how to handle failures gracefully. If the script failed because there was a merge conflict or my network was down, the whole thing would quietly stop working and I wouldn’t know about it.

Today I’d like to share the change I made so that I would know when things went wrong.

Wait, what’s Systemd?

Systemd is a Linux program that manages booting up the system and running system processes. It happens to be the program that Arch Linux, the distribution of Linux that I run at home, uses for this.

You can tell Systemd what to run by creating unit files. These are simple configuration files which tell Systemd how to do a thing. Going back to the Git syncing example, there are two unit files.

The first defines a Systemd ‘service’, autosync-repos.service, which tells Systemd how to run my script for syncing my Git repositories. It isn’t the script itself, just the metadata around running it, like command line arguments, a description of the service, and how nice the program should be. The second is a Systemd ‘timer’, autosync-repos.timer, which tells Systemd to trigger the service on a schedule, every 15 minutes.
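
I haven’t shown the timer file in this post, so here is a rough sketch of what one could look like, wrapped in a small shell helper that writes it out. The write_timer function name, the OnCalendar schedule, and the timers.target install target are my assumptions for illustration, not a copy of my real file.

```shell
#!/bin/sh
# Hypothetical helper: write a timer unit for a service into a directory.
# Usage: write_timer <directory> <service-name-without-suffix>
write_timer() {
    dir=$1
    name=$2
    mkdir -p "$dir" || return 1
    # The heredoc is unquoted so that $name is filled in below.
    cat > "$dir/$name.timer" <<EOF
[Unit]
Description=Run $name.service every 15 minutes

[Timer]
# Assumed schedule: minutes 0, 15, 30 and 45 of every hour.
OnCalendar=*:0/15
# The service to trigger. A timer triggers the service with the
# matching name by default, so this line is only here to be explicit.
Unit=$name.service

[Install]
WantedBy=timers.target
EOF
}
```

Calling write_timer ~/auto autosync-repos would produce ~/auto/autosync-repos.timer. Once Systemd knows about that file, the timer can be started with systemctl --user start autosync-repos.timer.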

If you don’t have Systemd on your computer, other Linux distributions would achieve a similar effect using a Cron job, Mac would use Launchd, and Windows would use the Windows Task Scheduler. In this post I’m specifically talking about Systemd, but you can probably adapt the advice to any of those ecosystems.

I want to send myself an email about it

I decided that the best way to get a notification about the failed script would be an email. I already download my email using a different script, and store it using the Maildir standard, so all I need to do is write a script to inject the notification into that pipeline.

Write a script for when your script fails

The first step to getting an error report when something goes wrong in a script is to write more scripts. This is the one I use. If you call it with the name of a Systemd service as a command line argument, it will send me an email with the log from the last time that job ran.

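#!/bin/sh
# Usage: email-on-failure.sh <systemd-unit-name>
# Builds an email with the unit's status and recent log lines,
# then hands it to procmail on standard input for local delivery.

# uname -n is used for the host name because plain sh does not
# always set $HOSTNAME.
procmail <<EOF
To: justin@worthe-it.co.za
From: systemd@$(uname -n)
Subject: OnFailure Email for $1

# Status
$(systemctl --user status -l -n 1000 "$1")
EOF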

Procmail is the program that takes my downloaded email, applies any email rules for me, and files it into my mail directory. All I need to do in my script is call it and pass it an email on the standard input. I also have a Procmail sorting rule to put these emails into their own folder, so that automated notifications don’t make me miss emails from real people.

It doesn’t matter how you receive your error notifications. Post it to Slack, send yourself a push notification, change your wallpaper… whatever works best for you. I would, however, recommend having an option that works when your Internet is down. Most of the time, my scripts fail because of network troubles. Since Procmail puts the email directly into my mail directory, it doesn’t actually go through the Internet at all.
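
If you don’t use Procmail, the same offline idea can be approximated by writing the message straight into a Maildir folder. This is only a sketch under my own assumptions: deliver_to_maildir is a made-up helper name, and the file naming is a simplified take on the usual time, process id, and host name recipe, so two deliveries from the same process within one second would collide.

```shell
#!/bin/sh
# Hypothetical helper: file a message read from standard input
# into a Maildir folder, without touching the network.
# Usage: some-command | deliver_to_maildir <maildir-path>
deliver_to_maildir() {
    maildir=$1
    # A Maildir folder always has these three subdirectories.
    mkdir -p "$maildir/tmp" "$maildir/new" "$maildir/cur" || return 1
    # Simplified unique name: seconds since the epoch, process id, host name.
    name="$(date +%s).$$.$(uname -n)"
    # Write into tmp/ first and then move into new/, so that a mail
    # reader never sees a half-written message.
    cat > "$maildir/tmp/$name" || return 1
    mv "$maildir/tmp/$name" "$maildir/new/$name"
}
```

A notification script could pipe its message into deliver_to_maildir with the path of a notifications folder instead of piping it into procmail.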

Wrap it in a Systemd instantiated unit

The next thing we need is another Systemd service that can run our notification script. Importantly, we need this Systemd service to know which other service just failed.

This is an ideal use case for a Systemd ‘instantiated unit’. It’s like a normal Systemd service, but it’s started from a template and takes an argument, called the instance name, when it’s run. It’s perfect for passing in which service failed.

Systemd knows that a unit file is a template because its filename has an @ right before the suffix. When you’re defining the service, you can refer to whatever was passed in after that @ by writing %i.

# Saved in ~/auto/failure-email@.service

# %i is templated out to whatever was put after the
# @ when this was called.
[Unit]
Description=OnFailure email for %i

[Service]
Type=oneshot
ExecStart=/home/justin/auto/email-on-failure.sh %i
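
To make the %i substitution a bit more concrete, here is a small shell sketch that mimics how the instance name is picked out of a unit name. The instance_name function is purely illustrative. Systemd does this for you, and it also handles escaping of special characters, which this sketch ignores.

```shell
#!/bin/sh
# Hypothetical illustration of what %i refers to: the part of the
# unit name between the first @ and the final type suffix.
instance_name() {
    unit=$1
    rest=${unit#*@}             # drop everything up to and including the first @
    printf '%s\n' "${rest%.*}"  # drop the type suffix, such as .service
}

instance_name failure-email@autosync-repos.service   # prints autosync-repos
```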

You make Systemd aware of your new instantiated service by using the link command.

systemctl --user link /home/justin/auto/failure-email@.service 

You can now give the failure email service a whirl. Start it the same way that you would any other Systemd service, but put the name of the service you’re pretending has failed after the @.

systemctl --user start failure-email@autosync-repos.service 

Add it to my other Systemd services

The [Unit] section of Systemd’s service configuration files has an OnFailure property. If you set that to be the name of another Systemd service, it will run that service if the script fails. That’s exactly what we want!

We use some templating to put the name of the current Systemd service as the parameter for our failure service.

# Saved in ~/auto/autosync-repos.service

# %n templates to the current unit's full name, suffix included.
# In this case, %n is autosync-repos.service

[Unit]
Description=Automatically does git push and pull on a number of repos
OnFailure=failure-email@%n.service

[Service]
Type=simple
ExecStart=/home/justin/auto/autosync-repos.sh
Nice=19

[Install]
WantedBy=autosync-repos.target

Make sure that your services return a failure

Systemd will decide if a script failed or succeeded based on its exit code. The convention is that an exit code of 0 is a success. Anything else is a failure.

Luckily, most programs you use as part of your scripts on Linux should already be following this convention. You just need to make sure you bubble it up.

Your script can be written in any language, as long as it follows the rule of returning an exit code of 0 if it succeeded, and a non-zero exit code if it failed.
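
One way a failure can get lost is in a loop: a shell script exits with the status of its last command, so if the last item succeeds, an earlier failure is forgotten. Here is a minimal sketch of bubbling failures up. The run_for_each helper and the sync_repo command in the usage note are hypothetical names, not part of my actual scripts.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per item, keep going after
# a failure, and report a non-zero exit code if any run failed.
# Usage: run_for_each <command> <item>...
run_for_each() {
    cmd=$1
    shift
    status=0
    for item in "$@"; do
        # Remember the failure instead of letting a later success hide it.
        "$cmd" "$item" || status=1
    done
    return "$status"
}
```

A script that ends with something like run_for_each sync_repo ~/notes ~/dotfiles followed by exit $? would then exit with a failure whenever any one of the repositories failed to sync.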

Pros and Cons

The first obvious pro is that if something goes wrong in one of my scripts, I find out about it. It’s been really useful to include the log message of the last run in the emails to help figure out what went wrong.

The con, of course, is that sometimes things go wrong in a way that hits all of the jobs, like the network going down, and then I get a massive pile of emails. This isn’t too bad a problem for me. The notifications are already sorted into their own folder, making it easy to bulk delete them once the issue is dealt with.

Another con is that you need to manually go to any jobs that you want a notification from and copy and paste that OnFailure line. It’s a bit tedious, but at least you only need to do it once.
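
If the copying and pasting bothers you, one possible shortcut is Systemd’s drop-in directories, which let you add settings to a unit without editing its file. The sketch below is not something from my own setup: the add_onfailure_dropin name and the onfailure.conf filename are made up, and it assumes your user configuration lives in ~/.config/systemd/user.

```shell
#!/bin/sh
# Hypothetical helper: add an OnFailure setting to a unit using a
# drop-in file, leaving the unit file itself untouched.
# Usage: add_onfailure_dropin <systemd-config-dir> <unit-name>
add_onfailure_dropin() {
    config_dir=$1
    unit=$2
    dropin_dir="$config_dir/$unit.d"
    mkdir -p "$dropin_dir" || return 1
    # The heredoc is quoted so that %n reaches Systemd unchanged.
    cat > "$dropin_dir/onfailure.conf" <<'EOF'
[Unit]
OnFailure=failure-email@%n.service
EOF
}
```

You could call add_onfailure_dropin ~/.config/systemd/user autosync-repos.service once for each job, and then run systemctl --user daemon-reload so that Systemd picks up the new files.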

Knowing is half the battle

Of course, none of this is actually going to make your scripts more reliable on its own, but hopefully by making it more visible when things go wrong you’ll be in a good position to make improvements.

Have fun!