Performance Tuning in Rust using Benchmarking and Perf
How do you take a program and make it do the same thing faster?

23 September 2018

As Donald Knuth said, “premature optimization is the root of all evil”. What does it mean for us to avoid premature optimization, and instead make only the good, mature optimizations? Luckily, Knuth explained further: the trick is to identify the 3% of the program where most of the execution time is being spent, and spend your time optimizing that. Optimizing any other part of your program is a waste of time, and will probably result in less maintainable code.

In my previous article, Coding for the Win - How I Built a Tower Defence Bot, I discussed the entry I wrote for this year’s Entelect Challenge. My entry to the competition relies on an algorithm called Monte Carlo Tree Search to choose a good move. The more random games my bot could simulate, the better its moves would be. Since the competition rules gave me only 2 seconds in which to choose my move, I needed to make this simulation run as fast as I could. I’ve put up the full source code for my bot on GitHub, if you’re interested in taking a look.

In this article, I’ll be showing the tools that I used to identify those hot spots in my code, and the techniques I used to ensure that my changes actually were making the overall execution speed faster.

Use Perf to find hot spots

Perf is a profiler that hooks into the Linux kernel and provides you with statistics on your program’s runtime performance. It was a good choice for me, since my primary development environment is Linux.

The first thing you need to consider when profiling is setting up your code to work well with a profiler. In Rust, you can choose to compile your code in either ‘debug’ or ‘release’ mode. Debug mode compiles faster, but the resulting binary runs significantly slower. Release mode takes longer to compile, but runs much faster. Usually, when you care about performance, you’ll compile in release mode, so that’s what you need to profile. Unfortunately, compiling in release mode also strips out the ‘debug symbols’ from the compiled code. Without debug symbols, you won’t be able to tell how the compiled assembly code relates to your source code.

You can turn the debug symbols on in release mode by putting these lines in your Cargo.toml file:

[profile.release]
debug = true

The next step is to compile your code in release mode.

cargo build --release

This will place a copy of any binaries you defined in your Rust project into ./target/release. I have two binaries: the default one, which is called in the tournament, and one called perf-test, which has its input data hard-coded so that profiling runs are repeatable.
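As a rough sketch of what such a profiling harness can look like (the GameState struct and simulate function here are stand-ins of my own invention, not the bot’s actual code), a perf-test binary just runs the hot path in a loop against fixed input:

```rust
use std::time::Instant;

// Stand-in for the bot's real game state, which would normally be
// parsed from the match input files.
struct GameState {
    missiles: u32,
}

// Stand-in for one step of the bot's simulation.
fn simulate(state: &mut GameState) {
    state.missiles = state.missiles.wrapping_add(1);
}

fn main() {
    // Hard-code the input so every profiling run exercises the same path.
    let mut state = GameState { missiles: 0 };
    let start = Instant::now();
    for _ in 0..10_000_000 {
        simulate(&mut state);
    }
    println!("Simulated in {:?} ({} steps)", start.elapsed(), state.missiles);
}
```

Profiling a binary like this keeps the recorded samples focused on the simulation itself, rather than on parsing input or waiting for the game engine.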

You need to run perf, passing it your test program as an argument, like so:

perf record -g target/release/perf-test

If all worked correctly, then the output will look something like this:

[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 4.320 MB perf.data (62088 samples) ]

The first time I ran this on my computer, I got an error. I needed to edit some user permissions to give myself access to the kernel’s performance counters to make it work. If this happens to you, the error message is pretty good and includes the steps you need to follow.

Now that the data has been captured, you can inspect it by running:

perf report

(Screenshot: perf report summary — /assets/posts/tower-defence/perf.png)

From this, you can immediately see that a quarter of my time is spent moving missiles around, and another quarter is spent figuring out how to choose a random move.

If you select the top function, you can see how time is being spent in the function where I update my missiles.

(Screenshot: annotated perf report for the missile-update function — /assets/posts/tower-defence/perf-report.png)

Lines like opponent.buildings[health_tier] &= !hits are from my original source code; they appear thanks to the debug symbols we turned back on. From this, I have a fairly good idea of where the majority of my processing time is being spent, and I can take a closer look to see if I can do anything more efficiently.

Use Benchmarking to make sure you’re getting faster

Now that you’ve identified the hot spots in your code, you’re going to make some changes. How do you know if those changes are good? The answer is benchmarks.

A benchmark is a unit test for performance. For me, my benchmark was how many random game simulations I could do in 2 seconds. A larger number of simulations is better. I could run my benchmark, record the number of simulations in 2 seconds, make my changes, and run the benchmark again. If the number of simulations went up, the change was good and so I kept it.

I implemented this benchmark by printing out a measurement at the end of my simulations, behind a Cargo feature flag so that benchmarking could be toggled off when it wasn’t relevant.

#[cfg(feature = "benchmarking")]
{
    // Sum the simulation count across every candidate move.
    let total_iterations: u32 = command_scores.iter()
        .map(|c| c.attempts)
        .sum();
    println!("Iterations: {}", total_iterations);
}
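The same measurement can also be structured as a standalone harness. This is a sketch under my own naming (count_iterations is not a function from the bot): run simulations in a loop until the 2-second budget is spent, then report the count.

```rust
use std::time::{Duration, Instant};

// Run `simulate` repeatedly until `budget` has elapsed, and return how
// many iterations completed. More iterations per budget means a faster
// simulation.
fn count_iterations(budget: Duration, mut simulate: impl FnMut()) -> u64 {
    let start = Instant::now();
    let mut iterations = 0;
    while start.elapsed() < budget {
        simulate();
        iterations += 1;
    }
    iterations
}

fn main() {
    let iterations = count_iterations(Duration::from_secs(2), || {
        // One random game simulation would run here.
    });
    println!("Iterations: {}", iterations);
}
```

Recording this number before and after each change gives a simple pass/fail signal: if the count goes down, revert the change.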

Benchmarking is vitally important because the performance effects of a change aren’t obvious. In some cases, I made a change that I was convinced would make everything faster, but it actually made the overall speed slower!

As Chandler Carruth put it at CppCon, “You only care about performance that you benchmark”.

Knowing what changes to make is a different topic

That’s all for today. I’m going to go into detail on the changes that gave me the biggest performance wins in an article coming soon! For now, remember that when it comes to performance, the answers aren’t obvious.

Make sure you’re using your tools to measure before and after making any change. Use a profiler to identify the parts of your code where you’re spending most of your time, and use benchmarking to ensure that your changes actually make your program faster.