Going Four Times Faster using Multi-Threading
Rust makes writing parallel code safe. Rayon makes it easy.

03 October 2018

Recently, I was trying to make some code I’d written go as fast as possible. The code that I was making run fast was my entry to this year’s Entelect Challenge. With my problem well understood, my benchmarks in place to show me how fast I was currently going, and my profiler pointing at the most computationally expensive parts of my code, I was ready to make things faster.

One way to speed up a program is to do more things at the same time. Modern processors have many cores, allowing them to do multiple things at once. In this competition, the tournament servers had 4 cores. This meant that, by making use of all of the cores available, my program could run up to 4 times more simulations than it could if it was only using one core at a time.

The full source code for my bot is up on GitHub if you want to dig through the code to see how it works.

Writing parallel code is hard

There’s a reason that we don’t always just start with code that makes use of all of the available cores. There are a few pitfalls to be aware of in multi-threaded code.

The first pitfall is race conditions. A race condition is when you have two threads in your program that are running in parallel, both are independently doing the right thing, but somehow the interaction between the two introduces bugs. The usual issue here is two threads trying to modify the same data at the same time.

Code as simple as x = x + 1 can be a bug if two threads try to run it on the same x at the same time.

The second pitfall is considering how you’re going to split the work between the threads. Creating thread, and communicating between them, can be computationally expensive so you need to manage them carefully. You also need to make sure that the work is evenly distributed between the threads so that none of your cores sit idle. If you get this wrong, and spend too much time creating threads and coordinating between them, you can actually make your program run slower instead of faster!

Rust is the best language I’ve used for parallel code

One of the killer features of the Rust programming language is that it has compile time reasoning about parallel code. Programs that have race conditions in them do not compile in Rust.

As an example, say that I wanted to make a reference counted pointer in Rust (this would be called a shared_ptr in C++), and pass a copy of it to a new thread. Rust has two reference counted pointer types in it’s standard library: Rc is faster, but Arc has the necessary checks in place to make it safe to use in a multi-threaded program. If I were to use Rc, when I actually need to use Arc, then I get an error when I compile.

use std::thread;
use std::rc::Rc;

let shared_ptr = Rc::new(10);

let child_thread = thread::spawn(move || {
    println!("Hello world from a different thread, {}", *shared_ptr);
});

The Rust compiler’s approach to ownership and mutability also ensure that you don’t forget to use a mutex when you need to. I’ve written a longer article on how to write your own multi-threaded code in Rust before, but that didn’t address the second pitfall. How do you share the work event between threads?

Rayon to the rescue!

Rayon is a Rust library for writing parallel code more easily. Specifically, if you have a place where you’re currently using an iterator, Rayon makes it easy to convert it into a parallel iterator.

Before I added Rayon, I had a loop that looked like this:

fn simulate_all_options_once<R: Rng>(command_scores: &mut[CommandScore],
                                     state: &BitwiseGameState,
                                     rng: &mut R) {
    command_scores.iter_mut()
        .for_each(|score| {
            simulate_to_endstate(score, state, rng);
        });
}

It would go through all of the moves in the game that I wanted to evaluate, and for each one it would simulate a random game starting with that move. Inside simulate_to_endstate, it adds a count to how many simulations have been won so far, which is why it needs to use a mutable iterator (iter_mut()). This code did what I wanted, but it only used one core.

After adding Rayon, my loop looked like this:

use rayon::prelude::*;
fn simulate_all_options_once(command_scores: &mut[CommandScore],
                             state: &BitwiseGameState) {

    command_scores.par_iter_mut()
        .for_each(|score| {
            let mut rng = XorShiftRng::from_seed(score.next_seed);
            simulate_to_endstate(score, state, &mut rng);
        });
}

This is almost identical code, but by changing the iter_mut() to par_iter_mut(), Rayon will share my loop over multiple threads. Rayon will also, in the background, handle starting up a thread pool, with a sensible number of threads. Rayon will also manage sharing the work between the threads, keeping all of the threads busy until all of the processing is done.

Something else notable is that I moved the creation of my random number generator inside the loop. How did I know that I needed to do this? The Rust compiler refused to compile it until I’d fixed my race condition.

The results

So, what’s the impact of this minor change? When I run the benchmark on my computer, the ‘before’ code sample completes 157 040 game simulations. The ‘after’ code sample completes 488 480! That’s a bit more than 3 times faster! This isn’t quite the theoretical maximum of 4 times faster on 4 cores, but really not bad for importing a library and making a minor code change.

To get really fast code though, just making it run in parallel isn’t enough. There were even larger gains that I found in some of the smaller details of how I constructed my game simulation. That, however, will be the topic of a future article.