Rusty Microphone
A retrospective on a Rust audio signal processing program I wrote

14 August 2017

I am an amateur trumpeter. I started taking trumpet lessons when I was 9 years old, and kept it up all the way through high school. When I started university, my priorities changed and I focused much more heavily on my studies. My trumpet practice fell by the wayside, and I stopped taking lessons.

After the madness of university ended, I picked it up again. Now I’m only playing for my own enjoyment. I’m not pursuing exams, and I’m not taking lessons. Still, I aspire to improve whenever I’m playing.

One area of my playing that needs some work is my intonation.

This post is an explanation on why I wanted to write this program and how it works. If you’re just looking to jump straight to the source code, you can find it here.

What is intonation?

Most people have heard of a note being ‘sharp’ or ‘flat’. The idea is that a note’s pitch can be slightly too high or too low, and that it doesn’t sound great.

Unfortunately, it’s often not as simple as just looking at individual notes. It’s more about being sharp or flat relative to someone else. In my case, I’m primarily worried about being sharp or flat relative to myself.

Say for example I can play a C on its own and call it perfect. There’s nothing to compare it to, so that’s fine. If I then play a D, it could be sharp or flat relative to the C. If it isn’t the correct difference between the C and the D, then we’d say that my intonation is bad.

For a real world example of where this is important, consider that an orchestra will typically all tune to just one note. If I tune my trumpet to the same C as the whole orchestra, but my D isn’t right relative to my C, then my D will sound terrible with the rest of the orchestra.

Hearing the difference is difficult!

I’ve prepared a small demo of why this problem is so difficult for me to solve on my own when I’m playing.

This scale has ‘perfect’ intonation. I created it using Audacity, which is why I can say it’s scientifically exactly on pitch.

This one is a little bit off. I took the original ‘perfect’ scale, and used Audacity to slightly alter some of the pitches.

For me, it’s hard to hear the difference. I think in isolation, it’s hard for most people to hear the difference. However, take a listen to what happens if you play the good and the bad one together.

The moment that you have the correct one together with the incorrect one, it sounds obvious that there’s a problem.

Software engineering is the hammer I walk around with looking for nails, so I decided to write a program to help me see when my intonation is bad. That way, I can use the feedback from my program to train myself to hear when my intonation is off, and over time I can correct it.

Show me the Program!

Let’s start with the basics: what the program looks like when I’m playing the note correctly.

/assets/posts/rusty-mic/rusty-in-tune.png

Right at the top of the window, there’s a dropdown for selecting which microphone to use. This is important, since some computers will have many microphones.

Looking further down, the program names the note that I’m currently playing. The number is which octave it’s in. For example, B♭5 is an octave above B♭4. This isn’t really necessary for me as a musician (I’m not so far off that I don’t know what note I’m playing), but it gives me a nice sanity check to reassure me that the program isn’t just spitting out completely random numbers.

The next thing to notice below that is the stripe in the middle of the screen, and the big black bar. When it looks like this, it means I’m in tune. I’ll start a practice session by picking a note and getting it in tune, adjusting the tuning on the trumpet itself as necessary. After that, the goal is to keep the other notes in tune relative to my original tuning note.

The two graphs are the input signal from the microphone and the auto-correlation of the signal. They aren’t important from a musical point of view, but they help me in debugging issues in the program. We’ll get to why those are important later in this post.

Something that doesn’t come across well in the screenshots is the speed that the screen is updating. I’m redrawing these graphs 60 times per second. There is no noticeable delay between me making a noise on the trumpet and the program responding.

My next example is things going less well. This is how the program looks if I’m flat.

/assets/posts/rusty-mic/rusty-flat.png

The big difference I want to point out is that the line has moved to the left of the screen, and the left side of the black square is now highlighted blue.

Conversely, this is how it looks when I’m sharp.

/assets/posts/rusty-mic/rusty-sharp.png

This time there’s a red box on the right. In both cases, the colour gets brighter the further out I am.

Sometimes when I’m playing, I may be jumping between notes rapidly, so it’s important to give the feedback in such a way that it is as visible as possible.

If you want to try it out yourself, the source code and all instructions on how to compile it are available on my GitHub.

So, how does this program that I wrote in Rust work?

What is Rust?

If you go to Rust’s website, you will encounter the definition that “Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.” That’s a lot of jargon up front, so I have a less formal introductory sentence:

Rust is a programming language for people who currently need to write C or C++, due to system resource limitations or other low level constraints, and are tired of accidentally shooting themselves in the foot.

Sometimes systems programming languages are necessary

I’ve spent most of my career up to this point working in high level languages. Currently, in my day job, I’m writing a combination of Scala and JavaScript. I’ve previously worked in C# as well. High level languages with garbage collectors are incredible for developer productivity and focusing on the problem you are actually trying to solve. However, even though the runtimes of these languages are incredibly efficient (and in most cases will output faster programs than you would have rushed out in C), it is still possible to write more efficient programs in C. It’s simply closer to what the machine is actually going to execute.

The Haskell wiki entry on performance captures my feeling on the matter fairly well. Your first course of action should be to restructure your algorithms at a high level. But when you’ve reached the limits of that, you may need to rewrite your algorithm in C to improve further.

“If the worst comes to the worst, you can always write your critical code in C and use the FFI to call it.”

Rust positions itself as a better alternative for these problems. A language that allows high level abstractions, but at the end of the day compiles down to machine code that has similar performance and memory usage characteristics to a program written in C. If you’ve hit a performance wall and you’re thinking “we should rewrite this critical section in C”, it might be a better choice to rewrite it in Rust instead.

How Does Rust do this?

Rust is built on the same basic concept as C++: Zero-Cost Abstractions. The language allows building up layers of abstraction, which after running them through a compiler will result in the same assembly code as if you’d done it the hard way.

Rust also has no runtime or garbage collection. You manage your memory through RAII, another concept borrowed from C++, meaning that things are removed from memory when they go out of scope. Rust improves on C++’s model here by having the compiler enforce guarantees around memory use and mutability. If you write code that would have resulted in a use-after-free or a data race in C++, it will more often than not be a compile error in Rust. This compile time checking is how they can make bold statements like saying Rust “guarantees thread safety” and “prevents segfaults”.
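
To make the RAII idea a little more concrete, here’s a minimal sketch (not code from this project) of a type that cleans up after itself the moment it goes out of scope:

struct Buffer {
    samples: Vec<f32>,
}

impl Drop for Buffer {
    // Drop is Rust's equivalent of a destructor. It runs automatically
    // when the owning value goes out of scope.
    fn drop(&mut self) {
        println!("freeing {} samples", self.samples.len());
    }
}

fn main() {
    {
        let buffer = Buffer { samples: vec![0.0; 512] };
        println!("using {} samples", buffer.samples.len());
    } // `buffer` goes out of scope here, so its memory is reclaimed

    println!("the buffer is already gone");
}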

Lastly, Rust leverages some of the existing C and C++ compiler infrastructure to get its performance benefits. Rust makes use of the LLVM compiler backend so that it can take advantage of the large ecosystem of compiler optimizations that already exist.

But is Rust Functional?

Rust has been compared to functional languages. Its type system in particular should look familiar to programmers with a functional background. However, at its core it is an imperative language first.

A bit later in this post, when I’m showing snippets of code from the program, you’ll see some areas where Rust has enabled me to write code that looks suspiciously as if it were written in a functional language.

What is sound?

Moving on to the actual problem that I’m trying to solve, we first need to have an appreciation for what sound is, and what that means for musical pitch.

Sound is vibrating air. The air particles, disturbed by me and my trumpet, move forwards and backwards in a longitudinal wave. If you look at a single particle of air, it keeps moving backwards and forwards over the same point, but if you look at the system as a whole you can see areas of higher and lower air pressure moving along.

When these waves hit our ears, we perceive them as sound.

‘Musical’ Sound, Pitch and Frequency

‘Musical’ sound is special in that it’s periodic. To put it differently, noise can move the air particles back and forth in any pattern, but musical sound repeats according to some regular pattern.

The number of times the vibration’s pattern repeats every second is what we call the fundamental frequency of the sound. Just for a sense of time scale, our audible range is somewhere between 50Hz and 20kHz (50 to 20000 repetitions per second). The exact range varies from person to person, and will typically shrink as you get older.

Here’s how this all relates to pitch: as the fundamental frequency of the sound gets higher (more repetitions per second), we experience the pitch as getting higher. If you hear a 262Hz signal, it’s the middle C on a piano keyboard. The D above that is 294Hz, and the B below it is 247Hz.
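
Those numbers aren’t arbitrary. On an equally tempered instrument like a piano, each semitone is higher than the one below it by a factor of 2^(1/12), so twelve semitones add up to an octave (a doubling in frequency). Working from the 262Hz C:

\[ 262\,\text{Hz} \times 2^{2/12} \approx 294\,\text{Hz} \qquad 262\,\text{Hz} \times 2^{-1/12} \approx 247\,\text{Hz} \]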

This makes the mission of my application clear: measure this fundamental frequency to display the pitch of the note I’m playing.

From Sound Waves to Signal Processing

Microphones

The first step is to pull the sound out of the air and into the computer. This can be done through a microphone. There are several types of microphones, but most of them work on the same basic principle: you have something that will be moved by sound, and some way of measuring that movement.

I have a ‘condenser’ microphone, which works by having the sound vibrate one plate of a capacitor. The plate moving closer and further away directly affects how much charge the capacitor can hold (its capacitance), which results in an electrical signal in the connected circuit.

Sampling

Having the sound as an electrical signal is a good first step, but computers can’t work with analogue signals. They need the signal in a digital format! This is achieved through sampling.

Sampling is the process of taking an analogue signal and, at regular intervals, measuring its value and recording it as a digital number. The standard for most computer systems is to take 44100 samples of a signal per second.

Depending on your hardware, the sampling will either be done by your sound card or, as in my case, by specialized hardware on the microphone itself. I plug my microphone in over USB, and as far as my computer knows the samples are the only representation of the sound that exists.
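
To give a rough idea of what those samples are (this is just an illustration, not code from the project), sampling a pure 440Hz tone at 44100 samples per second amounts to evaluating the wave at evenly spaced points in time:

// An illustration of sampling: evaluate a pure 440Hz sine wave at
// 44100 evenly spaced points per second.
fn main() {
    let sample_rate = 44100.0_f32;
    let frequency = 440.0_f32;

    let samples: Vec<f32> = (0..512)
        .map(|n| {
            let t = n as f32 / sample_rate; // time of the nth sample, in seconds
            (2.0 * std::f32::consts::PI * frequency * t).sin()
        })
        .collect();

    println!("first five samples: {:?}", &samples[..5]);
}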

But having the samples merely inside the computer is not enough. I need to get the samples to my program.

PortAudio - one audio API to rule them all

Your operating system can give you access to the samples from the microphone through one of its APIs. Unfortunately, which API and how to use it is tied up in which operating system you’re using.

PortAudio is an open source C library which acts as a proxy for these various audio APIs. PortAudio is used by many cross-platform applications, including some that I happen to use myself, like Audacity and VLC Media Player.

This diagram, taken from PortAudio’s API documentation, shows rather effectively why PortAudio is necessary.

/assets/posts/rusty-mic/portaudio-external-architecture-diagram.png

Rust-PortAudio

Since I’m writing a Rust program, it’s easier for me if I don’t need to call PortAudio (a C library) directly. Luckily, there are already Rust bindings for PortAudio available.

Rust has a package manager and build tool called Cargo, which makes it easy to include other Rust libraries in your project. This is the block in my Cargo.toml for adding the Rust PortAudio bindings to my project.

[dependencies]
portaudio = "0.7.0"

Unfortunately, this doesn’t handle installing PortAudio itself, only the Rust bindings, so you also need to install PortAudio on your computer before this will work, perhaps using a package manager like Pacman, Yum, Apt, or Homebrew.

After you have that setup, this is how you’d get those samples:

// This is a lambda which I want called with the samples
let callback = move |InputStreamCallbackArgs { buffer, .. }| {
    // Do as little as possible in the callback. It needs to complete
    // before more samples arrive. Here, we just pass the samples into
    // a channel.

    // The settings are set outside of this snippet to pass 512
    // samples at a time.
    match sender.send(Vec::from(buffer)) {
        Ok(_) => Continue,
        Err(_) => Complete
    }
};

// Registers the callback with PortAudio
let mut stream = try!(
    pa.open_non_blocking_stream(stream_settings, callback)
);

// Start receiving samples!
try!(stream.start());

// I return the stream so I can stop it later, if the user changes
// microphones
Ok(stream)

Finally, I have the sound in my program. My callback is called with an array of 512 floating-point numbers, roughly every 12 milliseconds (512 samples at 44100 samples per second works out to about 11.6ms).

This snippet shows off some interesting Rust features. You can define anonymous inline functions using vertical bars, as I do with the callback, and then pass the anonymous function into another function. The code here is strongly typed, but the actual types are almost always inferred from the context. Error handling is done through pattern matching on an Ok or Err, with a macro called try making it easy to have your function return early in case of errors.

Another interesting thing here is that almost every function call could return an error. Setting up the stream could fail if the settings ask for a sample rate that your microphone doesn’t support. Starting the stream after this could fail if the microphone was suddenly unplugged.
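
For readers who haven’t seen this style of error handling before, here’s what the Ok/Err pattern looks like in isolation. This is a sketch with a made-up function, not the error handling from the project itself:

// A sketch of Result-based error handling (the function is made up).
// try! returns early from the enclosing function if the inner call
// fails; newer Rust spells this with the ? operator instead.
fn parse_sample_rate(input: &str) -> Result<f32, std::num::ParseFloatError> {
    let rate = try!(input.parse::<f32>());
    Ok(rate)
}

fn main() {
    match parse_sample_rate("44100") {
        Ok(rate) => println!("sample rate: {}Hz", rate),
        Err(e) => println!("could not parse sample rate: {}", e),
    }
}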

From Samples to Pitches

From the code above, we now have the samples from the microphone available to our program! If I play the trumpet and plot the signal, it looks like this:

/assets/posts/rusty-mic/trumpet-signal.png

The next thing we need to do is take this signal and find its fundamental frequency.

Auto-Correlation

I hinted at how we would be doing this earlier. We’re interested in finding out how often the signal repeats itself. There’s a transform that we can do to our signal called auto-correlation. The formal mathematical definition of the auto-correlation is like so:

\[ R(t) = \sum_{n \in \mathbb{Z}} y(n)y(n-t) \]

It looks complicated, but the idea is actually simple. You make a copy of the signal with a time delay. You then compare the signal to its time delayed counterpart and come up with a number representing how similar they are. You repeat this for different time delays, so that in the end you have a function representing how similar a signal is to itself given a time delay of a certain size. If you do this to the trumpet signal, and plot how its similarity to itself varies as you change the time delay, you get this graph.

/assets/posts/rusty-mic/trumpet-correlation.png

Of course, the graph’s maximum is at a time delay of 0s: a signal will never be more similar to a delayed copy of itself than it is to an undelayed one. The second peak is the one we’re interested in. That time delay is how long it takes between signal repetitions. The auto-correlation tends towards zero because we treat all values outside of the 512 samples we have as zeroes.

This is my Rust function for calculating the signal’s auto-correlation.

pub fn correlation(signal: &[f32]) -> Vec<f32> {
    (0..signal.len()).map(|offset| {
        signal.iter().take(signal.len() - offset)
            .zip(signal.iter().skip(offset))
            .map(|(sig_i, sig_j)| sig_i * sig_j)
            .sum()
    }).collect()
}

Functional programmers will be happy to notice that this function does not have any mutable state in it, and does not have any side effects. It depends only on the state passed into it, and will give the same result every time it is called. In other words, this is a pure function.
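
Being pure also makes it easy to check. Here’s a quick sanity check I can run against it (a sketch, not a test from the repository): the auto-correlation of a sine wave with a period of 64 samples should be strongly positive again at a time delay of 64 samples.

// A sanity check for the correlation function above: a sine wave with a
// period of 64 samples should correlate strongly with itself again at
// an offset of 64 samples.
fn main() {
    let period = 64.0_f32;
    let signal: Vec<f32> = (0..512)
        .map(|n| (2.0 * std::f32::consts::PI * n as f32 / period).sin())
        .collect();

    let corr = correlation(&signal);
    println!("offset 0:  {:.1}", corr[0]);  // the maximum
    println!("offset 32: {:.1}", corr[32]); // half a period out: strongly negative
    println!("offset 64: {:.1}", corr[64]); // a full period out: close to the maximum
}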

Finding that peak

Now that we have the auto-correlation, we need some way to find the second peak. To do this we just need to find the maximum value after excluding the first (zero time delayed) peak.

// this gets the index where the first peak is over
let first_peak_end = match correlation.iter()
    .position(|&c| c < 0.0) {
        Some(p) => p,
        None => {
            // Musical signals will drop below 0 at some point.
            // This exits the whole function early with a result of
            // 'no fundamental frequency' if it doesn't
            return None
        }
    };

// This part finds the index of the maximum remaining value
let peak = correlation.iter()
    .enumerate() // Adds the indexes to the iterator
    .skip(first_peak_end) // Skips the first peak
    .fold((first_peak_end, 0.0), |(xi, xmag), (yi, &ymag)| {
        if ymag > xmag { (yi, ymag) } else { (xi, xmag) }
    });

// The fold above returns the index and its value as a tuple. This
// will pull out just the index and ignore the value.
let (peak_index, _) = peak;

Some(sample_rate / peak_index as f32)

Another Rust feature that I want to point out from this code is that in Rust, a variable isn’t just automatically nullable. In fact, there is no null keyword. If you want to model data that can be something or nothing (like the return type of a function that calculates fundamental frequency, but might also say there isn’t one), your return type needs to be an “Option”, which means it could be a “Some” or a “None”. If you don’t do that, it has to have a value. Good riddance, unexpected null reference exceptions!
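
As a small illustration of what that looks like from the calling side (a sketch, not the project’s exact code):

// A sketch of consuming an Option: the caller has to handle the
// 'no fundamental frequency' case before it can use the value inside.
fn describe(fundamental: Option<f32>) -> String {
    match fundamental {
        Some(hz) => format!("fundamental frequency: {:.1}Hz", hz),
        None => String::from("no fundamental frequency detected"),
    }
}

fn main() {
    println!("{}", describe(Some(261.6)));
    println!("{}", describe(None));
}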

Mapping to a pitch

After all of that signal processing, getting a named pitch is a simple matter.

The mathematical relationship between frequency and pitch is actually logarithmic. The first thing I do is transform the frequency to its MIDI number equivalent, which gives it a linear relationship with note names.

// This is based on a music tuning standard called 'A440', which
// literally means "we've decided that A should have a frequency
// of 440Hz"
//
// The midi number of that A (equally arbitrarily) is 69.
//
// All other notes are being defined relative to A, the knowledge that
// octaves are a power of 2 apart, and there are 12 different notes in
// an octave.
pub fn hz_to_midi_number(hz: f32) -> f32 {
    69.0 + 12.0 * (hz / 440.0).log2()
}

From the MIDI number, I know two things. The first is that the nearest integer tells me which note I’m on. I want to print that on the screen in a label, so I wrote this bit of code.

pub fn hz_to_pitch(hz: f32) -> String {
    let pitch_names = [
        "C","C♯","D","E♭","E","F","F♯","G","G♯","A","B♭","B"
    ];

    let midi_number = hz_to_midi_number(hz);
    let rounded_pitch = midi_number.round() as i32;

    let name_index = rounded_pitch as usize % pitch_names.len();
    let name = pitch_names[name_index];
    let octave = rounded_pitch / pitch_names.len() as i32 - 1;

    // fun fact, this format string will be type checked at compile
    // time.
    format!("{: <2}{}", name, octave)
}
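
Calling it with a couple of reference pitches gives back the labels I’d expect (the values below are worked out by hand rather than copied from the repository’s tests):

fn main() {
    // 440Hz is the A above middle C, and 261.6Hz is (roughly) middle C.
    println!("{}", hz_to_pitch(440.0)); // prints "A 4"
    println!("{}", hz_to_pitch(261.6)); // prints "C 4"
}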

The second comes from the fractional part of the MIDI number. That distance to the closest integer is how sharp or flat I am (AKA the whole reason we’re doing this).

I pull that out and multiply it by 100 to get a nice percent-like measure (what musicians call cents) of how I’m doing, between -50% and +50%.

pub fn hz_to_cents_error(hz: f32) -> f32 {
    let midi_number = hz_to_midi_number(hz);
    let cents = (midi_number % 1.0) * 100.0;
    if cents >= 50.0 {
        cents - 100.0
    }
    else {
        cents
    }
}
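
For example, an A that’s a little sharp comes out positive, and one that’s a little flat comes out negative (again, numbers worked out by hand rather than taken from the repository’s tests):

fn main() {
    println!("{:.1}", hz_to_cents_error(440.0)); // 0.0: spot on
    println!("{:.1}", hz_to_cents_error(445.0)); // about +19.6: sharp
    println!("{:.1}", hz_to_cents_error(435.0)); // about -19.8: flat
}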

Drawing it on the Screen

Unfortunately, Rust does not have a de facto standard pure Rust GUI toolkit. It does, however, have bindings for several major C GUI toolkits. For this program, I included GTK.

As with the PortAudio dependency, the Rust bindings are installed using Cargo, but the actual GTK library needs to be installed on the system through some other means.
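
The Cargo.toml entry looks much the same as the PortAudio one did. The version number below is only illustrative; the repository pins the version that the project actually builds against.

[dependencies]
gtk = "0.1" # illustrative version number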

let window = gtk::Window::new(gtk::WindowType::Toplevel);
window.set_title("Rusty Microphone");
window.connect_delete_event(|_, _| {
    gtk::main_quit();
    Inhibit(false)
});

This code snippet highlights one of the unfortunate parts of using C libraries. At the top, I create a new window, which I declare using let. This should be an immutable reference, but on the next line I can set the window’s title. This is because all of the functions on window are calling out to the C library, and Rust loses sight of what does or does not mutate the variable.

Most interaction in GTK is handled by connecting callbacks to events. For example, I pass a callback to end the program into the window’s delete event. Another example is how I pass a callback to a draw event to actually draw my intonation indicator on the screen.

canvas.connect_draw(move |canvas, context| {
  let width = canvas.get_allocated_width() as f64;

  // For brevity, I've only included the bit that draws the blue
  // 'flat' rectangle. The boilerplate to get the data it's drawing is
  // also omitted.

  let blue = if error < 0.0 {-error as f64/50.0} else {0.0};
  context.set_source_rgb(0.0, 0.0, blue);
  context.rectangle(0.0, line_indicator_height, midpoint,
                    color_indicator_height+line_indicator_height);
  context.fill(); // without the fill command, it draws nothing

  gtk::Inhibit(false)
});

Takeaway Points

The compiler is your friend, even when it doesn’t feel like it

This program is, frankly put, a mess of callbacks. All of the GUI functions require callbacks. PortAudio required a callback. I even have a second thread that I use to make sure that the user interface remains responsive to user input, even if the calculations run for too long.
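
The shape of that is roughly the following (a simplified sketch, not the project’s actual threading code): the heavy work happens on a worker thread, and results flow back to the thread driving the UI through a channel.

use std::sync::mpsc;
use std::thread;

fn main() {
    let (sender, receiver) = mpsc::channel();

    // Worker thread: stands in for the (potentially slow) signal processing.
    thread::spawn(move || {
        for frame in 0..5 {
            // Pretend this is the fundamental frequency of the latest samples.
            sender.send(440.0_f32 + frame as f32).unwrap();
        }
    });

    // Receiving side: stands in for the GUI thread picking up whatever
    // results are ready and drawing them.
    for frequency in receiver {
        println!("drawing a frequency of {}Hz", frequency);
    }
}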

Early on in development, I had a lot of trouble figuring out the architecture to make this work. There was a lot of frustration as the compiler told me the problems with how I was trying to do things. In C or C++, it would have just compiled and I could have had something running much earlier.

I also would have had random crashes as I wasn’t keeping PortAudio active long enough, weird memory corruption because I wasn’t using mutexes between the threads correctly, and MORE weird memory corruption from how I was passing pointers into callbacks.

The compiler, even though it didn’t always feel like my friend at the time, saved me from a lot of future heartache. These delays early on also made me understand Rust better and taught me better practices, so I can spend less time fighting with the compiler in future projects.

Functional Components held together with Non-Functional Glue

I found that for some parts of my Rust code, writing in a more functional style worked well. Not just well: it worked amazingly. Especially in the math-oriented code, avoiding any mutable state made the code both more concise and easier to follow.

However, there were other parts of the code where a functional style did not fit. This is mostly in interactions with the outside world. The steps to set up the microphone, or to create the program’s GUI with all of its components, didn’t lend themselves neatly to pure functions.

Which brings me to a final thought: why not both? It worked well for me to write functional components for processing business logic. Then I called these functions from the non-functional context that was handling the GUI and how to render things. Maybe as I learn more about functional programming and Rust I will change my mind, but for now this feels like a good place to work from.