Why resample on load?

Posted on Wed 17 July 2019 in blog

One of the questions that I get most often has to do with how librosa handles loading of audio data, specifically,

Why does librosa always resample to 22050 Hz when I load a file?

This is an entirely reasonable question, and the answer isn't necessarily obvious. Rather than bury the explanation in the API documentation, I'm putting the explanation here in blog form.

What is a sampling rate?

Before diving into the details, we first need to get on the same page about what a sampling rate is. Audio in the real world happens in continuous time, but computers don't have infinite precision, so we approximate continuous signals by collections of discrete samples. The sampling rate --- typically $f_s$ in the digital signal processing literature, or sr in librosa --- is defined as $1/t_s$, where $t_s$ is the amount of time (in seconds) between successive samples. Equivalently, $f_s$ is the number of observations per second in the discretely sampled signal. It's a basic fact, a theorem due to Nyquist and Shannon, that if a continuous signal has no content above some frequency $f$, then a sampling rate $f_s \geq 2f$ suffices to reconstruct the signal without introducing aliasing artifacts. Typically we go the other way: fix a sampling rate $f_s$, and then filter the signal to eliminate any content above $f_s/2$ before sampling.

For a fixed sampling rate $f_s$, a digital signal is represented as a sequence of samples: $y[n]$ (for $n = 0, 1, 2, \dots$), where the $n^\text{th}$ sample corresponds to the value of the signal at time $t = \frac{n}{f_s}$. This gives a general rule for converting between units of samples and units of time, which is helpful to have in the back of your head when reasoning about software interface design later on.
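To make that rule concrete, here's a minimal sketch of the conversion, done both by hand and with librosa's samples_to_time and time_to_samples helpers:

import librosa

sr = 22050  # samples per second
n = 11025   # a sample index

print(n / sr)                               # 0.5 (seconds)
print(librosa.samples_to_time(n, sr=sr))    # 0.5
print(librosa.time_to_samples(0.5, sr=sr))  # 11025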

Compact discs (remember those?) used a standard sampling rate of 44100 Hz. This is partly because typical human perception tops out around 20000 Hz (hence $f_s \geq 40000$), and partly due to historical accidents. Unlike physical media (CDs), digital audio files (.WAV, .MP3, and so on) can have arbitrary sampling rates. Rates of 16000, 22050, 32000, 44100, and 48000 are all relatively common, and you can't rely on consistency from one file to the next. Fortunately, sample-rate conversion (or resampling) methods allow us to change the sampling rate of a digital signal as needed.
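For instance, here's a minimal resampling sketch, using a synthetic tone as a stand-in for real audio:

import numpy as np
import librosa

# One second of a 440 Hz tone at the CD rate
sr_orig = 44100
y = np.sin(2 * np.pi * 440 * np.arange(sr_orig) / sr_orig)

# Convert down to librosa's default rate; the result has half as many samples
y_new = librosa.resample(y, orig_sr=sr_orig, target_sr=22050)
print(len(y), len(y_new))  # 44100 22050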

Why not use the file's native sampling rate?

When designing the librosa API, we had a few goals that weren't always in agreement.

First, we wanted it to be relatively simple to use, and have consistent default parameters shared across all functions. As Gael Varoquaux reiterated in his keynote at SciPy 2017: consistency, consistency, consistency! Having standardized default parameters means that a user is less likely to be surprised by unexpected behavior when working with different parts of the library. This makes the software easier to learn and use.

Second, we wanted the default parameters, such as frame length (number of samples in one frame of a short-time Fourier transform), to be expressed naturally. In audio signal processing, there are two ways this could have gone: either specify the frame length as a duration (in seconds), or as a number of samples.

  • Expressing frame length as a duration is nice because it uses real, physical units and is independent of the sampling rate: a 1-second frame occupies the same amount of "content", whether the sampling rate is 8000 or 16000 Hz. However, this would mean that the same function applied to two signals with different sampling rates would produce outputs of different dimensionality, and would therefore not be directly comparable.

  • Expressing frame length as a number of samples, on the other hand, always produces outputs of comparable dimension. However, the meaning of the contents can change, depending on the sampling rate. As it turns out, designing for dimensional compatibility is much more convenient when you consider that subsequent processing stages will need to know the dimension of the input data to operate correctly. It's easier to fix the sampling rate first, and then design around that, than vice versa. (The sketch after this list illustrates the trade-off.)

  • Expressing frame length in terms of samples has the added bonus that we can design for efficiently calculable Fourier transforms. Most fast Fourier transform (FFT) implementations work best when the number of samples is an integral power of 2, and worst when the number is a large prime. Although the latter case is unlikely in general, defining frames in terms of samples leaves us in a better position to guarantee efficient implementation.
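Here's the dimensionality point in code: with a fixed frame length in samples, two signals at different rates yield spectrograms with the same number of frequency bins, while each frame spans a different duration. A sketch, using synthetic tones:

import numpy as np
import librosa

# One second of a 440 Hz tone at two different sampling rates
y_low = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)     # sr = 8000
y_high = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # sr = 16000

# Fixed n_fft in samples: both spectrograms have 1 + 2048 // 2 = 1025 bins,
# so downstream stages see the same feature dimension...
S_low = librosa.stft(y_low, n_fft=2048)
S_high = librosa.stft(y_high, n_fft=2048)
print(S_low.shape[0], S_high.shape[0])  # 1025 1025

# ...but each frame now covers a different duration:
print(2048 / 8000, 2048 / 16000)  # 0.256 s versus 0.128 s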

Third, we wanted to minimize the chance of users (i.e., myself) making simple mistakes by not accounting for the sampling rate. An analysis script, once written, should behave consistently across different input signals, and not depend strongly on the exact sampling rate. In practice, every analysis script began by standardizing the sampling rate of a file immediately after loading it; since that's the most common pattern when dealing with collections of audio, it made sense to combine the two steps into one (by default).

After a bit of discussion, we pretty quickly decided that resample-on-load was the best compromise available for achieving consistency and simplicity at the API level.

Okay... but why 22050 Hz? Why not 44100 or 48000?

It's true: 44100 Hz is essentially the standard for "high (enough) quality" audio storage, and it would have been a sensible default.

However, we opted for the lower rate of 22050 Hz for two reasons:

  1. It cuts down on memory consumption, and
  2. 44100 Hz was overkill for our most common tasks.

The first point is obvious, but the second point deserves a bit more discussion.
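To quantify the first point anyway (plain back-of-the-envelope arithmetic, not librosa code):

# One minute of mono float32 audio (librosa's working dtype) at each rate
seconds, bytes_per_sample = 60, 4

for sr in (44100, 22050):
    mib = seconds * sr * bytes_per_sample / 2**20
    print(f"{sr} Hz: {mib:.1f} MiB per minute")
# 44100 Hz: 10.1 MiB per minute
# 22050 Hz: 5.0 MiB per minute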

When we were initially developing librosa, our main use cases were analyzing corpora of old jazz recordings, music more generally, and speech signals. While humans (young ones, anyway) can hear up to around 20000 Hz, it's possible to successfully analyze music and speech data at much lower rates without sacrificing much. The highest pitches we usually care about detecting are around $C_9 \approx 8372~\text{Hz}$, well below the 11025 Hz Nyquist cutoff implied by $f_s = 22050$. There's certainly content above 11025 Hz, but it often turns out to be noisy or redundant with the lower parts of the spectrum, and not so informative for semantic analysis tasks like instrument classification, rhythm analysis, chord recognition, and so on.
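A quick sanity check of that claim, using librosa's note_to_hz helper:

import librosa

sr = 22050
nyquist = sr / 2               # 11025.0 Hz
c9 = librosa.note_to_hz('C9')  # about 8372 Hz
print(c9 < nyquist)            # True: C9 survives sampling at 22050 Hz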

I still don't like it. What can I do?

Sure, 22050 isn't right for every situation. It's a default setting, but not a requirement!

You have a few options. First, you can always bypass resample-on-load by specifying sr=None:

y, sr = librosa.load(filename, sr=None)  # sr=None keeps the file's native rate

You will need to remember to pass sr around to all relevant functions, and make sure your frame and hop lengths are tuned accordingly.
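For example, continuing from the snippet above, a sketch of threading sr through downstream calls (beat_track and frames_to_time both accept it):

# Track beats at the native rate, then convert frame indices to seconds;
# both steps need the same sr to agree on what a "frame" means
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beats, sr=sr, hop_length=512)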

A slightly fancier alternative is to use the presets package, as illustrated in the example gallery, to change the defaults globally. This approach uses some pythonic hackery to intercept function calls into a package (like librosa, but it works more generally), and gives you the option to override default parameter values. The end result is not so different from carrying the specific sr value around with you, but it does make for slightly cleaner code, since the defaults can all be set once in a preamble rather than replicated everywhere. For example:

from presets import Preset
import librosa as _librosa

# Wrap the librosa module so that default parameter values can be overridden
librosa = Preset(_librosa)

# Any function with a matching parameter name now uses these defaults
librosa['sr'] = 44100
librosa['n_fft'] = 4096
librosa['hop_length'] = 1024

would effectively change the default sampling rate to 44100, and double the frame and hop lengths n_fft and hop_length from their standard default values. These new defaults would persist throughout your coding session.
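With those presets in place, subsequent calls pick up the new defaults automatically; for example (using a hypothetical song.wav):

y, sr = librosa.load('song.wav')  # no sr argument needed
print(sr)                         # 44100
D = librosa.stft(y)               # uses n_fft=4096, hop_length=1024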

Summary

Resample-on-load was ultimately a usability choice. We felt that the initial computational effort at load time was a worthwhile trade-off if it could simplify software usage without sacrificing accuracy or quality in the most common cases.