Streaming for large files

Posted on Mon 29 July 2019 in blog

Librosa was initially designed for processing relatively short fragments of recorded audio, typically not more than a few minutes in duration. While this describes most popular music (our initial target application area), it is a poor description of many other forms of audio, particularly those encountered in bioacoustics and environmental acoustics. In those settings, audio signals are commonly of durations on the order of multiple hours, if not days or weeks. This fact raises an immediate question:

How can I process long audio files with librosa?

This post describes the stream interface adopted in librosa version 0.7, including some background on the overall design of the library and our specific solution to this problem.

How does librosa work?

Before getting into the details of how to handle large files, it will help to understand librosa's data model more generally.

Early in the development of librosa, we made a conscious decision to rely only on numpy datatypes, and not develop a more structured object model. This decision was motivated by several factors, including but not limited to:

  1. ease of implementation,
  2. ease of use,
  3. ease of interoperability with other libraries, and
  4. syntactic similarity to previous MATLAB-based implementations, as well as theoretical (mathematical) definitions.

What this means, is that rather than having object-oriented code like:

x = not_librosa.Audio(filename)  # an object of type not_librosa.Audio
melspec = not_librosa.feature.MelSpectrogram(x)  # an object of type not_librosa.feature.MelSpectrogram

you instead get a more procedural style:

y, sr = librosa.load(filename)  # a numpy array (and its sampling rate)
melspec = librosa.feature.melspectrogram(y=y, sr=sr)  # another numpy array

As a result, it's fairly easy to move data out of librosa and into other Python packages. Back in 2012, theano and scikit-learn were the prime targets; these days it's more likely tensorflow or pytorch, but the principle is the same. Having our own object interface to audio and features would get in the way, even if it would have made some design choices easier.

What's the problem with large files?

Both approaches (object-oriented and procedural) described above have their pros and cons. The pros of the procedural approach are listed above, but one of the drawbacks that we inherit from numpy is that the entire input y must be constructed before melspec can be produced. This is ultimately a limitation of the numpy.ndarray type, which is explicitly designed as a container for fixed-length, contiguous regions of memory with consistent underlying data types (e.g., int or float). For short recordings --- our most common case in librosa --- this is fine: recordings typically fit in memory and are known in advance. However, when the audio you want to analyze is long (e.g., hours) or streaming from a recording device, ndarray is not an appropriate container type. We knew this in 2012, but decided to optimize for the common case and deal with the fallout later.

Now, if we had gone for an object-oriented interface, we could have handled these problems in a variety of ways. For instance, it would have been easy to abstract away time-indexing logic so that data is only loaded when it's requested, e.g., Audio.get_buffer(time=SOME_NUMBER, duration=SOME_NUMBER). Or we could have used the object's internal state to maintain a buffer in memory, but not load the entire recording from storage, and provide an interface to seek to a specific time position in the signal. Various other libraries implement these kinds of solutions, and they can work great! But they do come with a bit of additional interface complexity, and might limit interoperability.

Streaming and generators

The solution that we ultimately went with in version 0.7 is to use Python generators. Rather than load the entire signal at once, we rely on soundfile to produce a sequence of fragments of the signal, which are then passed back to the user. At a high level, we would like to have an interface of the form:

for y in librosa.stream(filename, some parameters):
    process(y)

where y now refers to a short excerpt of the much longer recording in question. However, this raises a few more questions:

  1. How big of an excerpt should we use?
  2. How do two neighboring excerpts relate to each other?
  3. Can this be used with every function in librosa?

To dig into those, we have to think a bit more about how librosa represents data.

Samples, frames, and blocks

An audio buffer y is typically viewed as a sequence of discrete samples y[0], y[1], y[2], .... Most audio analyses operate at the level of frames of audio, for instance, taking y[0] ... y[2047] as one frame, followed by y[512] ... y[2047 + 512], and so on. Each frame here consists of exactly 2048 samples, and the time difference from one frame to the next is always 512 samples. (These are just the default parameters, of course.)
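To make the indexing concrete, here is a toy sketch (plain Python, not librosa's internals; the frame_bounds helper is hypothetical) of where frame k falls in the signal:

```python
# Toy illustration of frame indexing (not librosa's implementation).
frame_length = 2048  # samples per frame (librosa's default)
hop_length = 512     # samples between frame starts (librosa's default)

def frame_bounds(k, frame_length=frame_length, hop_length=hop_length):
    """Return the [start, stop) sample range covered by frame k."""
    start = k * hop_length
    return start, start + frame_length

print(frame_bounds(0))  # (0, 2048): samples y[0] ... y[2047]
print(frame_bounds(1))  # (512, 2560): samples y[512] ... y[2047 + 512]
```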

For most analysis cases, e.g. those based on the short-time Fourier transform, frames are modeled independently from one another. This means that it would be completely valid to process one frame entirely before moving on to the next; and indeed, many implementations operate in exactly this fashion. However, this can also be inefficient because it makes poor use of memory locality, as well as data- and algorithm-parallelism. It is generally more efficient, especially in Python/numpy, to operate on multiple frames simultaneously. This naturally incurs some latency while buffering data, but the end-result leads to improved throughput.

Now, a naive solution here would be to simply load a relatively long fragment y consisting of multiple frames, and process them in parallel before moving on to the next fragment. The tricky part is handling the boundaries correctly. If the hop length (number of samples between frames) is identical to the frame length (number of samples in each frame), then frames do not overlap, and we will not get into trouble by processing data in this way. However, if frames can overlap in time, then so should the longer fragments if we are to get the same answer at the end of the day. This is where we need to be a bit careful.


The solution we adopted in librosa 0.7 is the notion of a block, which is defined in terms of the number of frames, the frame length, and the hop length between frames. Blocks overlap in exactly the same way that frames normally would: by frame_length - hop_length samples.

To make this concrete, imagine that we have a frame length of 100 samples, a hop length of 25 samples, and a block size of 3 frames. The first few frames would look as follows:

  • y[0:100]
  • y[25:125]
  • y[50:150]
  • y[75:175]
  • y[100:200]
  • y[125:225]

The first block then covers samples y[0:150]. The second block covers samples y[75:225], and so on. The result here is that each frame belongs to exactly one block (and appears exactly once), but any given sample can occur in multiple blocks.
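The arithmetic above can be checked with a small sketch (plain Python, not librosa's implementation; block_bounds is a hypothetical helper):

```python
# Toy check of the block arithmetic above (not librosa's implementation).
frame_length = 100   # samples per frame
hop_length = 25      # samples between frame starts
block_length = 3     # frames per block

def block_bounds(b):
    """Return the [start, stop) sample range covered by block b."""
    # Block b begins at frame index b * block_length, and spans
    # block_length frames, each offset by hop_length samples.
    start = b * block_length * hop_length
    stop = start + frame_length + (block_length - 1) * hop_length
    return start, stop

print(block_bounds(0))  # (0, 150)
print(block_bounds(1))  # (75, 225)

# Consecutive blocks overlap just like consecutive frames do:
overlap = block_bounds(0)[1] - block_bounds(1)[0]
print(overlap)  # 75 == frame_length - hop_length
```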

The block interface is provided by the new librosa.stream function, which is used as follows:

filename = librosa.util.example_audio_file()
sr = librosa.get_samplerate(filename)
stream = librosa.stream(filename,
                        block_length=256,
                        frame_length=4096,
                        hop_length=1024)
for y_block in stream:
    # Process y_block

There are a few things to be aware of when using stream processing in librosa.

First, following on our previous post, librosa.load will (by default) resample the input signal to a given sampling rate. However, this resampling operation needs access to the full signal (or at least quite a bit of the future) to work well, so resample-on-load is not supported in streaming. Practically, this means that you'll need to be aware of your sampling rate and analysis parameters in advance, and be sure to carry them over across all downstream processing.

Second, librosa's analyses are frame-centered by default. This means that when you compute, say, D = librosa.stft(y), the kth column D[:, k] covers a frame which is centered around sample y[k * hop_length]. To do this, the signal is padded on the left (and right) so that D[:, 0] is centered at sample y[0]. This will cause trouble if you call librosa.stft(y_block), since the beginning (and end) of each block will be padded, where they would not have been had the entire sequence been provided to stft at once. Consequently, librosa does not support frame-centered analysis in streaming mode: frames are assumed to start at sample y[k * hop_length] rather than be centered around it.
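The mismatch can be seen with a little frame-count arithmetic (a sketch following librosa's framing conventions; n_frames is a hypothetical helper, and the signal lengths are just examples):

```python
# Frame-count arithmetic for centered vs. left-aligned analysis (a sketch).
def n_frames(n_samples, frame_length, hop_length, center):
    if center:
        # The signal is padded by frame_length // 2 on both sides,
        # so a frame lands on every hop_length-th sample.
        return 1 + n_samples // hop_length
    # Only complete, unpadded frames are produced.
    return 1 + (n_samples - frame_length) // hop_length

frame_length, hop_length = 2048, 512

# One long signal analyzed whole, vs. the same signal in two halves,
# each padded independently as centered analysis would do:
whole = n_frames(20480, frame_length, hop_length, center=True)
halves = 2 * n_frames(10240, frame_length, hop_length, center=True)
print(whole, halves)  # 41 vs. 42: padding at the seam adds a spurious frame
```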

As a general rule, always remember to include center=False when doing stream-based analysis:

for y_block in stream:
    D_block = librosa.stft(y_block, n_fft=4096, hop_length=1024, center=False)

and of course, be sure to match the frame and hop lengths to your block parameters.
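As a quick consistency check (a sketch; the parameter values echo the example above but are otherwise arbitrary), a block of block_length frames analyzed with center=False yields exactly block_length STFT columns:

```python
# Sanity check (a sketch): a block built from `block_length` frames
# produces exactly `block_length` frames under center=False analysis.
frame_length = 4096
hop_length = 1024
block_length = 256  # frames per block

# Total samples delivered per (full) block:
block_samples = frame_length + (block_length - 1) * hop_length

# Left-aligned (center=False) frame count for that many samples:
frames_per_block = 1 + (block_samples - frame_length) // hop_length
print(frames_per_block)  # 256
```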

What does and does not work?

Not all analyses support stream processing. For instance, anything that requires total knowledge of a sequence, such as recurrence matrix generation, will clearly not work. A bit more subtle are methods that rely on resampling, such as librosa.cqt.

However, any STFT-based analysis (such as most of the librosa.feature module) should work fine, and this already covers a large proportion of use cases.

The example gallery includes a notebook which demonstrates how to do stream-based processing with STFT and pcen normalization.


Block-based processing allows some, but not all, of librosa's functionality to be applied easily to large audio files.

While, in principle, this could also be applied to online streaming from live recording devices, we don't yet have a stable underlying implementation to rely upon for this, and hesitate to make any general recommendations.

If, at some point in the future, streaming sample rate conversion becomes viable, we will look at relaxing some of the constraints around resampling (e.g., on load or within cqt).