Tutorial

This section covers the fundamentals of developing with librosa, including a package overview, basic and advanced usage, and integration with the scikit-learn package. We will assume basic familiarity with Python and NumPy/SciPy.

Overview

The librosa package is structured as collection of submodules:

  • librosa

    • librosa.beat

      Functions for estimating tempo and detecting beat events.

    • librosa.core

      Core functionality includes functions to load audio from disk, compute various spectrogram representations, and a variety of commonly used tools for music analysis. For convenience, all functionality in this submodule is directly accessible from the top-level librosa.* namespace.

    • librosa.decompose

      Functions for harmonic-percussive source separation (HPSS) and generic spectrogram decomposition using matrix decomposition methods implemented in scikit-learn.

    • librosa.display

      Visualization and display routines using matplotlib.

    • librosa.effects

      Time-domain audio processing, such as pitch shifting and time stretching. This submodule also provides time-domain wrappers for the decompose submodule.

    • librosa.feature

      Feature extraction and manipulation. This includes low-level feature extraction, such as chromagrams, Mel spectrogram, MFCC, and various other spectral and rhythmic features. Also provided are feature manipulation methods, such as delta features and memory embedding.

    • librosa.filters

      Filter-bank generation (chroma, pseudo-CQT, CQT, etc.). These are primarily internal functions used by other parts of librosa.

    • librosa.onset

      Onset detection and onset strength computation.

    • librosa.segment

      Functions useful for structural segmentation, such as recurrence matrix construction, time-lag representation, and sequentially constrained clustering.

    • librosa.sequence

      Functions for sequential modeling. Various forms of Viterbi decoding, and helper functions for constructing transition matrices.

    • librosa.util

      Helper utilities (normalization, padding, centering, etc.)

Quickstart

Before diving into the details, we’ll walk through a brief example program

 1# Beat tracking example
 2import librosa
 3
 4# 1. Get the file path to an included audio example
 5filename = librosa.example('nutcracker')
 6
 7
 8# 2. Load the audio as a waveform `y`
 9#    Store the sampling rate as `sr`
10y, sr = librosa.load(filename)
11
12# 3. Run the default beat tracker
13tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
14
15print('Estimated tempo: {:.2f} beats per minute'.format(tempo))
16
17# 4. Convert the frame indices of beat events into timestamps
18beat_times = librosa.frames_to_time(beat_frames, sr=sr)

The first step of the program:

filename = librosa.example('nutcracker')

gets the path to an audio example file included with librosa. After this step, filename will be a string variable containing the path to the example audio file.

The second step:

y, sr = librosa.load(filename)

loads and decodes the audio as a time series y, represented as a one-dimensional NumPy floating point array. The variable sr contains the sampling rate of y, that is, the number of samples per second of audio. By default, all audio is mixed to mono and resampled to 22050 Hz at load time. This behavior can be overridden by supplying additional arguments to librosa.load.

Next, we run the beat tracker:

tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

The output of the beat tracker is an estimate of the tempo (in beats per minute), and an array of frame numbers corresponding to detected beat events.

Frames here correspond to short windows of the signal (y), each separated by hop_length = 512 samples. librosa uses centered frames, so that the kth frame is centered around sample k * hop_length.

The next operation converts the frame numbers beat_frames into timings:

beat_times = librosa.frames_to_time(beat_frames, sr=sr)

Now, beat_times will be an array of timestamps (in seconds) corresponding to detected beat events.

The contents of beat_times should look something like this:

7.43
8.29
9.218
10.124
...

Advanced usage

Here we’ll cover a more advanced example, integrating harmonic-percussive separation, multiple spectral features, and beat-synchronous feature aggregation.

 1# Feature extraction example
 2import numpy as np
 3import librosa
 4
 5# Load the example clip
 6y, sr = librosa.load(librosa.ex('nutcracker'))
 7
 8# Set the hop length; at 22050 Hz, 512 samples ~= 23ms
 9hop_length = 512
10
11# Separate harmonics and percussives into two waveforms
12y_harmonic, y_percussive = librosa.effects.hpss(y)
13
14# Beat track on the percussive signal
15tempo, beat_frames = librosa.beat.beat_track(y=y_percussive,
16                                             sr=sr)
17
18# Compute MFCC features from the raw signal
19mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, n_mfcc=13)
20
21# And the first-order differences (delta features)
22mfcc_delta = librosa.feature.delta(mfcc)
23
24# Stack and synchronize between beat events
25# This time, we'll use the mean value (default) instead of median
26beat_mfcc_delta = librosa.util.sync(np.vstack([mfcc, mfcc_delta]),
27                                    beat_frames)
28
29# Compute chroma features from the harmonic signal
30chromagram = librosa.feature.chroma_cqt(y=y_harmonic,
31                                        sr=sr)
32
33# Aggregate chroma features between beat events
34# We'll use the median value of each feature between beat frames
35beat_chroma = librosa.util.sync(chromagram,
36                                beat_frames,
37                                aggregate=np.median)
38
39# Finally, stack all beat-synchronous features together
40beat_features = np.vstack([beat_chroma, beat_mfcc_delta])

This example builds on tools we’ve already covered in the quickstart example, so here we’ll focus just on the new parts.

The first difference is the use of the effects module for time-series harmonic-percussive separation:

y_harmonic, y_percussive = librosa.effects.hpss(y)

The result of this line is that the time series y has been separated into two time series, containing the harmonic (tonal) and percussive (transient) portions of the signal. Each of y_harmonic and y_percussive have the same shape and duration as y.

The motivation for this kind of operation is two-fold: first, percussive elements tend to be stronger indicators of rhythmic content, and can help provide more stable beat tracking results; second, percussive elements can pollute tonal feature representations (such as chroma) by contributing energy across all frequency bands, so we’d be better off without them.

Next, we introduce the feature module and extract the Mel-frequency cepstral coefficients from the raw signal y:

mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, n_mfcc=13)

The output of this function is the matrix mfcc, which is a numpy.ndarray of shape (n_mfcc, T) (where T denotes the track duration in frames). Note that we use the same hop_length here as in the beat tracker, so the detected beat_frames values correspond to columns of mfcc.

The first type of feature manipulation we introduce is delta, which computes (smoothed) first-order differences among columns of its input:

mfcc_delta = librosa.feature.delta(mfcc)

The resulting matrix mfcc_delta has the same shape as the input mfcc.

The second type of feature manipulation is sync, which aggregates columns of its input between sample indices (e.g., beat frames):

beat_mfcc_delta = librosa.util.sync(np.vstack([mfcc, mfcc_delta]),
                                    beat_frames)

Here, we’ve vertically stacked the mfcc and mfcc_delta matrices together. The result of this operation is a matrix beat_mfcc_delta with the same number of rows as its input, but the number of columns depends on beat_frames. Each column beat_mfcc_delta[:, k] will be the average of input columns between beat_frames[k] and beat_frames[k+1]. (beat_frames will be expanded to span the full range [0, T] so that all data is accounted for.)

Next, we compute a chromagram using just the harmonic component:

chromagram = librosa.feature.chroma_cqt(y=y_harmonic,
                                        sr=sr)

After this line, chromagram will be a numpy.ndarray of shape (12, T), and each row corresponds to a pitch class (e.g., C, C#, etc.). Each column of chromagram is normalized by its peak value, though this behavior can be overridden by setting the norm parameter.

Once we have the chromagram and list of beat frames, we again synchronize the chroma between beat events:

beat_chroma = librosa.util.sync(chromagram,
                                beat_frames,
                                aggregate=np.median)

This time, we’ve replaced the default aggregate operation (average, as used above for MFCCs) with the median. In general, any statistical summarization function can be supplied here, including np.max(), np.min(), np.std(), etc.

Finally, the all features are vertically stacked again:

beat_features = np.vstack([beat_chroma, beat_mfcc_delta])

resulting in a feature matrix beat_features of shape (12 + 13 + 13, # beat intervals).

More examples

More example scripts are provided in the advanced examples section.