This section provides an overview of how multi-channel signals are handled in librosa. The one-sentence summary is that most of the functions which only supported single-channel inputs up to librosa 0.8 now support multi-channel audio with no modification necessary.
Before discussing multi-channel, it is worth reviewing how single-channel (monaural)
signals are processed.
Librosa processes all signals and derived data as
numpy.ndarray (N-dimensional array) objects.
By default, when librosa loads a multichannel signal, it averages all channels to produce a mono mixture.
The resulting object is a 1-dimensional array of shape (N_samples,), which
represents the time-series of sample values.
Subsequent processing typically produces higher-dimensional transformations of the
time-series; for example, a short-time Fourier transform (librosa.stft) produces
a two-dimensional array of shape (N_frequencies, N_frames).
Note that the second (trailing) dimension corresponds to the number of frames in the
signal, which is proportional to length; the first dimension corresponds to the
number of frequencies (or more generally, features) measured at each frame.
When working with multi-channel signals, we may choose to skip the default down-mixing
step by specifying
mono=False in the call to
librosa.load, as in the following:
import librosa

# Get the "high-quality" multi-channel version of
# an example track
filename = librosa.ex('trumpet', hq=True)

# Load as multi-channel data
y_stereo, sr = librosa.load(filename, mono=False)
The resulting object now has two dimensions instead of one, with
y_stereo.shape == (N_channels, N_samples).
This way, we can access the first channel as
y_stereo[0], the second channel as
y_stereo[1], and so on if there are more than two channels.
Librosa represents data according to the following general pattern:
trailing dimensions correspond to time (samples, frames, etc.)
leading dimensions may correspond to channels
intermediate dimensions correspond to “features” (or frequencies, harmonics, etc).
This pattern is designed so that indexing is consistent when slicing out individual channels. This is demonstrated in the examples below.
As a first example, consider computing a short-time Fourier transform of the stereo example signal loaded above. This is accomplished in exactly the same way as if the signal were mono, that is:
D_stereo = librosa.stft(y_stereo)
The shape of the resulting STFT is
D_stereo.shape == (N_channels, N_frequencies, N_frames).
D_stereo[0] then corresponds to the STFT of the first channel
y_stereo[0],
D_stereo[1] is the STFT of the second channel
y_stereo[1], and so on.
As a more advanced example, we can construct a multi-channel, harmonic spectrogram:
import numpy as np

S_stereo = np.abs(D_stereo)

# Get the default Fourier frequencies
freqs = librosa.fft_frequencies(sr=sr)

# We'll interpolate the first five harmonics of each frequency
harmonics = [1, 2, 3, 4, 5]
S_harmonics = librosa.interp_harmonics(S_stereo, freqs=freqs, h_range=harmonics)
The resulting object has four dimensions now:
S_harmonics.shape == (N_channels,
N_harmonics, N_frequencies, N_frames).
As noted above, the leading dimension corresponds to channels, the trailing
dimension corresponds to time (frames), and the intermediate dimensions correspond
to derived features.
In this way, indexing a specific channel (e.g.,
S_harmonics[1] for the second
channel) provides the entire feature array derived from the second channel, and
produces an output of shape
(N_harmonics, N_frequencies, N_frames).
When reading the library documentation, you may come across functions like
librosa.stft which describe the input signal parameter as:
y : np.ndarray [shape=(..., n)], real-valued
The “…” here is analogous to Python’s Ellipsis object, and in this context, it acts as a place-holder for “0 or more dimensions”.
This is analogous to numpy’s use of Ellipsis to bypass variable numbers of
dimensions.
For example, to slice a single frame
n out of the multi-channel harmonic spectrogram
S_harmonics above, you could do either:
S_harmonics[:, :, :, n]
or:
S_harmonics[..., n]
The latter is generally preferred as it generalizes to arbitrarily many leading dimensions.
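To see why Ellipsis indexing generalizes, here is a small numpy-only sketch. The array S_demo is a hypothetical stand-in with the same axis layout as a harmonic spectrogram; its sizes are arbitrary:

```python
import numpy as np

# Hypothetical array laid out as
# (N_channels, N_harmonics, N_frequencies, N_frames)
S_demo = np.arange(2 * 5 * 8 * 10, dtype=float).reshape(2, 5, 8, 10)
n = 3

frame_explicit = S_demo[:, :, :, n]  # spells out every leading axis
frame_ellipsis = S_demo[..., n]      # works for any number of leading axes
print(frame_ellipsis.shape)

# The same expression still works on an array with fewer dimensions
S_mono_demo = S_demo[0, 0]           # (N_frequencies, N_frames)
frame_mono = S_mono_demo[..., n]
print(frame_mono.shape)
```

The explicit form would raise an IndexError on S_mono_demo, because it names more axes than that array has.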
Whenever functions are described as accepting shapes containing “…”, the implication is that the (arbitrarily many) leading dimensions are preserved in the output unless otherwise stated.
Some functions accept an
axis= parameter to specify a target axis along which to operate.
As a general convention,
axis=-1 (the final axis) usually corresponds to “time”
(or samples, or frames), while
axis=-2 (the second-to-last axis) usually
corresponds to “frequency” or some other derived feature.
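The benefit of negative axis indices is that the same code works for mono and multi-channel data. The helpers below (time_average and peak_bin) are hypothetical names used only for illustration; they rely solely on numpy reductions:

```python
import numpy as np

def time_average(S):
    """Average over the trailing (time) axis."""
    return S.mean(axis=-1)

def peak_bin(S):
    """Index of the largest value along the second-to-last (frequency) axis."""
    return S.argmax(axis=-2)

rng = np.random.default_rng(0)
S_mono_demo = rng.random((8, 10))        # (N_frequencies, N_frames)
S_stereo_demo = rng.random((2, 8, 10))   # (N_channels, N_frequencies, N_frames)

print(time_average(S_mono_demo).shape, time_average(S_stereo_demo).shape)
print(peak_bin(S_mono_demo).shape, peak_bin(S_stereo_demo).shape)
```

In both helpers the leading channel axis, when present, passes through untouched.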
Not all functions in librosa naturally generalize to multi-channel data, though most do. Similarly, some functions do generalize, but in ways that may not match your expectations. This section briefly summarizes places where multi-channel support is limited.
Detectors with ragged output, for example beat tracking (librosa.beat.beat_track)
and onset detection (librosa.onset.onset_detect), do not support multi-channel inputs.
This is because the output may have differing numbers of events in each channel, and
therefore cannot be consistently stored in a
numpy.ndarray output object.
In these cases, it is best to either process each channel separately (if they are
truly independent) or aggregate representations across channels (e.g., by averaging
features) if they are strongly related.
Self- and cross-similarity matrices, as computed by
librosa.segment.recurrence_matrix, have limited multi-channel support.
This is because the output objects may be sparse data structures (such as
scipy.sparse.csr_matrix) which do not generalize to more than two dimensions.
These functions still accept multi-channel input, but flatten the leading dimensions
(channels) when comparing features between different time-steps.
If independent similarity matrices are desired, it is recommended to process each
channel independently.
Decompositions and sequence alignments, like similarity matrices, have limited
multi-channel support.
Harmonic-percussive source separation (
librosa.decompose.hpss) can fully accept
multi-channel input with independent processing, but other decomposition
functions (librosa.decompose.decompose) impose some
restrictions on how multi-channel inputs are processed.
Sequence alignment functions like librosa.sequence.dtw and
librosa.sequence.rqa operate much like similarity matrix functions, and interpret
leading dimensions as additional “feature” dimensions which are flattened prior to
comparing time-steps.
Display functions have limited multi-channel support.
librosa.display.waveshow can accept single or 2-channel input, though the second
channel is only used when zoomed out to envelope mode.
librosa.display.specshow does not accept multi-channel input.
Advanced uses and caveats
Multi-channel support is relatively flexible in librosa.
In particular, you may organize channels over two dimensions or more, although a
single channel dimension is the most common use case.
For example, if you want to simultaneously process a collection of stereo recordings
of equal length, you may collect the signals into an array of shape
(N_tracks, N_channels, N_samples).
Any derived data (e.g. spectrograms like in the example above) would then have two
leading dimensions, corresponding first to track and then to channel within the
track.
In theory, any number of leading dimensions can be used, though caution should be
exercised to minimize memory consumption.
Note that although many functions preserve channel independence, this is not
guaranteed in general.
For example, decibel scaling by
librosa.amplitude_to_db will compare each channel
to a reference value which may be derived from all channels simultaneously.
This can lead to differences in behavior when processing channels independently or
simultaneously as a multi-channel input.
Functions which guarantee channel-wise independence are documented accordingly.