librosa.stft

librosa.stft(y, *, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=None, pad_mode='constant', out=None)[source]

Short-time Fourier transform (STFT).

The STFT represents a signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short overlapping windows.

This function returns a complex-valued matrix D such that

np.abs(D[..., f, t]) is the magnitude of frequency bin f at frame t, and
np.angle(D[..., f, t]) is the phase of frequency bin f at frame t.

The integers t and f can be converted to physical units by means of the utility functions frames_to_samples and fft_frequencies.

Parameters:

ynp.ndarray [shape=(…, n)], real-valued

input signal. Multi-channel is supported.

n_fftint > 0 [scalar]

length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa. This value is well adapted for music signals. However, in speech processing, the recommended value is 512, corresponding to 23 milliseconds at a sample rate of 22050 Hz. In any case, we recommend setting n_fft to a power of two for optimizing the speed of the fast Fourier transform (FFT) algorithm.

hop_lengthint > 0 [scalar]

number of audio samples between adjacent STFT columns.

Smaller values increase the number of columns in D without affecting the frequency resolution of the STFT.

If unspecified, defaults to win_length // 4 (see below).

win_lengthint <= n_fft [scalar]

Each frame of audio is windowed by window of length win_length and then padded with zeros to match n_fft. Padding is added on both the left- and the right-side of the window so that the window is centered within the frame.

Smaller values improve the temporal resolution of the STFT (i.e. the ability to discriminate impulses that are closely spaced in time) at the expense of frequency resolution (i.e. the ability to discriminate pure tones that are closely spaced in frequency). This effect is known as the time-frequency localization trade-off and needs to be adjusted according to the properties of the input signal y.

If unspecified, defaults to win_length = n_fft.

windowstring, tuple, number, function, or np.ndarray [shape=(n_fft,)]

Either:

a window specification (string, tuple, or number); see scipy.signal.get_window
a window function, such as scipy.signal.windows.hann
a vector or array of length n_fft

Defaults to a raised cosine window (‘hann’), which is adequate for most applications in audio signal processing.

centerboolean

If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length].

If False, then D[:, t] begins at y[t * hop_length].

Defaults to True, which simplifies the alignment of D onto a time grid by means of librosa.frames_to_samples. Note, however, that center must be set to False when analyzing signals with librosa.stream.

dtypenp.dtype, optional

Complex numeric type for D. Default is inferred to match the precision of the input signal.

pad_modestring or function

If center=True, this argument is passed to np.pad for padding the edges of the signal y. By default (pad_mode="constant"), y is padded on both sides with zeros.

Note

Not all padding modes supported by numpy.pad are supported here. wrap, mean, maximum, median, and minimum are not supported.

Other modes that depend at most on input values at the edges of the signal (e.g., constant, edge, linear_ramp) are supported.

If center=False, this argument is ignored.

outnp.ndarray or None

A pre-allocated, complex-valued array to store the STFT results. This must be of compatible shape and dtype for the given input parameters.

If out is larger than necessary for the provided input signal, then only a prefix slice of out will be used.

If not provided, a new array is allocated and returned.

Returns:

Dnp.ndarray [shape=(…, 1 + n_fft/2, n_frames), dtype=dtype]

Complex-valued matrix of short-term Fourier transform coefficients.

If a pre-allocated out array is provided, then D will be a reference to out.

If out is larger than necessary, then D will be a sliced view: D = out[…, :n_frames].