Caution

You're reading an old version of this documentation. If you want up-to-date information, please have a look at 0.10.2.

Note

Go to the end to download the full example code.

Viterbi decoding

This notebook demonstrates how to use Viterbi decoding to impose temporal smoothing on frame-wise state predictions.

Our working example will be the problem of silence/non-silence detection.

# Code source: Brian McFee
# License: ISC

##################
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import librosa

import librosa.display

Load an example signal

y, sr = librosa.load(librosa.ex('trumpet'))


# And compute the spectrogram magnitude and phase
S_full, phase = librosa.magphase(librosa.stft(y))


###################
# Plot the spectrum
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                               y_axis='log', x_axis='time', sr=sr, ax=ax)
fig.colorbar(img, ax=ax);

As you can see, there are periods of silence and non-silence throughout this recording.

# As a first step, we can plot the root-mean-square (RMS) curve
rms = librosa.feature.rms(y=y)[0]

times = librosa.frames_to_time(np.arange(len(rms)))

fig, ax = plt.subplots()
ax.plot(times, rms)
ax.axhline(0.02, color='r', alpha=0.5)
ax.set(xlabel='Time', ylabel='RMS');

The red line at 0.02 indicates a reasonable threshold for silence detection. However, the RMS curve occasionally dips below the threshold momentarily, and we would prefer the detector to not count these brief dips as silence. This is where the Viterbi algorithm comes in handy!

As a first step, we will convert the raw RMS score into a likelihood (probability) by logistic mapping

\(P[V=1 | x] = \frac{\exp(x - \tau)}{1 + \exp(x - \tau)}\)

where \(x\) denotes the RMS value and \(\tau=0.02\) is our threshold. The variable \(V\) indicates whether the signal is non-silent (1) or silent (0).

We’ll normalize the RMS by its standard deviation to expand the range of the probability vector

r_normalized = (rms - 0.02) / np.std(rms)
p = np.exp(r_normalized) / (1 + np.exp(r_normalized))

We can plot the probability curve over time:

fig, ax = plt.subplots()
ax.plot(times, p, label='P[V=1|x]')
ax.axhline(0.5, color='r', alpha=0.5, label='Descision threshold')
ax.set(xlabel='Time')
ax.legend();

which looks much like the first plot, but with the decision threshold shifted to 0.5. A simple silence detector would classify each frame independently of its neighbors, which would result in the following plot:

plt.figure(figsize=(12, 6))
fig, ax = plt.subplots(nrows=2, sharex=True)
librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr, ax=ax[0])
ax[0].label_outer()
ax[1].step(times, p>=0.5, label='Non-silent')
ax[1].set(ylim=[0, 1.05])
ax[1].legend()

We can do better using the Viterbi algorithm. We’ll use state 0 to indicate silent, and 1 to indicate non-silent. We’ll assume that a silent frame is equally likely to be followed by silence or non-silence, but that non-silence is slightly more likely to be followed by non-silence. This is accomplished by building a self-loop transition matrix, where transition[i, j] is the probability of moving from state i to state j in the next frame.

transition = librosa.sequence.transition_loop(2, [0.5, 0.6])
print(transition)

[[0.5 0.5]
 [0.4 0.6]]

Our p variable only indicates the probability of non-silence, so we need to also compute the probability of silence as its complement.

full_p = np.vstack([1 - p, p])
print(full_p)

[[0.13145119 0.12914789 0.10023135 0.08308673 0.0896498  0.09849459
11708623 0.15954721 0.15366054 0.10710979 0.07477838 0.05496722
0554986  0.07781726 0.11711758 0.16114128 0.18928099 0.21416843
21213973 0.19609505 0.20449305 0.2270965  0.27209795 0.3001222
17316502 0.09887975 0.06782663 0.05833912 0.07995546 0.13456148
20417827 0.24610686 0.2866913  0.28002036 0.27008408 0.27990294
3091812  0.3000993  0.1969521  0.12827969 0.08002245 0.05841959
07621807 0.11728209 0.20397991 0.24574691 0.23458302 0.18702763
13608563 0.14634806 0.12772375 0.1288073  0.1530726  0.17516875
2165522  0.27420092 0.33675003 0.43482685 0.48404706 0.25782114
17491055 0.16958952 0.15419865 0.19936448 0.28010595 0.27197605
2870167  0.31801683 0.31980097 0.3321712  0.35508192 0.3624373
35120147 0.31828165 0.29999107 0.29278505 0.30191094 0.33024365
36096853 0.40480107 0.45284116 0.49093616 0.54277027 0.56993824
5704266  0.5268004  0.25468153 0.14759523 0.09926701 0.07283568
07515341 0.08709419 0.10409516 0.14761192 0.21817732 0.32771122
5189303  0.55528986 0.56176543 0.5374347  0.37758958 0.25777513
19084197 0.14156103 0.11547786 0.10913092 0.10770041 0.1312679
16778678 0.19980133 0.21077996 0.1821512  0.16594243 0.16683942
17637831 0.188272   0.20692706 0.21882313 0.23622775 0.2530697
25842535 0.26530695 0.27669293 0.2868473  0.30373114 0.32258862
33911967 0.36671662 0.40503937 0.44620156 0.5000478  0.549015
5662594  0.57089055 0.5719327  0.5729972  0.5738338  0.57463896
57614064 0.57710177 0.57834494 0.5796151  0.5806234  0.5815384
5821718  0.5833256  0.5844451  0.58522177 0.5857115  0.58568585
5859337  0.58618915 0.58630896 0.5865722  0.5867528  0.586874
58707404 0.587249   0.5873786  0.5875926  0.5878352  0.5880511
588247   0.5883261  0.5883925  0.5884396  0.58849066 0.5886303
58867866 0.5886854  0.5886949  0.5887034  0.5887745  0.5888357
5888435  0.58883864 0.58883625 0.58885485 0.58887374 0.5888771
58888704 0.5889026  0.5889103  0.5889286  0.5889324  0.58894205
5889672  0.5889791  0.5889935  0.58900595 0.5890088  0.5890169
58902466 0.5890299  0.58903694 0.5890435  0.5890446  0.58903813
58903635 0.5890361  0.58903944 0.58904475 0.5890448  0.5890424
5890394  0.5890368  0.5890384  0.5890396  0.5890429  0.5890487
5890497  0.5890506  0.589049   0.58904666 0.5890452  0.5890439
58904463 0.5890455  0.58904576 0.5890466  0.5890502  0.5890561
58905953 0.589064   0.5890665  0.58906525 0.5890669  0.589068
5890698  0.5890719 ]
 [0.8685488  0.8708521  0.89976865 0.9169133  0.9103502  0.9015054
88291377 0.8404528  0.84633946 0.8928902  0.9252216  0.9450328
9445014  0.92218274 0.8828824  0.8388587  0.810719   0.7858316
7878603  0.80390495 0.79550695 0.7729035  0.72790205 0.6998778
826835   0.90112025 0.9321734  0.9416609  0.92004454 0.8654385
7958217  0.75389314 0.7133087  0.71997964 0.7299159  0.72009706
6908188  0.6999007  0.8030479  0.8717203  0.91997755 0.9415804
92378193 0.8827179  0.7960201  0.7542531  0.765417   0.81297237
8639144  0.85365194 0.87227625 0.8711927  0.8469274  0.82483125
7834478  0.7257991  0.66324997 0.56517315 0.51595294 0.74217886
82508945 0.8304105  0.84580135 0.8006355  0.71989405 0.72802395
7129833  0.6819832  0.680199   0.6678288  0.6449181  0.6375627
6487985  0.68171835 0.7000089  0.70721495 0.69808906 0.66975635
63903147 0.5951989  0.54715884 0.50906384 0.45722976 0.43006176
42957345 0.4731996  0.7453185  0.8524048  0.900733   0.9271643
9248466  0.9129058  0.89590484 0.8523881  0.7818227  0.6722888
48106965 0.44471014 0.43823457 0.4625653  0.6224104  0.7422249
809158   0.85843897 0.88452214 0.8908691  0.8922996  0.8687321
8322132  0.8001987  0.78922004 0.8178488  0.83405757 0.8331606
8236217  0.811728   0.79307294 0.78117687 0.76377225 0.7469303
74157465 0.73469305 0.7233071  0.7131527  0.69626886 0.6774114
6608803  0.6332834  0.59496063 0.55379844 0.49995217 0.450985
43374065 0.42910945 0.42806736 0.42700282 0.42616615 0.42536104
4238594  0.42289823 0.4216551  0.42038485 0.41937658 0.41846162
41782823 0.41667438 0.41555488 0.41477823 0.41428855 0.41431412
41406628 0.41381082 0.413691   0.41342783 0.41324723 0.41312596
41292593 0.41275102 0.4126214  0.41240737 0.41216478 0.41194892
411753   0.41167387 0.41160753 0.4115604  0.41150934 0.41136965
41132134 0.41131464 0.41130513 0.41129664 0.4112255  0.4111643
41115648 0.41116136 0.41116375 0.41114515 0.4111263  0.4111229
41111296 0.41109738 0.41108972 0.41107145 0.41106758 0.41105795
41103283 0.41102085 0.4110065  0.41099402 0.41099116 0.4109831
4109753  0.4109701  0.41096306 0.4109565  0.41095543 0.41096187
41096365 0.41096386 0.41096056 0.41095525 0.41095516 0.41095757
4109606  0.41096318 0.4109616  0.41096038 0.41095707 0.41095132
41095033 0.4109494  0.41095105 0.41095334 0.41095477 0.41095608
41095537 0.4109545  0.41095427 0.41095343 0.4109498  0.41094393
41094047 0.41093597 0.4109335  0.41093475 0.41093308 0.410932
41093022 0.41092807]]

Now, we’re ready to decode! We’ll use viterbi_discriminative here, since the inputs are state likelihoods conditional on data (in our case, data is rms).

states = librosa.sequence.viterbi_discriminative(full_p, transition)

# sphinx_gallery_thumbnail_number = 5
fig, ax = plt.subplots(nrows=2, sharex=True)
librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr, ax=ax[0])
ax[0].label_outer()
ax[1].step(times, p>=0.5, label='Frame-wise')
ax[1].step(times, states, linestyle='--', color='orange', label='Viterbi')
ax[1].set(ylim=[0, 1.05])
ax[1].legend()

Note how the Viterbi output has fewer state changes than the frame-wise predictor, and it is less sensitive to momentary dips in energy. This is controlled directly by the transition matrix. A higher self-transition probability means that the decoder is less likely to change states.

Total running time of the script: (0 minutes 2.134 seconds)

Gallery generated by Sphinx-Gallery