Viterbi decoding

This notebook demonstrates how to use Viterbi decoding to impose temporal smoothing on frame-wise state predictions.

Our working example will be the problem of silence/non-silence detection.

# Code source: Brian McFee
# License: ISC

##################
# Standard imports
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import librosa

import librosa.display

Load an example signal

y, sr = librosa.load('audio/sir_duke_slow.mp3')


# And compute the spectrogram magnitude and phase
S_full, phase = librosa.magphase(librosa.stft(y))


###################
# Plot the spectrum
plt.figure(figsize=(12, 4))
librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr)
plt.colorbar()
plt.tight_layout()


As you can see, there are periods of silence and non-silence throughout this recording.

# As a first step, we can plot the root-mean-square (RMS) curve
rms = librosa.feature.rms(y=y)[0]

# Pass sr explicitly so the time axis stays correct for any sampling rate
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr)

plt.figure(figsize=(12, 4))
plt.plot(times, rms)
plt.axhline(0.02, color='r', alpha=0.5)
plt.xlabel('Time')
plt.ylabel('RMS')
plt.axis('tight')
plt.tight_layout()

# The red line at 0.02 indicates a reasonable threshold for silence detection.
# However, the RMS curve occasionally dips below the threshold momentarily,
# and we would prefer the detector to not count these brief dips as silence.
# This is where the Viterbi algorithm comes in handy!

Next, we will convert the raw RMS score into a likelihood (probability) by logistic mapping:

\(P[V=1 | x] = \frac{\exp(x - \tau)}{1 + \exp(x - \tau)}\)

where \(x\) denotes the RMS value and \(\tau=0.02\) is our threshold. The variable \(V\) indicates whether the signal is non-silent (1) or silent (0).

In the code, we’ll also divide the thresholded RMS \(x - \tau\) by the standard deviation of the RMS curve, which expands the range of the probability vector:

r_normalized = (rms - 0.02) / np.std(rms)
p = np.exp(r_normalized) / (1 + np.exp(r_normalized))
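The expression above is the logistic sigmoid of the normalized RMS. As a minimal sketch, the same mapping can be written in a form that avoids overflow when the input is large. The helper name `logistic` and the toy RMS values are illustrative assumptions, not part of librosa or of the example recording.

```python
import numpy as np

def logistic(z):
    """Numerically stable logistic sigmoid, 1 / (1 + exp(-z)), for array input."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For non-negative inputs, exp(-z) cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For negative inputs, exp(z) cannot overflow
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

# Toy RMS values (made up for illustration), with threshold tau = 0.02
rms_toy = np.array([0.0, 0.02, 0.05, 0.2])
z = (rms_toy - 0.02) / np.std(rms_toy)
p_toy = logistic(z)
```

A frame whose RMS equals the threshold maps to a probability of exactly 0.5, and larger RMS values map to larger probabilities of non-silence.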

# We can plot the probability curve over time:

plt.figure(figsize=(12, 4))
plt.plot(times, p, label='P[V=1|x]')
plt.axhline(0.5, color='r', alpha=0.5, label='Decision threshold')
plt.xlabel('Time')
plt.axis('tight')
plt.legend()
plt.tight_layout()

This looks much like the first plot, but with the decision threshold shifted to 0.5. A simple silence detector would classify each frame independently of its neighbors, which would result in the following plot:

plt.figure(figsize=(12, 6))
ax = plt.subplot(2,1,1)
librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr)
plt.subplot(2,1,2, sharex=ax)
plt.step(times, p>=0.5, label='Non-silent')
plt.xlabel('Time')
plt.axis('tight')
plt.ylim([0, 1.05])
plt.legend()
plt.tight_layout()


We can do better using the Viterbi algorithm. We’ll use state 0 to indicate silent, and 1 to indicate non-silent. We’ll assume that a silent frame is equally likely to be followed by silence or non-silence, but that non-silence is slightly more likely to be followed by non-silence. This is accomplished by building a self-loop transition matrix, where transition[i, j] is the probability of moving from state i to state j in the next frame.

transition = librosa.sequence.transition_loop(2, [0.5, 0.6])
print(transition)

Out:

[[0.5 0.5]
 [0.4 0.6]]
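To make the structure of this matrix explicit, here is a hand-rolled sketch of a self-loop transition matrix. The helper `self_loop_transition` is hypothetical and only mirrors the behavior visible in the output above: each state keeps its self-loop probability on the diagonal, and the remaining probability mass is spread evenly over the other states.

```python
import numpy as np

def self_loop_transition(p_self):
    """Build a row-stochastic matrix from per-state self-loop probabilities."""
    p_self = np.asarray(p_self, dtype=float)
    n = len(p_self)
    trans = np.empty((n, n))
    for i in range(n):
        # Leftover probability is shared equally by the n - 1 other states
        trans[i, :] = (1.0 - p_self[i]) / (n - 1)
        # The diagonal holds the probability of staying in state i
        trans[i, i] = p_self[i]
    return trans

transition_toy = self_loop_transition([0.5, 0.6])
```

Each row sums to 1, so row i can be read as the distribution over the next state given that the current state is i.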

Our p variable only indicates the probability of non-silence, so we need to also compute the probability of silence as its complement.

full_p = np.vstack([1 - p, p])
print(full_p)

Out:

[[0.666662   0.66666806 0.66667175 0.6666764  0.6666662  0.6666547
  0.66665447 0.6666441  0.6666499  0.6666609  0.6666493  0.6666572
  0.6666585  0.65281963 0.5593039  0.50396335 0.4687572  0.44503105
  0.44209725 0.44649702 0.45015687 0.45296526 0.47192842 0.50088567
  0.533761   0.57154465 0.60663986 0.63306737 0.6560469  0.66306615
  0.6656199  0.66632414 0.66658187 0.6666868  0.6666943  0.6666857
  0.6666565  0.66665673 0.66667044 0.6666879  0.666713   0.6667024
  0.6666863  0.66668093 0.66668844 0.6667025  0.6666882  0.66659105
  0.666566   0.66656137 0.66658425 0.6666702  0.6666781  0.6666808
  0.66664445 0.6666428  0.66665864 0.6666496  0.66597325 0.6332568
  0.565287   0.51315165 0.46493763 0.4289661  0.41747576 0.43100828
  0.4557107  0.44279826 0.40919942 0.36801213 0.33117193 0.32734305
  0.33589602 0.34995198 0.36854565 0.38243645 0.3977514  0.40770727
  0.41529024 0.4371997  0.44843513 0.45536363 0.467516   0.45526844
  0.45985198 0.47037637 0.4920172  0.5279733  0.5675844  0.6191045
  0.65206957 0.6615278  0.66489464 0.66609466 0.66653544 0.6666804
  0.6665775  0.5691235  0.46814352 0.40058243 0.34356403 0.31266928
  0.30578673 0.33672208 0.38994128 0.3435294  0.27476007 0.21300626
  0.1649099  0.15554106 0.15957916 0.15955073 0.16644585 0.17588961
  0.18820965 0.21075332 0.25177717 0.2630303  0.26994652 0.26190835
  0.22955096 0.22823179 0.23443395 0.24548042 0.27291352 0.33201182
  0.40812153 0.5046619  0.61227214 0.6514728  0.6617671  0.66529995
  0.6661643  0.6663485  0.666373   0.6662399  0.666326   0.6662947
  0.66633373 0.6665311  0.66661215 0.6667119  0.66673124 0.658774
  0.5928714  0.5039606  0.45510268 0.419837   0.40622658 0.42472404
  0.44260895 0.46210474 0.47377384 0.48761213 0.49827498 0.5211023
  0.5527549  0.57273364 0.5550959  0.51402926 0.4782467  0.45356333
  0.44303155 0.4642986  0.4916439  0.5032611  0.51925266 0.49856973
  0.47524315 0.47296286 0.4741637  0.47653228 0.4792863  0.48997647
  0.5133821  0.5423028  0.5859402  0.6291387  0.6509739  0.6619408
  0.6645152  0.6653484  0.6654525  0.61951697 0.4705053  0.3777169
  0.32032567 0.284069   0.2920292  0.31536585 0.33947343 0.3635468
  0.38816082 0.41106814 0.4289037  0.45955908 0.46007532 0.45066845
  0.4351374  0.40234917 0.39231366 0.39058787 0.39216518 0.4247653
  0.42306644 0.38793647 0.35002232 0.30539632 0.29092264 0.29614902
  0.31410837 0.3347454  0.3778159  0.4375978  0.5079247  0.60124874
  0.6505161  0.6203172  0.42866933 0.30231488 0.2255764  0.17425478
  0.15799701 0.16310495 0.18772101 0.21032327 0.21690273 0.21076733
  0.18132639 0.16777033 0.17310756 0.18597764 0.20130521 0.21142936
  0.22859299 0.27624983 0.34608388 0.44942784 0.5088073  0.28546447
  0.17546159 0.10721201 0.07273531 0.0705992  0.07627285 0.09409553
  0.11961037 0.16860425 0.24203545 0.3360399  0.50441825 0.6321925
  0.652762   0.66058636 0.66487104 0.6659316  0.6660775  0.43917018
  0.18581885 0.09230238 0.05653769 0.04559535 0.05671984 0.07201564
  0.10393113 0.09749347 0.08757359 0.08931887 0.07750875 0.09155202
  0.11476034 0.13472623 0.1479373  0.15648854 0.14926326 0.12743145
  0.12095761 0.13677686 0.11081856 0.08474755 0.06102765 0.04243284
  0.04023421 0.04600978 0.05659968 0.07429123 0.12322563 0.208314
  0.35310817 0.58769083 0.65173566 0.65917313 0.6629694  0.6642693
  0.6655723  0.66618633 0.6663922  0.66648924 0.66649926 0.6664754
  0.66647923 0.6664742  0.6664442  0.66638243 0.6663059  0.6663143
  0.6663816  0.6663896 ]
 [0.33333805 0.3333319  0.33332822 0.3333236  0.3333338  0.33334526
  0.33334553 0.33335593 0.33335012 0.3333391  0.33335072 0.33334276
  0.33334145 0.34718034 0.44069612 0.49603662 0.5312428  0.55496895
  0.55790275 0.553503   0.54984313 0.54703474 0.5280716  0.49911433
  0.46623895 0.42845535 0.39336014 0.36693263 0.34395307 0.33693385
  0.3343801  0.3336759  0.3334181  0.33331323 0.3333057  0.3333143
  0.33334354 0.33334324 0.3333296  0.3333121  0.333287   0.33329758
  0.3333137  0.33331904 0.33331153 0.33329752 0.3333118  0.33340892
  0.33343402 0.3334386  0.33341572 0.33332983 0.33332193 0.3333192
  0.33335555 0.33335724 0.33334136 0.33335045 0.33402675 0.36674318
  0.434713   0.48684838 0.5350624  0.5710339  0.58252424 0.5689917
  0.5442893  0.55720174 0.5908006  0.63198787 0.66882807 0.67265695
  0.664104   0.650048   0.63145435 0.61756355 0.6022486  0.5922927
  0.58470976 0.5628003  0.5515649  0.54463637 0.532484   0.54473156
  0.540148   0.5296236  0.5079828  0.4720267  0.43241563 0.38089547
  0.34793046 0.3384722  0.33510536 0.33390537 0.33346456 0.3333196
  0.33342248 0.4308765  0.5318565  0.59941757 0.65643597 0.6873307
  0.6942133  0.6632779  0.6100587  0.6564706  0.72523993 0.78699374
  0.8350901  0.84445894 0.84042084 0.8404493  0.83355415 0.8241104
  0.81179035 0.7892467  0.7482228  0.7369697  0.7300535  0.73809165
  0.77044904 0.7717682  0.76556605 0.7545196  0.7270865  0.6679882
  0.5918785  0.49533808 0.38772783 0.3485272  0.3382329  0.33470005
  0.3338357  0.33365148 0.33362702 0.33376008 0.333674   0.33370528
  0.33366627 0.3334689  0.33338785 0.33328804 0.33326873 0.341226
  0.4071286  0.49603936 0.5448973  0.580163   0.5937734  0.57527596
  0.55739105 0.53789526 0.52622616 0.5123879  0.501725   0.47889766
  0.44724515 0.4272664  0.4449041  0.4859707  0.5217533  0.54643667
  0.55696845 0.5357014  0.5083561  0.4967389  0.48074734 0.5014303
  0.52475685 0.52703714 0.5258363  0.5234677  0.5207137  0.51002353
  0.48661795 0.4576972  0.41405982 0.37086132 0.3490261  0.33805922
  0.3354848  0.33465162 0.3345475  0.38048306 0.5294947  0.6222831
  0.6796743  0.715931   0.7079708  0.68463415 0.6605266  0.6364532
  0.6118392  0.58893186 0.5710963  0.5404409  0.5399247  0.54933155
  0.5648626  0.5976508  0.60768634 0.60941213 0.6078348  0.5752347
  0.57693356 0.6120635  0.6499777  0.6946037  0.70907736 0.703851
  0.6858916  0.6652546  0.6221841  0.5624022  0.4920753  0.39875126
  0.3494839  0.3796828  0.57133067 0.6976851  0.7744236  0.8257452
  0.842003   0.83689505 0.812279   0.7896767  0.78309727 0.7892327
  0.8186736  0.8322297  0.82689244 0.81402236 0.7986948  0.78857064
  0.771407   0.7237502  0.6539161  0.55057216 0.49119267 0.71453553
  0.8245384  0.892788   0.9272647  0.9294008  0.92372715 0.9059045
  0.88038963 0.83139575 0.75796455 0.6639601  0.49558178 0.36780748
  0.347238   0.3394136  0.33512896 0.3340684  0.3339225  0.5608298
  0.81418115 0.9076976  0.9434623  0.95440465 0.94328016 0.92798436
  0.8960689  0.90250653 0.9124264  0.9106811  0.92249125 0.908448
  0.88523966 0.8652738  0.8520627  0.84351146 0.85073674 0.87256855
  0.8790424  0.86322314 0.88918144 0.91525245 0.93897235 0.95756716
  0.9597658  0.9539902  0.9434003  0.9257088  0.8767744  0.791686
  0.64689183 0.41230914 0.34826434 0.3408269  0.3370306  0.3357307
  0.33442768 0.3338137  0.33360776 0.3335108  0.33350074 0.33352455
  0.33352077 0.33352575 0.33355585 0.33361757 0.33369413 0.3336857
  0.3336184  0.3336104 ]]

Now, we’re ready to decode! We’ll use viterbi_discriminative here, since the inputs are state likelihoods conditional on data (in our case, data is rms).

states = librosa.sequence.viterbi_discriminative(full_p, transition)

# sphinx_gallery_thumbnail_number = 5
plt.figure(figsize=(12, 6))
ax = plt.subplot(2,1,1)
librosa.display.specshow(librosa.amplitude_to_db(S_full, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr)
plt.xlabel('')
ax.tick_params(labelbottom=False)
plt.subplot(2, 1, 2, sharex=ax)
plt.step(times, p>=0.5, label='Frame-wise')
plt.step(times, states, linestyle='--', color='orange', label='Viterbi')
plt.xlabel('Time')
plt.axis('tight')
plt.ylim([0, 1.05])
plt.legend()


Note how the Viterbi output has fewer state changes than the frame-wise predictor, and it is less sensitive to momentary dips in energy. This is controlled directly by the transition matrix. A higher self-transition probability means that the decoder is less likely to change states.
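The effect of the transition matrix can be reproduced on a toy input. The following is a simplified log-domain Viterbi sketch, not librosa's implementation: the function `viterbi_sketch`, the toy probabilities, and the two transition matrices are all assumptions made for illustration, and the sketch assumes uniform initial and marginal state distributions.

```python
import numpy as np

def viterbi_sketch(prob, transition):
    """Most likely state path from per-frame state probabilities.

    prob[s, t] is P[state s | frame t]; transition[i, j] is the
    probability of moving from state i to state j.
    """
    n_states, n_frames = prob.shape
    tiny = np.finfo(float).tiny  # guards against log(0)
    log_prob = np.log(prob + tiny)
    log_trans = np.log(transition + tiny)

    score = np.empty((n_frames, n_states))
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_prob[:, 0] - np.log(n_states)  # uniform initial state

    for t in range(1, n_frames):
        # cand[i, j]: best score ending in state i at t-1, then moving to j
        cand = score[t - 1][:, np.newaxis] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_prob[:, t]

    # Trace the best path backwards from the final frame
    states = np.empty(n_frames, dtype=int)
    states[-1] = np.argmax(score[-1])
    for t in range(n_frames - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states

# Toy probability of non-silence with a one-frame dip (made-up values)
p_toy = np.array([0.9, 0.9, 0.45, 0.9, 0.9])
full_p_toy = np.vstack([1 - p_toy, p_toy])

sticky = np.array([[0.9, 0.1], [0.1, 0.9]])    # strong self-loops
uniform = np.array([[0.5, 0.5], [0.5, 0.5]])   # no temporal preference

framewise = (p_toy >= 0.5).astype(int)
smoothed = viterbi_sketch(full_p_toy, sticky)
unsmoothed = viterbi_sketch(full_p_toy, uniform)
```

With the strong self-loops, the decoder keeps the dipping frame in the non-silent state, while the uniform transition matrix reduces to the frame-wise decision.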
