comp2



Presentation for Comprehensive 2

On GitHub: tbekolay/comp2

Biologically inspired methods in speech recognition and synthesis:

Closing the loop

Trevor Bekolay
Centre for Theoretical Neuroscience
Follow along: http://bekolay.org/comp2

https://www.youtube.com/watch?v=uKKpjvPd6Xo

$$\underset{u}{\operatorname{argmin}} \: \overbrace{\sum_{n=1}^{N} C(u_n, t_n)}^{\color{blue}{\text{Target cost}}} + \overbrace{\sum_{n=2}^{N} C(u_{n-1}, u_n)}^{\color{blue}{\text{Join cost}}}$$
Lots of data + lots of compute power
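The cost above is usually minimized with dynamic programming over the candidate units for each target. The following is a minimal sketch of that search, not the method of any particular synthesizer. The function name `select_units`, the toy numeric "units", and the absolute-difference cost functions are assumptions made for illustration only.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (total_cost, units) minimizing target cost plus join cost.

    candidates[n] lists the database units that may realize targets[n].
    """
    # best[n][k] = (cost of the cheapest path ending in candidates[n][k],
    #               index of the predecessor unit on that path)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for n in range(1, len(targets)):
        row = []
        for u in candidates[n]:
            # Cheapest predecessor for u, including the join cost.
            cost, back = min(
                (best[n - 1][j][0] + join_cost(prev, u), j)
                for j, prev in enumerate(candidates[n - 1])
            )
            row.append((cost + target_cost(u, targets[n]), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    k = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    total = best[-1][k][0]
    path = []
    for n in range(len(targets) - 1, -1, -1):
        path.append(candidates[n][k])
        k = best[n][k][1]
    path.reverse()
    return total, path


# Toy example (hypothetical): units and targets are single numbers.
targets = [1.0, 2.0, 3.0]
candidates = [[0.0, 1.0], [2.0, 5.0], [3.0, 10.0]]
total, path = select_units(
    targets,
    candidates,
    target_cost=lambda u, t: abs(u - t),
    join_cost=lambda prev, u: abs(prev - u),
)
print(total, path)
```

The search is quadratic in the number of candidates per target, which is one reason this approach leans on large databases and plenty of compute.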

Goal

Train an articulatory synthesizer to repeat utterances.

Method

\begin{eqnarray} \mathbf{x} \Rightarrow & \text{End-effector position} & \Rightarrow \text{Auditory features} \\ \mathbf{q} \Rightarrow & \text{Joint coordinates} & \Rightarrow \text{Control parameters} \\ \end{eqnarray}
$\text{Control signal:} \quad (\mathbf{q}, \mathbf{\dot{q}}, \mathbf{\ddot{x}_d}) \rightarrow \mathbf{u}$
$$\text{Evaluation: } E = \int d(\mathbf{x_d}, \mathbf{x}) \, dt$$
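As a rough illustration of the evaluation measure, the sketch below approximates $E$ for sampled trajectories. It assumes that $d$ is Euclidean distance, that the integral runs over time, and that both trajectories are sampled at the same uniform interval `dt`. None of these choices is fixed by the slide, and the name `tracking_error` is hypothetical.

```python
import math


def tracking_error(desired, actual, dt):
    """Trapezoidal approximation of E = integral of d(x_d, x) over time.

    desired and actual are equal-length sequences of feature vectors
    sampled every dt seconds; d is assumed to be Euclidean distance.
    """
    dists = [math.dist(x_d, x) for x_d, x in zip(desired, actual)]
    return sum(0.5 * (a + b) * dt for a, b in zip(dists, dists[1:]))


# Hypothetical trajectories: a constant offset of length 5 for 1 second.
desired = [(0.0, 0.0)] * 3
actual = [(3.0, 4.0)] * 3
print(tracking_error(desired, actual, dt=0.5))
```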

A phenomenological model of the synapse between the inner hair cell and auditory nerve: long-term adaptation with power-law dynamics. Zilany et al. The Journal of the Acoustical Society of America, 126:2390–2412, 2009.

http://www.cs.colostate.edu/~ericson/ericsonFinal.pdf
A computational model of filtering, detection, and compression in the cochlea. R. F. Lyon. In Proceedings of IEEE-ICASSP-82, 1282-1285, 1982.

Modeling consonant-vowel coarticulation for articulatory speech synthesis. Birkholz. PloS one, 8(4):e60603, 2013.

Normal Angry Scared

http://www.vocaltractlab.de/index.php?page=vocaltractlab-examples
http://www.vocaltractlab.de/index.php?page=birkholz-supplements

| Name | Description | Min. | Max. | Unit |
|------|-------------|------|------|------|
| $HX$ | Horizontal hyoid position | 0.0 | 1.0 | |
| $HY$ | Vertical hyoid position | -6.0 | -3.4 | cm |
| $JX$ | Horizontal jaw displacement | -0.5 | 0.0 | cm |
| $JA$ | Jaw angle | -7.0 | 0.0 | deg |
| $LP$ | Lip protrusion | -1.0 | 1.0 | |
| $LD$ | Vertical lip distance | -2.0 | 4.0 | cm |
| $VS$ | Velum shape | 0.0 | 1.0 | |
| $VO$ | Velic opening | -0.1 | 1.0 | |
| $TCX$ | Tongue body center X | -3.0 | 4.0 | cm |
| $TCY$ | Tongue body center Y | -3.0 | 1.0 | cm |
| $TTX$ | Tongue tip X | 1.5 | 5.5 | cm |
| $TTY$ | Tongue tip Y | -3.0 | 2.5 | cm |
| $TBX$ | Tongue blade X | -3.0 | 4.0 | cm |
| $TBY$ | Tongue blade Y | -3.0 | 5.0 | cm |
| $TRX$ | Tongue root X | -4.0 | 2.0 | cm |
| $TRY$ | Tongue root Y | -6.0 | 0.0 | cm |
| $TS1$ | Tongue side elevation 1 | -1.4 | 1.4 | cm |
| $TS2$ | Tongue side elevation 2 | -1.4 | 1.4 | cm |
| $TS3$ | Tongue side elevation 3 | -1.4 | 1.4 | cm |
| $TS4$ | Tongue side elevation 4 | -1.4 | 1.4 | cm |
| $MA1$ | Minimum area tongue back region | 0.0 | 0.3 | cm$^2$ |
| $MA2$ | Minimum area tongue tip region | 0.0 | 0.3 | cm$^2$ |
| $MA3$ | Minimum area lip region | 0.0 | 0.3 | cm$^2$ |
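A learned controller can easily propose values outside these ranges, so one plausible safeguard is to clamp each control parameter before it reaches the synthesizer. The sketch below copies a few rows from the table above. The dictionary `PARAM_RANGES` and the function `clip_params` are hypothetical names and are not part of the VocalTractLab API.

```python
# A few rows copied from the parameter table: name -> (min, max).
PARAM_RANGES = {
    "HX": (0.0, 1.0),    # horizontal hyoid position (unitless)
    "JA": (-7.0, 0.0),   # jaw angle, degrees
    "LP": (-1.0, 1.0),   # lip protrusion (unitless)
    "TTX": (1.5, 5.5),   # tongue tip X, cm
}


def clip_params(params):
    """Clamp each named control parameter to its allowed range."""
    clipped = {}
    for name, value in params.items():
        lo, hi = PARAM_RANGES[name]
        clipped[name] = min(max(value, lo), hi)
    return clipped


# Hypothetical controller output with two out-of-range values.
print(clip_params({"HX": 1.7, "JA": -3.0, "TTX": 0.0}))
```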

http://studywolf.wordpress.com/

\begin{eqnarray} \min_u C(\mathbf{u}) =& \mathbf{u^T N u} \quad \text{s.t. } \mathbf{J \ddot{q}} = \mathbf{\ddot{x}}_\text{ref} - \mathbf{\dot{J} \dot{q}} \\ \mathbf{\ddot{x}}_\text{ref} =& \mathbf{\ddot{x}_d} + \mathbf{K}_d (\mathbf{\dot{x}}_d - \mathbf{\dot{x}}) + \mathbf{K}_p (\mathbf{x}_d - \mathbf{x}) \end{eqnarray}

Learning to control in operational space. Peters & Schaal. The International Journal of Robotics Research 27(2), 197-212, 2008
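The second equation above is a feedforward term plus proportional-derivative feedback on the task-space error. Below is a minimal sketch of that one step, assuming the gain matrices $\mathbf{K}_p$ and $\mathbf{K}_d$ are scalar multiples of the identity (`kp` and `kd`). The function name and the example values are hypothetical, and the constrained minimization in the first equation is not implemented here.

```python
def reference_acceleration(x_d, xdot_d, xddot_d, x, xdot, kp, kd):
    """Compute xddot_ref = xddot_d + K_d (xdot_d - xdot) + K_p (x_d - x).

    All vector arguments are equal-length sequences of floats; kp and kd
    are scalar gains applied to every dimension (an assumption).
    """
    return [
        a_d + kd * (v_d - v) + kp * (p_d - p)
        for p_d, v_d, a_d, p, v in zip(x_d, xdot_d, xddot_d, x, xdot)
    ]


# Hypothetical 2-D example: the desired point is at rest at (1, 0),
# while the current state is at the origin and moving along dimension 1.
print(reference_acceleration(
    x_d=[1.0, 0.0], xdot_d=[0.0, 0.0], xddot_d=[0.0, 0.0],
    x=[0.0, 0.0], xdot=[0.5, 0.0], kp=100.0, kd=20.0,
))
```

The resulting reference acceleration is what the constraint in the first equation then maps back to joint space.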

Extension 1

Generalize several utterances of the same category

Evaluation: User studies

Learning movement primitives. Schaal et al. Robotics Research 561-572, 2005.
Learning attractor landscapes for learning motor primitives. Ijspeert et al. NIPS 2003: 1547-1554, 2003.

Extension 2

Classify utterances corresponding to different categories

Extensions of recurrent neural network language model. Mikolov et al. In ICASSP 2011, 2011.
Learning recurrent neural networks with Hessian-free optimization. Martens & Sutskever. ICML '11, 2011.

Thank you

This presentation: http://bekolay.org/comp2

My progress: http://github.com/tbekolay/audition
