Neural networks : the official journal of the International Neural Network Society | 2021

Learning to recognize while learning to speak: Self-supervision and developing a speaking motor.

 
 

Abstract


Traditionally, learning speech synthesis and speech recognition were investigated as two separate tasks. This separation hinders incremental development for concurrent synthesis and recognition, where partially-learned synthesis and partially-learned recognition must help each other throughout lifelong learning. This work is a paradigm shift-we treat synthesis and recognition as two intertwined aspects of a lifelong learning agent. Furthermore, in contrast to existing recognition or synthesis systems, babies do not need their mothers to directly supervise their vocal tracts at every moment during the learning. We argue that self-generated non-symbolic states/actions at fine-grained time level help such a learner as necessary temporal contexts. Here, we approach a new and challenging problem-how to enable an autonomous learning system to develop an artificial speaking motor for generating temporally-dense (e.g., frame-wise) actions on the fly without human handcrafting a set of symbolic states. The self-generated states/actions are Muscles-like, High-dimensional, Temporally-dense and Globally-smooth (MHTG), so that these states/actions are directly attended for concurrent synthesis and recognition for each time frame. Human teachers are relieved from supervising learner s motor ends. The Candid Covariance-free Incremental (CCI) Principal Component Analysis (PCA) is applied to develop such an artificial speaking motor where PCA features drive the motor. Since each life must develop normally, each Developmental Network-2 (DN-2) reaches the same network (maximum likelihood, ML) regardless of randomly initialized weights, where ML is not just for a function approximator but rather an emergent Turing Machine. The machine-synthesized sounds are evaluated by both the neural network and humans with recognition experiments. Our experimental results showed learning-to-synthesize and learning-to-recognize-through-synthesis for phonemes. This work corresponds to a key step toward our goal to close a great gap toward fully autonomous machine learning directly from the physical world.

Volume 143
Pages \n 28-41\n
DOI 10.1016/j.neunet.2021.05.006
Language English
Journal Neural networks : the official journal of the International Neural Network Society

Full Text