
Dialnet


Abstract of Contributions to speech and language processing towards automatic speech recognizers with evolving dictionaries

Alejandro Coucheiro Limeres

  • Automatic speech recognizers now perform remarkably well on many challenging tasks. This has led to their incorporation into user applications that are used very frequently. Their presence in everyday life is reinforced by the fact that speech is one of the most natural forms of human communication. Our interaction with machines is therefore evolving towards speech-based communication, both for simple tasks such as dictation or web searches and for more complex tasks such as human-machine dialogs for configuring, for example, domestic appliances.

    The acoustic and language conditions of the audio that serves as input to these systems can vary enormously from source to source. Consequently, considerable work on improving their acoustic and language models has continued over the last decades, with a recent boost from the resurgence of neural networks. On the acoustic side, effort focuses on dealing with any kind of noise that may be present in the audio, on recognizing the speech of any kind of speaker (including different accents and any other varying feature of the human voice), and on addressing additional conditions of the speech process, such as far-field recognition. On the language side, effort focuses on managing the large number of different topics and domains that can be present in the speech. Beyond a brute-force approach, in which a huge vocabulary and language model are used to try to cover all of these scenarios, interest is growing in systems that restrict their modeling capabilities so as to recognize certain topics or domains optimally. This restriction lets the language models concentrate on the language uses of interest; covering every use would otherwise require an impractically large (and likely less accurate) model.

    Furthermore, a variable language restriction that keeps tuning the model to the language characteristics of the current speech would give the recognition system an adaptation capability able to reach optimal performance regardless of the topic of the input speech. This kind of adaptation is the main focus of this thesis. More specifically, we are interested in an automatic, unsupervised adaptation to the current speech that does not require an explicit identification of the current topic or domain to drive the adaptation process. Instead, we want the adaptation to be driven by the vocabulary employed in the speech, so as to tune the dictionaries of the recognition systems accordingly (with effects on both their vocabularies and their language models). Beyond the in-vocabulary (IV) words that we may be able to decode from the speech, we are more interested in the words that are not present in the current vocabulary of the systems, that is, out-of-vocabulary (OOV) words, as they can implicitly indicate changes of topic or domain. We therefore propose strategies for detecting OOV terms in the speech, finding the best candidate words for them, and ultimately learning about their syntax and semantics so that they can be incorporated properly into the recognition systems. Considered together, the various OOV occurrences and their resolutions cause modifications that result in an adaptation.

    The strategies proposed in this thesis can be divided into two levels of operation: one that works at a local, static level and one that works at a dynamic, evolving level. They are called the Term Discovery strategy and the dynamic Term Discovery strategy, respectively.

    The processes involved in the Term Discovery strategy are the following:

    (1) Detection of OOV terms that might appear in the input speech. We contribute two OOV detection methods that can work in conjunction. The first employs an OOV word model defined in both acoustic and language terms, while the second is based on confidence analyses of the output word lattice delivered by the recognition systems.

    (2) Search for candidate words for every OOV detection. We contribute a search scheme that performs two different kinds of search, one acoustically driven and the other semantically driven. For this scheme, we take advantage of external knowledge sources in which to look for the best candidates. We also propose a distributed representation scheme for the resources found in graph-organized corpora. This type of representation can benefit the search and could also be employed in other semantic tasks, such as those found in natural language processing and understanding.

    (3) Correction of the output transcription with the best candidate found, if any. We contribute a series of candidate scores that can improve the decision on whether a candidate is suitable enough to replace the original content of a detected OOV region.

    As regards the dynamic Term Discovery strategy, we propose a series of processes to be executed iteratively when needed, as a reaction to the speech being decoded:

    (1) Continuous collection of the terms retrieved by the plain Term Discovery strategy, so as to assess whether some terms become interesting enough to be added to the system's vocabulary. We contribute a scoring scheme that takes into account both the scores that the plain Term Discovery strategy gave to a word and the time elapsed between the moments when that word was retrieved, so as to give more importance to the new terms used more recently.

    (2) Selection of the most interesting new terms from the previous collection to add to the system's vocabulary, and selection of the least interesting IV terms to remove from it. We contribute schemes for both selections. For the words to add, we verify whether there is enough training material about the new terms in external sources, and we can also reconsider whether the transcription corrections made with a term were reliable enough, so as to refine our decision. For the words to remove, we consider both which IV words are not being used enough in the input speech and which words do not fit the current state of the system's language model sufficiently well.

    (3) Update of the vocabulary and language model of the recognition systems according to the previous word selections. We contribute an update scheme that considers both the previous language model of the system and a language model built from external knowledge source texts that contain the new terms, together with an interpolation scheme that combines both models into a new language model with the designated vocabulary.

    The proposed strategies have been evaluated under realistic experimental frameworks, in which we employed state-of-the-art automatic speech recognizers designed with different vocabulary sizes, together with large external knowledge sources. The speech test corpora contain a great variety of speakers and natural, spontaneous speech in which multiple topics are discussed. These features are consistent with the conditions under which we want our strategies to offer benefits. The evaluation results allow us to measure how much the recognition systems equipped with our strategies improve in comparison with the systems that lack them. We achieved significant improvements over the baseline systems for both strategies and in most of the experimental configurations.

    Lastly, we were also able to study the behavior of the dynamic systems over time, in order to assess how fast and in what manner the desired adaptation takes place. We observed that, on average, the dynamic systems offered significant improvements over the baseline systems (and even over the systems equipped with the first, static strategy) after just a few hours of operation and for very different speech corpora. In addition, the type and number of words that the dynamic systems added and deleted over time were consistent with the expected behavior (mainly, adding words that actually appeared in the input speech and removing those that did not fit well enough in the current state of the evolving language model), which accounts for the notable benefits that were measured.
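
Two steps of the dynamic strategy described above, the recency-aware term scoring and the language model interpolation, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the formulas, so the exponential half-life decay, the restriction to unigram probabilities, the renormalization step, and every name used here (`recency_weighted_scores`, `interpolate_unigrams`, `half_life_s`, `weight_old`) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def recency_weighted_scores(retrievals, now, half_life_s=3600.0):
    """Accumulate one score per term, discounting older retrievals.

    retrievals: iterable of (term, score, timestamp_s) tuples, where score
    stands for the score a term received when it was retrieved.
    ASSUMPTION: each contribution is halved every `half_life_s` seconds;
    the decay actually used in the thesis is not stated in the abstract.
    """
    totals = defaultdict(float)
    for term, score, timestamp in retrievals:
        age = max(0.0, now - timestamp)
        totals[term] += score * 0.5 ** (age / half_life_s)
    return dict(totals)


def interpolate_unigrams(old_lm, new_lm, vocabulary, weight_old=0.5):
    """Linearly interpolate two unigram models over a designated vocabulary.

    old_lm, new_lm: dicts mapping word -> probability.
    ASSUMPTION: unigrams only, with renormalization after restricting to
    `vocabulary`; a real system would interpolate full n-gram models.
    """
    mixed = {
        word: weight_old * old_lm.get(word, 0.0)
        + (1.0 - weight_old) * new_lm.get(word, 0.0)
        for word in vocabulary
    }
    total = sum(mixed.values())
    if total == 0.0:
        raise ValueError("no probability mass on the designated vocabulary")
    return {word: prob / total for word, prob in mixed.items()}
```

In this sketch, a term retrieved twice recently outranks a term retrieved once long ago with a similar score, which mirrors the stated goal of giving more importance to new terms used more recently. The interpolation weight plays the role of a tuning knob between the previous model and the model built from external texts.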

