To identify the attended speaker from single-trial EEG recordings in an acoustic scenario with two competing speakers, an auditory attention decoding (AAD) method has recently been proposed. The AAD method requires the clean speech signals of both the attended and the unattended speaker as reference signals for decoding. However, in practice only the binaural signals, containing several undesired...
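The abstract is truncated, but the decision step common to correlation-based AAD methods can be illustrated. The sketch below assumes a pre-trained linear decoder that reconstructs the attended speech envelope from multichannel EEG; all names, shapes, and the two-speaker setup are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def decode_attention(eeg, decoder, env_a, env_b):
    """Correlation-based AAD decision (illustrative sketch):
    reconstruct the attended speech envelope from EEG with a linear
    decoder, then pick the speaker whose envelope correlates best
    with the reconstruction.

    eeg:     (T, C) EEG samples by channel
    decoder: (C,)   pre-trained linear decoder weights
    env_a/b: (T,)   speech envelopes of the two competing speakers
    """
    recon = eeg @ decoder                  # (T,) reconstructed envelope
    r_a = np.corrcoef(recon, env_a)[0, 1]  # Pearson correlation vs. speaker A
    r_b = np.corrcoef(recon, env_b)[0, 1]  # Pearson correlation vs. speaker B
    return "A" if r_a > r_b else "B"
```

In practice the decoder would be trained (e.g., by regularized least squares over time-lagged EEG), and the paper's point is precisely that the clean reference envelopes `env_a`/`env_b` are not available, only binaural mixtures.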
Steganographic systems are used to transmit hidden data inside an original signal. The article describes an algorithm for hidden data transmission using a speech signal as the carrier. The echo method is used for data embedding. To improve the decoding efficiency of the embedded data, a voicing-correction procedure and an informed-coding mechanism were developed and implemented...
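The echo method mentioned above has a standard textbook form: a faint delayed copy of the signal is added, with the delay encoding one bit, and the decoder reads the delay back from the real cepstrum. The sketch below shows that generic scheme, not the article's specific algorithm; the delays, echo gain, and function names are illustrative assumptions.

```python
import numpy as np

def embed_bit(frame, bit, d0=100, d1=150, alpha=0.5):
    """Echo hiding: add a faint echo whose delay (in samples) encodes one bit."""
    d = d1 if bit else d0
    echo = np.zeros_like(frame)
    echo[d:] = alpha * frame[:-d]          # delayed, attenuated copy
    return frame + echo

def decode_bit(frame, d0=100, d1=150):
    """Detect the echo delay via the real cepstrum and read the bit back:
    an echo at delay d produces a peak at quefrency d."""
    spec = np.fft.rfft(frame)
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    return int(cep[d1] > cep[d0])
```

Real systems process the signal frame by frame and, as the abstract notes, the raw scheme degrades on unvoiced frames, which is what voicing correction and informed coding are meant to address.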
The convolutive Non-Negative Matrix Factorization (NMF) model factorizes a given audio spectrogram using frequency templates with a temporal dimension. In this paper, we present a convolutional auto-encoder model that acts as a neural network alternative to convolutive NMF. Using the modeling flexibility granted by neural networks, we also explore the idea of using a Recurrent Neural Network in the encoder...
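The convolutive NMF approximation the abstract refers to can be written as V ≈ Σ_t W_t · shift_t(H), where each W_t is a set of frequency templates and shift_t delays the activations by t frames. A minimal reconstruction sketch (shapes and names are assumptions for illustration):

```python
import numpy as np

def conv_nmf_reconstruct(W, H):
    """Convolutive NMF reconstruction: V_hat = sum_t W[t] @ shift(H, t).

    W: (T, F, K) frequency templates with temporal extent T
    H: (K, N)    per-component activations over N frames
    returns: (F, N) approximated spectrogram
    """
    T, F, K = W.shape
    _, N = H.shape
    V = np.zeros((F, N))
    for t in range(T):
        Ht = np.zeros_like(H)
        Ht[:, t:] = H[:, :N - t] if t else H   # shift activations right by t
        V += W[t] @ Ht
    return V
```

With T = 1 this reduces to standard NMF, V ≈ W·H; the auto-encoder in the paper replaces the multiplicative-update factorization with learned encoder/decoder convolutions.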
Research shows that speech dereverberation (SD) with Deep Neural Networks (DNN) achieves state-of-the-art results by learning a spectral mapping; however, this approach fails to characterize the local temporal-spectral structures (LTSS) of the speech signal and requires a large storage space that is impractical in real applications. In contrast, the Convolutional Neural Network (CNN) offers a...
This paper describes Transcriber, a tool that automatically transcribes interviews in Indonesian using speech-to-text and speaker diarization technology. The main feature of the software is generating interview transcriptions automatically, with an option to group utterances by speaker when required. Transcriber is designed to work in two modes that give users the freedom to provide...
Hybrid digital-analog (HDA) architectures have been widely developed for efficient digital transmission of analog speech, audio, or video data. By combining the advantages of digital and analog components, HDA systems achieve better performance than purely analog or purely digital schemes across a wide range of channel conditions. However, the HDA systems described in previous works are mostly designed for continuous-valued...
In this paper we describe a systematic procedure to implement a two-stage keyword spotting system (KWS). In the first stage, a phonetic decoding of continuous speech is obtained using a CD-DNN-HMM model built with the Kaldi toolkit. In the second stage, these phonetic transcriptions serve to construct a system that searches for keywords embedded in continuous speech using the classification...
In harsh channel conditions, the quality of synthetic speech at low bit rates is severely degraded. To improve the robustness of the vocoder and make it more resilient to errors on a random channel, unequal error protection (UEP) channel coding is usually adopted. However, in cases where the errors cannot be corrected, UEP channel coding will not improve the quality of the synthetic...
Equipped with selective auditory attention (SAA), people are able to rapidly shift their attention to auditory events of interest. Although abstract neuroimaging paradigms are fundamental for exploring the neural basis of SAA, whether those findings are valid in a more naturalistic condition and how the types of auditory stimuli affect SAA are largely unknown. Here we propose a brain decoding study...
Most traditional template-matching keyword recognition methods need no training data and rely only on frame matching; however, recognition is relatively slow, which prevents practical use. LVCSR-based methods must convert the speech signal into text before recognition, and this conversion has a major impact on the final recognition performance. In this paper, we propose a method...
With the completion of the IARPA Babel program, it is possible to systematically analyze the performance of speech recognition systems across a wide variety of languages. We select 16 languages from the dataset and compare performance using a deep neural network-based acoustic model. The focus is on keyword spotting using the actual term-weighted value (ATWV) metric. We demonstrate that ATWV is keyword...
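The ATWV metric named above has a fixed definition from the NIST spoken term detection evaluations: one minus the keyword-averaged sum of the miss probability and a heavily weighted false-alarm probability. A minimal sketch of that computation (the per-keyword count format and the speech-duration parameter are illustrative assumptions):

```python
def atwv(stats, beta=999.9, t_speech=36000.0):
    """Actual term-weighted value, per the NIST STD definition:
    ATWV = 1 - mean_k [ P_miss(k) + beta * P_fa(k) ].

    stats:    list of (n_true, n_miss, n_fa) tuples, one per keyword
    beta:     false-alarm weight (999.9 in the NIST evaluations)
    t_speech: total speech duration in seconds (possible FA trials)
    """
    total = 0.0
    for n_true, n_miss, n_fa in stats:
        p_miss = n_miss / n_true                 # missed true occurrences
        p_fa = n_fa / (t_speech - n_true)        # false alarms per trial
        total += p_miss + beta * p_fa
    return 1.0 - total / len(stats)
```

Because every keyword is weighted equally regardless of frequency, a single rare keyword that is entirely missed costs the same as a common one, which is one reason ATWV behaves so differently across languages and keyword lists.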
A multi-stream framework with deep neural network (DNN) classifiers is applied to improve automatic speech recognition (ASR) in environments with different reverberation characteristics. We propose a room parameter estimation model to establish a reliable combination strategy which performs on either DNN posterior probabilities or word lattices. The model is implemented by training a multilayer perceptron...
Spoken term detection, especially of out-of-vocabulary (OOV) keywords, benefits from the use of sub-word systems. We experiment with different language-independent approaches to sub-word unit generation, generating both syllable-like and morpheme-like units, and demonstrate how the performance of syllable-like units can be improved by artificially increasing the number of unique units. The effect...
Current diarization algorithms are commonly applied to the outputs of single non-moving microphones. They do not explicitly identify the content of overlapped segments from multiple speakers or acoustic events. This paper presents an acoustic environment aware child-adult diarization applied to the audio recorded by a single microphone attached to moving targets under realistic high noise conditions...
Patients with locked-in syndrome (fully paralyzed but aware) struggle with daily life and communication. Providing a level of communication offers these patients a chance to resume a meaningful life. Current brain-computer interface (BCI) communication requires users to build words from single letters selected on a screen, which is extremely inefficient. Faster approaches for their speech communication...
This article addresses the problem of continuous speech recognition from visual information only, without exploiting any audio signal. Our approach combines a video camera and an ultrasound imaging system for monitoring simultaneously the speaker's lips and the movement of the tongue. We investigate the use of convolutional neural networks (CNN) to extract visual features directly from the raw ultrasound...
This paper presents a novel far-field voice trigger algorithm utilizing DNN with the objective function of state-level minimum Bayes risk for training, customizing the decoding network to absorb the ambient noise and background speech. We adopt a two-stage classification strategy to integrate the phonetic knowledge and model-based classification into detecting wake-up words. Experimental results of...
In this paper we present an extension of our previously described neural machine translation based system for punctuated transcription. This extension allows the system to map from per frame acoustic features to word level representations by replacing the traditional encoder in the encoder-decoder architecture with a hierarchical encoder. Furthermore, we show that a system combining lexical and acoustic...
In this paper we aim to enhance keyword search for conversational telephone speech under low-resourced conditions. Two techniques to improve the detection of out-of-vocabulary keywords are assessed in this study: using extra text resources to augment the lexicon and language model, and via subword units for keyword search. Two approaches for data augmentation are explored to extend the limited amount...
Adapting acoustic models to speakers has been shown to greatly improve performance on many tasks. Among adaptation approaches, exploiting auxiliary features that characterize speakers or environments has received great attention because it allows rapid adaptation, i.e., adaptation with a limited amount of speech data such as a single utterance. However, the auxiliary features are usually computed in batch...