Cry segmentation is an essential preprocessing step in any infant crying diagnosis system. Besides crying sounds, which consist of expiration phases followed by short inspiration episodes, each recording of newborn cries also includes silent sections as well as other sounds, such as caregivers' speech, noise, and the sounds of medical equipment. This paper is devoted to a newly developed Empirical...
The robustness of speech recognizers towards noise can be increased by normalizing the statistical moments of the Mel-frequency cepstral coefficients (MFCCs), e.g. by using cepstral mean normalization (CMN) or cepstral mean and variance normalization (CMVN). The necessary statistics are estimated over a long time window, and often a complete utterance is chosen. Consequently, changes in the background...
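As a rough illustration of the utterance-level normalization mentioned above, the following sketch applies CMVN to a small matrix of cepstral frames. The function name `cmvn`, the `eps` guard, and the toy values are assumptions made for illustration; this is not the method proposed in the paper, only the standard per-utterance baseline it refers to.

```python
from statistics import fmean, pstdev

def cmvn(frames, eps=1e-10):
    """Normalize each coefficient to zero mean and unit variance.

    `frames` is a list of frames, each a list of cepstral coefficients.
    The statistics are estimated over the whole utterance.
    """
    if not frames:
        return []
    columns = list(zip(*frames))             # one tuple per coefficient
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]

# Toy utterance: three frames with two made-up coefficients each.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cmvn(utterance)
```

Dropping the division by the standard deviation gives plain CMN. Because the statistics cover the whole utterance, a change in background noise partway through is averaged into a single mean and variance.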
The human voice can serve as a password/key for access to various services. A speaker verification system verifies the speaker based on features extracted from the voice signal. In automated speaker verification, the speaker's voice signal is processed to extract speaker-specific information, which is used to generate a voiceprint, also known as a template, that cannot be replicated...
The GRBAS scale is a widely used subjective measure of voice quality. The aim of this paper is to investigate the correlation between the 'grade', 'roughness', 'breathiness', 'asthenia' and 'strain' dimensions of this scale and the objective measurements provided by the 'Analysis of Dysphonia in Speech and Voice' (ADSV) software package. To do this, voice recordings of 107 samples were collected in...
Parkinson's disease (PD) is a neurodegenerative disorder that is characterized by the loss of dopaminergic neurons in the midbrain. It has been demonstrated that about 90% of people with PD also develop speech impairments, exhibiting symptoms such as monotonic speech, low pitch intensity, inappropriate pauses, imprecise consonants and problems in prosody; although these are already identified problems,...
This paper presents the development of a speech recognition system for automatically recognizing fluently spoken digit strings in Northern Sotho. The digit strings can be isolated or connected/continuous with known or unknown length. The digit recognition system has been trained with the aim of satisfying its potential end-users. Our main research focus was to enhance the robustness of a connected-digits...
This paper presents a phone segmentation method that requires no prior knowledge of the text content. The proposed method performs unsupervised phone boundary detection based on a band-energy tracing technique. It outperforms previous work when applied to the TIMIT corpus. However, the performance degrades when the method is applied to a Mandarin Chinese speech database,...
In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure...
Unsupervised phone segmentation means that the phone boundaries in an utterance can be detected without prior knowledge of the text content. Usually, a spectral change in the speech signal implies the existence of a phone boundary. In this paper, the Delta Spectral Function (DSF) is defined for each frame to represent the variation of band energy for a specific band. Then a number of bands that...
In this work we present a scalable feature set which is obtained by fitting orthogonal polynomials to the normalized modulation spectrum of cepstral coefficients and which can be easily adapted to different classification tasks. The performance of the feature set is investigated in a hierarchically structured audio signal classification experiment and compared with other approaches reported in the...
Previous work in speech-based cognitive load classification has shown that the glottal source contains important information for cognitive load discrimination. However, the reliability of glottal flow features depends on the accuracy of the glottal flow estimation, which is a non-trivial process. In this paper, we propose the use of acoustic voice source features extracted directly from the speech...
While much work has been dedicated to exploring how best to incorporate the Ideal Binary Mask (IBM) in automatic speech recognition (ASR) for noisy signals, we demonstrate that the simple use of masked speech can outperform standard spectral reconstruction methods. We explore the effects of both the accuracy of the mask estimation and the strength of the language model on our results. The relative...
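For readers unfamiliar with the Ideal Binary Mask, the sketch below shows its usual definition: a time-frequency cell is kept when the local speech-to-noise ratio exceeds a threshold, which requires access to the clean speech and noise separately (hence "ideal"). The function names, the 0 dB default threshold, and the toy power values are assumptions for illustration and are not taken from the paper.

```python
def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Return 1 for cells whose local SNR exceeds `threshold_db`, else 0.

    Both inputs are lists of frames, each a list of per-band powers.
    """
    ratio = 10.0 ** (threshold_db / 10.0)  # dB threshold as a power ratio
    return [
        [1 if s > ratio * n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]

def apply_mask(mixture_power, mask):
    """Zero out the mixture cells that the mask rejects."""
    return [
        [x * m for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(mixture_power, mask)
    ]

# Toy spectrogram: two frames, two bands (made-up powers).
speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 2.0], [0.5, 3.0]]
mixture = [[s + n for s, n in zip(sr, nr)] for sr, nr in zip(speech, noise)]

mask = ideal_binary_mask(speech, noise)
masked = apply_mask(mixture, mask)
```

In practice the mask has to be estimated from the noisy signal alone, which is why the accuracy of the mask estimation matters for the recognition results.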
Ambulatory devices can be used to detect heart diseases and save lives at critical times. These devices are based on sound classification, which usually adopts a suitable data mining algorithm. This paper investigates the performance of Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers in classifying sound samples. The SVM classifier makes use of a linear separating hyperplane to...
This paper compares the feature sets extracted using a time-frequency analysis approach and a frequency-time analysis approach for text-independent speaker identification. The Mel-frequency cepstral coefficient (MFCC) feature set and the inverted Mel-frequency cepstral coefficient (IMFCC) feature set are extracted using the time-frequency analysis approach. Temporal energy subband cepstral coefficient (TESBCC) feature...
Mel Frequency Cepstral Coefficients (MFCC) are widely used in speech recognition and speaker identification. MFCC features are usually pre-processed before being used for recognition. One such pre-processing step is computing delta and delta-delta coefficients and appending them to the MFCCs to create the feature vector. Another is coefficient mean normalization. In this paper, the effect of these...
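The delta and delta-delta step described above can be sketched with the commonly used regression formula, in which the delta at frame t is a weighted slope over the neighbouring frames. The function names, the window half-width of 2, and the edge handling (repeating the first and last frames) are assumptions for illustration; the abstract does not say which variant the paper uses.

```python
def delta(frames, width=2):
    """Regression-based delta over +/- `width` frames.

    Frames beyond the edges are replaced by the first or last frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_deltas(frames, width=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = delta(frames, width)
    d2 = delta(d1, width)  # delta-delta is the delta of the deltas
    return [c + a + b for c, a, b in zip(frames, d1, d2)]

# Toy input: coefficient 0 rises by 1 per frame, coefficient 1 is constant.
ramp = [[float(t), 0.0] for t in range(12)]
features = add_deltas(ramp)
```

Each output frame is three times as long as the input frame: the static coefficients, then the deltas, then the delta-deltas. For the ramp, the delta away from the edges equals the slope of 1 and the delta-delta in the middle is 0.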
This paper describes several Sound-Packet segmentation techniques, which will facilitate Automatic Speech Recognition (ASR) for Bangla speech signals. The approximate duration of a sound-packet has been determined, and an envelope-detection method has been presented to determine the end-points of sound-packets. The 1st difference method, based on a moving average of the 1st difference of the signal, is then...
This paper proposes a new robust speech recognition method. Since the hidden Markov model (HMM) algorithm needs a lot of training computation, a dynamic time warping (DTW) algorithm based on a median filter is used instead in our system. Non-speech segments are removed using the short-term energy method, which improves recognition accuracy. The cepstral mean subtraction (CMS), running...
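To make the DTW idea concrete, here is a minimal sketch of the textbook dynamic-programming formulation, which aligns two sequences of different lengths without any training. It operates on scalar sequences with an absolute-difference local cost; the function name and the toy inputs are assumptions, and the median-filter variant mentioned in the abstract is not reproduced here.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])      # local distance
            cost[i][j] = d + min(
                cost[i - 1][j],               # a[i-1] matched again (b stalls)
                cost[i][j - 1],               # b[j-1] matched again (a stalls)
                cost[i - 1][j - 1],           # both sequences advance
            )
    return cost[n][m]

# A template and a time-stretched version of it align at zero cost.
template = [0.0, 1.0, 2.0]
stretched = [0.0, 1.0, 1.0, 2.0]
distance = dtw_distance(template, stretched)
```

In a recognizer, the scalar values would be replaced by feature vectors (for example MFCC frames) with a vector distance as the local cost, and the input would be assigned to the template with the lowest DTW cost.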
Automatic recognition of emotional states from the speech signal has attracted increasing attention in recent years. A number of techniques have been proposed that provide reasonably high accuracy in controlled studio settings. However, their performance degrades considerably when the speech signal is contaminated by noise. In this paper, we present a framework with adaptive noise...
Auditory based front-ends for speech recognition have been compared before, but this paper focuses on two of the most promising algorithms for noise robustness in automatic speech recognition (ASR). The feature sets are Zero-Crossings with Peak Amplitudes (ZCPA) and the recently introduced Power-Law Nonlinearity and Power-Bias Subtraction (PNCC). Standard Mel-Frequency Cepstral Coefficients (MFCC)...
In this paper, we report how classification accuracy in speech analysis of a clinical dataset is influenced by adding acoustic low-level descriptors (LLD) belonging to prosodic features (i.e. pitch, formants, energy, jitter, shimmer) and spectral features (i.e. spectral flux, centroid, entropy and roll-off), along with their delta (Δ) and delta-delta (Δ-Δ) coefficients, to two baseline features of...