The aim of the spoken term detection task is to find occurrences of user-entered keywords in an archive of audio recordings. The techniques typically used are vocabulary-independent, relying only on the available acoustic information. In this scenario, however, we rely exclusively on the acoustic model, which is a drawback when it is unreliable, for example when the input is noisy. In...
This paper presents an audio keyword detection method for highlight retrieval in basketball video. The keywords include shoe-squeaking sounds, speech, cheering, long whistles and short whistles, which correspond to basketball game events. After feature analysis, the Simple Excellent Feature Combination based on the Pearson Correlation Coefficient (SEFC-PCC) is used to select efficient features, which contributes...
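The abstract does not spell out the SEFC-PCC procedure, but its core ingredient, ranking candidate features by their Pearson correlation with the class label, can be sketched as follows. This is an illustrative sketch only; the function names and data are hypothetical, not the paper's algorithm.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(feature_columns, labels, k):
    """Rank feature columns by |PCC| against the labels; keep the top-k indices."""
    scores = [(abs(pearson(col, labels)), i)
              for i, col in enumerate(feature_columns)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

For example, with three feature columns and binary labels, the column that tracks the labels most closely is selected first.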
Most existing research in the area of emotion recognition has focused on short segments or utterances of speech. In this paper we propose a machine learning system for classifying the overall sentiment of long conversations as Positive or Negative. Our system has three main phases: first, it divides a call into short segments; second, it applies machine learning to recognize the emotion of each...
This paper presents a system that gives a mobile robot the ability to recognize a target speaker's speech, even while the robot performs an action and multiple speakers are talking in the room. The problems associated with this system are twofold: (1) While the robot is moving, its joints inevitably generate ego-motion noise from their motors. (2) Recognizing target speech against other interfering speech...
Different kinds of features in the time, spectral and cepstral domains are used for musical genre classification. In this paper, through the fusion of short-term timbral features and long-term rhythmic features, we propose a novel method in which a musical genre vector is constructed using the likelihood ratios of GMMs (Gaussian Mixture Models) and a radar chart is applied to provide a visualized style...
Auditory-based front-ends for speech recognition have been compared before, but this paper focuses on two of the most promising algorithms for noise robustness in automatic speech recognition (ASR). The feature sets are Zero-Crossings with Peak Amplitudes (ZCPA) and the recently introduced Power-Law Nonlinearity and Power-Bias Subtraction (PNCC). Standard Mel-Frequency Cepstral Coefficients (MFCC)...
The automatic classification of audio data is an effective way to organize large-scale collections of audio files. In this paper, an automatic content-based audio classification model using Centroid Neural Networks (CNN) with a Divergence Measure is proposed. The Divergence-based Centroid Neural Network (DCNN) algorithm, which employs the divergence measure as its distance measure, is used for clustering...
A novel framework for background music identification is proposed in this paper. Given an audio signal that mixes background music with speech/noise, we identify the music part using source music data. Conventional methods that take the whole audio signal for identification are inappropriate in terms of efficiency and accuracy. In our framework, the audio content is filtered through speech...
This paper addresses the task of automatically detecting outcomes of social interaction patterns, using non-verbal audio cues in competitive role-playing games (RPGs). For our experiments, we introduce a new data set which features 3 hours of audio-visual recordings of the popular “Are you a Werewolf?” RPG. Two problems are approached in this paper: Detecting lying or suspicious behavior using non-verbal...
We describe experiments in visual-only language identification (VLID), in which only lip shape, appearance and motion are used to determine the language of a spoken utterance. In previous work, we had shown that this is possible in speaker-dependent mode, i.e. identifying the language spoken by a multi-lingual speaker. Here, by appropriately modifying techniques that have been successful in audio...
The Partitioned Feature-based Classifier (PFC) is proposed in this paper. PFC does not use the entire feature vector extracted from the original data at once to classify each datum, but instead uses only groups of related features to classify data separately. In the training stage, the contribution rate of each feature-vector group is derived from the accuracy of each feature...
Sound localization systems (SLS) identify the direction of a sound source. However, most approaches focus on near-field identification, i.e. 1–2 m. In this paper we develop a novel algorithm for far-field sound localization based on the average magnitude difference function (AMDF), thereby extending the distance to 5 m. The far-field SLS is implemented on a field-programmable gate array (FPGA)...
With the rapid growth in audio data volume, research in content-based audio retrieval has gained impetus in the last decade. Audio classification serves as the fundamental step towards it. Accuracy in classifying data relies on the strength of the features and on the efficacy of the classification scheme. In this work, we have focused on the features only. We have restricted ourselves further...
The pitch period is an important parameter in speech recognition and speech synthesis, and pitch period detection has long been a focus of audio processing research. The traditional AMDF-based algorithm and its improved version, the LV-AMDF-based algorithm, easily lead to doubling or halving errors in pitch detection. To solve these problems, the characteristics and shortcomings of the AMDF and LV-AMDF functions...
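The AMDF this abstract builds on can be sketched in a few lines: the average magnitude difference at each candidate lag is computed, and the lag with the deepest valley is taken as the pitch period. This is a minimal illustrative sketch, not the paper's algorithm; the lag range and the synthetic test signal are assumptions. Note that multiples of the true period also form valleys, which is exactly the source of the doubling/halving errors the abstract mentions.

```python
import math

def amdf(x, lag):
    """Average magnitude difference of the signal against itself at a given lag."""
    n = len(x) - lag
    return sum(abs(x[i] - x[i + lag]) for i in range(n)) / n

def pitch_period(x, min_lag, max_lag):
    """Return the lag with the deepest AMDF valley in [min_lag, max_lag].
    If the search range spans multiples of the true period, their valleys
    can be picked instead (doubling error)."""
    return min(range(min_lag, max_lag + 1), key=lambda t: amdf(x, t))

# Synthetic 8 kHz sine at 200 Hz -> true pitch period of 40 samples.
sr, f0 = 8000, 200
x = [math.sin(2 * math.pi * f0 * n / sr) for n in range(400)]
```

With the search range restricted to a single octave, the detector recovers the 40-sample period of the test tone.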
In this paper, the spectral characteristics of uninterpolated and interpolated signals are analyzed, and a new audio spectral measure, band-partitioning spectral smoothness (BPSS), is proposed. For interpolated signals, the spectral smoothness in the high-frequency band produced by interpolation is much smaller than in the other frequency bands. The signal spectrum is then partitioned into several...
This paper presents a model-free and training-free two-phase method for audio segmentation that separates monophonic heterogeneous audio files into acoustically homogeneous regions where each region contains a single sound. A rough segmentation separates audio input into audio clips based on silence detection in the time domain. Then a self-similarity matrix, based on selected audio features in the...
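The second phase described above rests on a self-similarity matrix over per-clip feature vectors. A minimal sketch of that construction, assuming cosine similarity and treating the vectors below as stand-ins for real audio features (the abstract does not state which similarity measure the paper uses):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def self_similarity(frames):
    """S[i][j] = similarity of frame i and frame j. Acoustically homogeneous
    regions appear as high-similarity blocks along the diagonal."""
    return [[cosine(fi, fj) for fj in frames] for fi in frames]
```

Segment boundaries can then be read off where the diagonal blocks end, i.e. where similarity between adjacent frames drops.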
Audio is a useful complementary modality to video for healthcare monitoring. In this paper, we investigate the use of hierarchical hidden Markov models (HHMMs) for healthcare audio event classification. We show that HHMMs can handle audio events with recursive patterns to improve classification performance. We also propose a model fusion method to cover the large variations often present in healthcare...
The human voice is primarily a carrier of speech, but it also contains non-linguistic features unique to a speaker and indicative of various speaker demographics, e.g. gender, nativity, ethnicity. Such characteristics are helpful cues for audio/video search and retrieval. In this paper, we evaluate the effects of various low-, mid-, and high-level features for effective classification of speaker characteristics...
Provisioning mobile audio and video services is a difficult challenge, since bandwidth and processing resources are limited in the mobile environment. Audio content is present in most multimedia services; however, user expectations of perceived audio quality differ for speech and non-speech content. Therefore, automatic voice or speech detection is needed in order to maximize perceived...
This paper presents a novel feature group for on-line speech/music segmentation in the broadcast news domain. The features are based on the variance of mel-frequency cepstral coefficients (MFCCV). The idea behind the feature-group construction is the energy variation in a narrow frequency sub-band, which is greater for speech than for music. For feature discrimination and segmentation ability evaluation...
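The MFCCV idea can be illustrated as a sliding-window variance over a single cepstral-coefficient track. MFCC extraction itself is assumed to have been done already; the sequence passed in below is a hypothetical stand-in for one coefficient track, and the window length is an assumption, not the paper's setting.

```python
def variance(xs):
    """Population variance of a sequence."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mfccv(coeff_track, win):
    """Sliding-window variance of one MFCC coefficient; per the abstract,
    this variance tends to be larger for speech than for music."""
    return [variance(coeff_track[i:i + win])
            for i in range(len(coeff_track) - win + 1)]
```

Thresholding or classifying these variance values frame by frame then yields an on-line speech/music decision.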