In the first part of this study, basic concepts of forensic phonetics such as voice, speech, and the vocal tract are explained. In the second part, visual and auditory montage detection methods used in forensic phonetics, a subfield of digital forensics, are examined. The most frequently used visual and auditory analysis methods were identified through a review of the literature. Then...
Due to the continuing growth of the web, the Internet is now broadly used to fulfill diverse information needs. Sometimes, more precise information related to specific domains such as healthcare, information that satisfies the user's need, is not available on the Internet. There is a specific category of users, such as doctors, who are really interested in videos related to disease diagnosis...
Gaze analysis in dynamic environments has remained an unresolved problem due to the complexities that pertain to the detection and tracking of objects in the visual environment. This study provides a solution to the problem for face-to-face communication, in which the visual objects in the environment are faces. The application that has been developed for this purpose is able to detect and track faces...
One way to visualize an intercultural dialogue is to plot keywords jointly used by the intercultural speakers to see how the keywords are positioned relative to each other, with the position of the keywords signifying some kind of similarity relationship. We processed a Japanese transcription of a Korean-Japanese dialogue using Word2Vec and the t-SNE algorithm to generate various 2D plots of the noun words...
The use of surveillance cameras as a monitoring tool for home environments, the elderly, and children has become a common practice. However, people with visual impairments have difficulty using this kind of device because it relies only on visual information. To address this problem, this work proposes a solution that combines deep learning techniques for object recognition in the video...
Depression is a cognitive impairment, which according to the World Health Organisation is the leading cause of disability worldwide. One key trait of depression is psychomotor retardation, which adversely affects both emotional and physical behaviour of an individual. In this paper we perform experiments on the Audio Visual Emotion recognition Challenge 2016 — Depression Classification sub-Challenge...
In this paper, the problem of age estimation is addressed based on two modalities: speech utterances and speakers' face images. The proposed age estimation framework employs the Shifted Covariates REgression Analysis for Multi-way data (SCREAM) model, which combines Parallel Factor Analysis 2 and Principal Covariates Regression. SCREAM is able to extract a few latent variables from multi-way data...
This paper describes the techniques used in the submitted video presenting an interaction scenario, realised using the Neuro-Inspired Companion (NICO) robot. NICO engages the users in a personalised conversation where the robot always tracks the users' face, remembers them and interacts with them using natural language. NICO can also learn to perform tasks such as remembering and recalling objects...
Every language has different characteristics, one of which is how the language is pronounced. Pronunciation accompanied by emotional expression makes these characteristics even more distinctive. This research proposes the establishment of a natural Indonesian viseme order influenced by the expression of emotion. This system converts the text input of an Indonesian sentence into a sequence of Indonesian...
Mass casualty events caused by a biological weapon require fully capable first response teams. However, human first responders are equipped with protective gear, which limits their capabilities to complete tasks. Robots can be employed to work collaboratively with the first responders in order to augment the human's reduced abilities. The robot needs to understand and adapt to the human's workload...
Modern students have been brought up on literature in the “fantasy” style and on computer games, so reading scientific literature and watching science films seems boring to them. To solve the problem of increasing students' interest in learning, we offer a “Quest”-style MOOC.
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to...
Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the gutters between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called closure. While computers can now describe the content of natural...
Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature...
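A hedged sketch of the regression step described above, using scikit-learn's random forest; the synthetic arrays below are stand-ins for the paper's actual speech features and face-model parameters, whose dimensions and composition are assumptions here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: 500 frames of per-frame speech features
# (e.g. phonemic-label encodings plus acoustic features) mapped to
# 10 face-model parameters (e.g. blendshape weights).
X = rng.normal(size=(500, 20))                 # speech feature vectors
W = rng.normal(size=(20, 10))
Y = X @ W + 0.1 * rng.normal(size=(500, 10))   # synthetic target parameters

# Fit one forest for the multi-output regression from features to parameters.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)

# At animation time, each incoming speech-feature frame is mapped to
# a full set of face-model parameters.
params = forest.predict(X[:1])
print(params.shape)
```

Per-frame predictions like this are typically smoothed over time before driving the face model, since independent frame-wise regression can jitter.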
The selection of adequate job candidates is a very long and challenging process for every employer. The system presented in this paper aims to decrease the time needed for candidate selection at the pre-employment stage using automatic personality screening based on visual, audio, and lexical cues from short video clips. The system is built to predict candidates' scores on the Big Five personality traits and to...
This paper is part of a larger effort to detect manipulations of video by searching for and combining the evidence of multiple types of inconsistencies between the audio and visual channels. Here, we focus on inconsistencies between the type of scenes detected in the audio and visual modalities (e.g., audio indoor, small room versus visual outdoor, urban), and inconsistencies in speaker identity tracking...
Touchscreen assistive technology is designed to support speech interaction between visually disabled people and mobile devices, allowing the use of a choreography of gestures to interact with a touch user interface. This paper presents an evaluation of VoiceOver, the screen reader in Apple Inc. products, conducted within the research project Visually impaired users touching the screen: A user evaluation of...
Reliable visual features that encode the articulator movements of speakers can dramatically improve the decoding accuracy of automatic speech recognition systems when combined with the corresponding acoustic signals. In this paper, a novel framework is proposed to utilize audio-visual speech not only during decoding but also for training better acoustic models. In this framework, a multi-stream hidden...
Recognition of human emotions may be crucial in certain applications, such as human-computer interaction, monitoring of the elderly, or understanding the affective state of learners during a course. To this end, and depending on the application and the environment, one may use physiological parameters (e.g., heart rate, brain activity), which are typically obtrusive, or analyze other modalities...
We propose a novel method for developing a static storyboard for video clips included with biomedical research literature. The technique uses both the visual and audio content of the video to select candidate key frames for the storyboard. From the visual channel, the intra-frames are extracted using the FFmpeg tool. The IBM Watson speech-to-text service is used to extract words from the audio channel, from which...
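The intra-frame extraction step can be sketched with FFmpeg's standard `select` filter, which keeps only frames whose picture type is I. This is a minimal sketch built around the documented filter syntax; the file names are placeholders, not the paper's actual data:

```python
import shlex

def iframe_cmd(video: str, out_pattern: str = "iframe_%03d.png") -> list[str]:
    """Build an ffmpeg command that extracts only intra-coded (I) frames.

    select=eq(pict_type\\,I) keeps I-frames; -vsync vfr drops the
    timestamps of discarded frames so outputs are not duplicated.
    """
    return [
        "ffmpeg", "-i", video,
        "-vf", "select=eq(pict_type\\,I)",
        "-vsync", "vfr",
        out_pattern,
    ]

cmd = iframe_cmd("paper_video.mp4")
print(shlex.join(cmd))
```

The command list can be passed directly to `subprocess.run(cmd, check=True)` once FFmpeg is installed; the extracted I-frames then serve as the candidate key frames for the storyboard.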