The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Omnidirectional, also referred to as 360º, visual content provides an immersive experience since it allows users to view a visual scene from different directions. The overall content typically covers a full sphere, and omnidirectional videos or images are processed to obtain a projection on a 2D plane of a fraction of the sphere (aka viewport), which is shown to the user. Therefore, users can look...
Since news videos are valuable sources of multimedia information on real-world events, there is a demand for viewing them efficiently. However, there is a problem that summarization methods based on auditory contents do not take into account the visual contents. In the case of news videos, due to its presentation style where audio contents and visual contents do not necessarily come from the same...
Multimodal sentiment analysis involves identifying sentiment in videos and is a developing field of research. Unlike current works, which model utterances individually, we propose a recurrent model that is able to capture contextual information among utterances. In this paper, we also introduce attentionbased networks for improving both context learning and dynamic feature fusion. Our model shows...
[Background]: Developing conceptual models is an integral part of the requirements engineering (RE) process. Goal models are requirements engineering conceptual models that allow diagrammatic representation of stakeholder intentions and how they affect each other. A specific goal modeling language construct, the contribution of goal satisfaction of one goal to another, plays a central role in supporting...
Convolutional neural network (CNN) based trackers have achieved significant performances in tracking recently. Most existing CNN-based trackers regard tracking as a classification or similarity searching problem. The two methods have their respective superiorities and limitations because of different supervised objectives. In this paper, we propose a multi-task CNN for visual tracking, not only fully...
The popularly used subjective estimator- mean opinion score (MOS) is often biased by the testing environment, viewers mode, domain expertise, and many other factors that may actively influence on actual assessment. We therefore, devise a no- reference subjective quality assessment metric by exploiting the nature of human eye browsing on videos. The participants' eye-tracker recorded gaze-data indicate...
We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that require excessively high-level reasoning or mostly contain instances accessible through single frame inferences...
Many instructors consider a programming environment replete with a large variety of interactive objects and commands a liability for teaching introductory programming. In fact, such an environment is an important pedagogical tool, whose instructional capabilities can be amplified by discovery-based programming praxes — predefined programs with embedded comments that instruct students to browse and...
Mid-level representations are used to map sets of local features into one global representation for a given media descriptor. In visual pattern recognition tasks, Bag-of-Words (BoW) is one popular strategy, among many methods available in literature, due mainly by the simplicity in concept and implementation. Despite the overall good results achieved by BoW in many tasks, the method is unstable in...
Learning visual representations with self-supervised learning has become popular in computer vision. The idea is to design auxiliary tasks where labels are free to obtain. Most of these tasks end up providing data to learn specific kinds of invariance useful for recognition. In this paper, we propose to exploit different self-supervised approaches to learn representations invariant to (i) inter-instance...
We propose ‘Hide-and-Seek’, a weakly-supervised framework that aims to improve object localization in images and action localization in videos. Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to seek...
In this paper, we address the problem of spatio-temporal person retrieval from videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) which encloses the person described by the query. For this problem, we introduce a novel dataset consisting of videos containing people annotated with bounding boxes for each second and with five natural language descriptions...
Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine of such competency, however, is very challenging, due to the great difficulty in capturing large-scale, clean data of objects with both their appearance and the sound they make. In this paper, we present a novel, open-source pipeline that generates audiovisual data, purely from 3D object shapes and their...
This article shares the obtained results in the teacher training phase for the analysis, development and publication of accessible courses making use of the learning management platform: ATutor. This training phase is carried out within the framework of a research project: “Didactic and technological development in teaching scenarios for the training of teachers who welcome diversity: factors for...
This paper focuses on temporal localization of actions in untrimmed videos. Existing methods typically train classifiers for a pre-defined list of actions and apply them in a sliding window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects; it is difficult to design a proper activity list that meets users’ needs. We propose to localize activities...
In this paper, we develop deep spatio-temporal neural networks to sequentially count vehicles from low quality videos captured by city cameras (citycams). Citycam videos have low resolution, low frame rate, high occlusion and large perspective, making most existing methods lose their efficacy. To overcome limitations of existing methods and incorporate the temporal information of traffic video, we...
This paper addresses the problem of joint detection and recounting of abnormal events in videos. Recounting of abnormal events, i.e., explaining why they are judged to be abnormal, is an unexplored but critical task in video surveillance, because it helps human observers quickly judge if they are false alarms or not. To describe the events in the human-understandable form for event recounting, learning...
Detecting logo frequency and duration in sports videos provides sponsors an effective way to evaluate their advertising efforts. However, general-purposed object detection methods cannot address all the challenges in sports videos. In this paper, we propose a mutual-enhanced approach that can improve the detection of a logo through the information obtained from other simultaneously occurred logos...
Anticipating human intention by observing one’s actions has many applications. For instance, picking up a cellphone, then a charger (actions) implies that one wants to charge the cellphone (intention) (Fig. 1). By anticipating the intention, an intelligent system can guide the user to the closest power outlet. We propose an on-wrist motion triggered sensing system for anticipating daily intentions,...
Object-to-camera motion produces a variety of apparent motion patterns that significantly affect performance of short-term visual trackers. Despite being crucial for designing robust trackers, their influence is poorly explored in standard benchmarks due to weakly defined, biased and overlapping attribute annotations. In this paper we propose to go beyond pre-recorded benchmarks with post-hoc annotations...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.