As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network...
Convolutional Neural Networks are being studied to provide features such as real-time image recognition. One of the key operations to support hardware implementations of this type of network is multiplication. Despite the high number of operations required by Convolutional Neural Networks, they have become feasible in recent years due to the high availability of computing power, present on devices such as...
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on GPUs. However, most of the runtime systems that execute those applications often fail to fully utilize the parallelism of modern heterogeneous systems. In this paper,...
High-precision structural results in the shortest possible time are expected of methods for solving finite element method (FEM) problems. Computing the stiffness matrix assembled from isoparametric elements and solving the assembled stiffness matrix are the most time-consuming steps. In the previous serial algorithms, there is always a time limitation for some applications and it is hard to achieve...
Scientific simulations typically store only a small fraction of computed timesteps due to storage and I/O bandwidth limitations. Previous work has demonstrated the compressibility of floating-point volume data, but such compression often comes with a tradeoff between computational complexity and the achievable compression ratio. This work demonstrates the use of special-purpose video encoding hardware...
Modern computer systems are accelerator-rich, equipped with many types of hardware accelerators to speed up computation. For example, graphics processing units (GPUs) are a type of accelerator widely employed to accelerate parallel workloads. To make good use of different accelerators, either to achieve better speedup or to reduce total energy consumption, many scheduling algorithms...
More than 64% of schools in Apurimac, Peru, are located in rural areas; unlike schools in urban areas, they lack an Internet connection and hence have no access to virtual learning and cannot use educational resources. This paper describes the design, development and testing of the Mini Learning Management System for schools without an Internet connection to improve the availability...
The current multi-user detection scheme for sparse code multiple access (SCMA) is the iterative message passing algorithm (MPA), in which messages are updated in parallel. To take full advantage of the MPA's parallelism, this letter proposes a hardware implementation strategy for the max-log MPA decoder used in SCMA systems with soft baseband, which is based on general-purpose computing...
Traditionally GPUs focused on streaming, data-parallel applications, with little data reuse or sharing and coarse-grained synchronization. However, the rise of general-purpose GPU (GPGPU) computing has made GPUs desirable for applications with more general sharing patterns and fine-grained synchronization, especially for recent GPUs that have a unified address space and coherent caches. Prior work...
In this paper, we present the first multilevel implementation of the Harris-Stephens corner detector and the ORB feature extractor running on FPGA hardware, for computer vision and robotics applications. ORB is a fundamental component of many robotics applications, and requires significant computation. The design has been validated both in behavioural simulation and in implementation on an Arria V...
Nowadays, the Graphics Processing Unit (GPU) is essential for general-purpose high-performance computing because of its dominant performance in parallel computing compared to that of the CPU. There have been many successful trials of GPU use in virtualized environments. In particular, NVIDIA Docker provides a highly practical way to bring the GPU into container-based virtualized environments. However, most...
Our earlier work on support vector machines (SVM) and ultrasonic flaw detection algorithms demonstrated i) highly accurate classifier performance and ii) the feasibility of the algorithm for real-time implementation on low-cost embedded systems with graphics processing units (GPUs) and CUDA library (a parallel computing platform and programming model) support. This work extends the implementation...
The performance of commodity video-gaming embedded devices (consoles, graphics cards, tablets, etc.) has been advancing at a rapid pace owing to strong consumer demand and stiff market competition. Gaming devices are currently amongst the most powerful and cost-effective computational technologies available in quantity. In this article, we evaluate a sample of current generation video-gaming devices...
The numerical treatment of variational problems gives rise to large sparse matrices, which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by numerical solvers, the assembly step can be a potential bottleneck, especially in implicit and time dependent settings where considerable updates are needed. On standard HPC platforms, this process can...
We have designed and implemented an ultrasonic imaging research platform that performs all signal processing, including beamforming and image processing, using software on a GPU. Operating software has been developed on a PC to control the RF data acquisition hardware and accommodate ultrasound images of various formats. Beamforming methods that include conventional scan line-based imaging, scan line-based...
We have designed and implemented an ultrasonic imaging research platform that performs all signal processing, including beamforming, using software on a GPU. A software-based approach on the GPU is expected to reduce hardware complexity and to offer flexibility and rapid implementation even if the requirements for ultrasound imaging applications change in the future. An operating...
With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU architecture to offer FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing...
General-purpose workloads running on modern graphics processing units (GPGPUs) rely on hardware-based barriers to synchronize warps within a thread block (TB). However, imbalance may exist before reaching a barrier if a GPGPU workload contains irregular memory accesses, i.e., some warps may be critical while others may not. Ideally, cache space should be reserved for the critical warps. Unfortunately,...
Accurate, real-time Automatic Speech Recognition (ASR) comes at a high energy cost, so accuracy often has to be sacrificed in order to fit the strict power constraints of mobile systems. However, accuracy is extremely important for the end-user, and today's systems are still unsatisfactory for many applications. The most critical component of an ASR system is the acoustic scoring, as it has a large...
The work-queue is an effective approach for mapping irregular-parallel workloads to GPGPUs. It can improve the utilization of SIMD units by processing only useful work that is dynamically generated during execution. As current GPGPUs lack the necessary support for work-queues, a software-based work-queue implementation often suffers from memory contention and load balancing issues. We present a novel...