The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and the computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network...
CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on FPGA. The above...
This paper presents the design of a Coarse-grained Reconfigurable Architecture (CGRA), called MUSRA (Multimedia Specific Reconfigurable Architecture). The MUSRA is proposed to exploit multi-level parallelism of the computation-intensive loops in multimedia processing applications. To solve the huge bandwidth requirement of parallel processing arrays, the proposed architecture focuses on the exploitation...
Stereo matching systems that generate dense, accurate, robust and real-time disparity maps are quite attractive for a variety of applications. Most of the existing stereo matching systems that fulfill to all of these requirements adopt the Semi-Global Matching (SOM) technique. This work proposes a scalable architecture based on a systolic array, fully pipeline. The design builds on a combination of...
Modern multi-core systems employ shared memory architecture, entailing problems related to the main memory such as row-buffer conflicts, time-varying hot-spots across memory channels, and superfluous switches between reads and writes originating from different cores. There have been proposals to solve these problems by partitioning main memory across banks and/or channels such that a DRAM bank is...
Traditional processor design approaches using CISC and RISC philosophies suffer from low performance. One of alternative approaches to improve system performance is instruction level parallelism (ILP). Among the processor architectures supporting ILP, very long instruction word (VLIW) processors offer some advantages such as low power consumption and hardware complexity. In this paper, we introduce...
Large number addition is the fundamental operation in cryptography algorithms. In this paper, we accelerate large addition in hardware design by introducing non-least-positive form, which is beneficial to parallel processing. An implementation of 256-bit signed array accumulator with our method shows an improvement of 18% in speed and 15% in area-delay product compared with traditional design.
Current multi-user detection scheme for sparse code multiple access (SCMA) is iterative message passing algorithm (MPA) in which the message update strategy is in a parallel manner. To take full advantage of MPA's feature of parallelism, this letter proposes a hardware implementation strategy of max-log MPA decoder used in SCMA systems with soft baseband, which is based on general-purpose computing...
Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform...
Nowadays, Graphics Processing Unit (GPU) is essential for general-purpose high-performance computing, because of its dominant performance in parallel computing compare to that of CPU. There have been many successful trials on the use of GPU in virtualized environment. Especially, NVIDIA Docker obtained a most practical way to bring GPU into the container-based virtualized environment. However, most...
Thanks to their excellent performances on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNN)...
With the help of parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are typically required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. In this work, we propose the...
Methods, algorithms and structures of neural networks were analyzed. Basic components of neural networks were defined and the principles of their development were chosen. It was shown that use of vertical-parallel method for implementation of work algorithms of neural networks basic components provides increased performance, reduce hardware costs and efficient VLSI implementation. Parallel-consequent...
Sophisticated embedded systems are increasingly used in defence, aerospace and avionic industries. They are responsible for control, collision avoidance, pilot assistance, target tracking, navigation and communications, amongst other functions. In this industrial field, High Performance Embedded Computing (HPEC) applications are becoming highly sophisticated and resource consuming for three reasons...
We present ongoing work on a platform for mobile health and implantable telemetry devices with powerful point-of-contact processing capabilities based on our VivoSoC multi-sensor medical instrumentation SoC, a custom power management IC, and only a few additional components - allowing the realisation of sub-ccm devices. We detail the powerful yet efficient acquisition and parallel processing capabilities...
High performance computing applications are far more difficult to write, therefore, practitioners expect a well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architecture. Therefore, it is required to have a...
Binary Edwards Curves (BEC) constitute an alternative to the standardized Weierstrass elliptic curve (EC) equations since the latter have intrinsic side channel attack vulnerabilities due to their lack of point operation uniformity. Thus, BECs have gained popularity over the past few years due to their uniformity, operation regularity, completeness and implementation attractiveness. However, BEC Scalar...
In this paper the hardware designed for the SKA Low (Square Kilometre Array) correlator and beamformer (CBF) is discussed. SKA-Low is a low frequency aperture array (LFAA) to be located in remote Western Australia. The array is collecting radio signals in the frequency range from 50 to 350 MHz. The large number of dual polarization antennas (131072) are distributed over a total of 512 stations with...
Non-blocking caches, which are commonly utilized in modern out-of-order processors, could handle multiple outstanding memory requests simultaneously to reduce the penalties of long latency cache misses. Memory level parallelism (MLP), which refers to the number of memory requests concurrently held by Miss Status Handling Registers (MSHRs), is an indispensable factor to estimate cache performance....
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.