Search results

chapter

Scalpel: Customizing DNN pruning to the underlying hardware parallelism

Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, et al.

2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) > 548 - 560

As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network...

chapter

Aggressive pipelining of irregular applications on reconfigurable hardware

Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, et al.

2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) > 575 - 586

CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on FPGA. The above...

chapter

RTL design of a dynamically reconfigurable cell array for multimedia processing

Hung K. Nguyen, Minh T. Phan

2017 4th NAFOSTED Conference on Information and Computer Science > 189 - 194

This paper presents the design of a Coarse-grained Reconfigurable Architecture (CGRA), called MUSRA (Multimedia Specific Reconfigurable Architecture). The MUSRA is proposed to exploit multi-level parallelism of the computation-intensive loops in multimedia processing applications. To solve the huge bandwidth requirement of parallel processing arrays, the proposed architecture focuses on the exploitation...

chapter

Hardware module for low-resource and real-time stereo vision engine using semi-global matching approach

Lucas F. S. Cambuim, Joao P. F. Barbosa, Edna N. S. Barros

2017 30th Symposium on Integrated Circuits and Systems Design (SBCCI) > 53 - 58

Stereo matching systems that generate dense, accurate, robust and real-time disparity maps are quite attractive for a variety of applications. Most of the existing stereo matching systems that fulfill all of these requirements adopt the Semi-Global Matching (SGM) technique. This work proposes a scalable, fully pipelined architecture based on a systolic array. The design builds on a combination of...

chapter

Work as a team or individual: Characterizing the system-level impacts of main memory partitioning

Eojin Lee, Jongwook Chung, Daejin Jung, Sukhan Lee, et al.

2017 IEEE International Symposium on Workload Characterization (IISWC) > 156 - 166

Modern multi-core systems employ shared memory architecture, entailing problems related to the main memory such as row-buffer conflicts, time-varying hot-spots across memory channels, and superfluous switches between reads and writes originating from different cores. There have been proposals to solve these problems by partitioning main memory across banks and/or channels such that a DRAM bank is...

chapter

Adaptable VLIW processor: The reconfigurable technology approach

Cuong Pham-Quoc, Binh Kieu-Do-Nguyen, Anh-Vu Dinh-Duc

2017 International Conference on Advanced Technologies for Communications (ATC) > 120 - 125

Traditional processor design approaches using CISC and RISC philosophies suffer from low performance. One alternative approach to improving system performance is instruction-level parallelism (ILP). Among the processor architectures supporting ILP, very long instruction word (VLIW) processors offer some advantages such as low power consumption and low hardware complexity. In this paper, we introduce...

chapter

A trick for parallel accumulation of signed array

Jinnan Ding, Shuguo Li

2017 International Conference on Electron Devices and Solid-State Circuits (EDSSC) > 1 - 2

Large number addition is the fundamental operation in cryptography algorithms. In this paper, we accelerate large addition in hardware design by introducing non-least-positive form, which is beneficial to parallel processing. An implementation of 256-bit signed array accumulator with our method shows an improvement of 18% in speed and 15% in area-delay product compared with traditional design.

chapter

Parallel-implemented message passing algorithm for SCMA decoder based on GPGPU

Yunfeng Qi, Gang Wu, Su Hu, Yuan Gao

2017 9th International Conference on Wireless Communications and Signal Processing (WCSP) > 1 - 6

The current multi-user detection scheme for sparse code multiple access (SCMA) is the iterative message passing algorithm (MPA), in which messages are updated in parallel. To take full advantage of the MPA's parallelism, this letter proposes a hardware implementation strategy for the max-log MPA decoder used in SCMA systems with soft baseband, which is based on general-purpose computing...

chapter

A generic high throughput architecture for stream processing

Christes Rousopoulos, Ektoras Karandeinos, Grigorios Chrysos, Apostolos Dollas, et al.

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 5

Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform...

chapter

ConVGPU: GPU Management Middleware in Container Based Virtualized Environment

Daeyoun Kang, Tae Joon Jun, Dohyeun Kim, Jaewook Kim, et al.

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 301 - 309

Nowadays, the Graphics Processing Unit (GPU) is essential for general-purpose high-performance computing because of its dominant performance in parallel computing compared to that of the CPU. There have been many successful trials of GPU use in virtualized environments. In particular, NVIDIA Docker offers the most practical way to bring the GPU into the container-based virtualized environment. However, most...

chapter

Scalable high-performance architecture for convolutional ternary neural networks on FPGA

Adrien Prost-Boucle, Alban Bourge, Frederic Petrot, Hande Alemdar, et al.

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 7

Thanks to their excellent performance on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNN)...

chapter

PolyPC: Polymorphic parallel computing framework on embedded reconfigurable system

Hongyuan Ding, Miaoqing Huang

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 8

With the help of parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are typically required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. In this work, we propose the...

chapter

Basic vertical-parallel real time neural network components

Ivan Tsmots, Oleksa Skorokhoda, Ihor Ignatyev, Vasyl Rabyk

2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT) > 1 > 344 - 347

Methods, algorithms and structures of neural networks were analyzed. Basic components of neural networks were defined and the principles of their development were chosen. It was shown that using the vertical-parallel method to implement the working algorithms of basic neural network components provides increased performance, reduced hardware costs and efficient VLSI implementation. Parallel-consequent...

chapter

FPGA-Centric High Performance Embedded Computing: Challenges and Trends

Rabie Ben Atitallah, Karim M. A. Ali

2017 Euromicro Conference on Digital System Design (DSD) > 390 - 395

Sophisticated embedded systems are increasingly used in defence, aerospace and avionic industries. They are responsible for control, collision avoidance, pilot assistance, target tracking, navigation and communications, amongst other functions. In this industrial field, High Performance Embedded Computing (HPEC) applications are becoming highly sophisticated and resource consuming for three reasons...

chapter

Towards a Mobile Health Platform with Parallel Processing and Multi-sensor Capabilities

Florian Glaser, Philipp Schonle, Pascale Meier, Jonathan Bosser, et al.

2017 Euromicro Conference on Digital System Design (DSD) > 462 - 469

We present ongoing work on a platform for mobile health and implantable telemetry devices with powerful point-of-contact processing capabilities based on our VivoSoC multi-sensor medical instrumentation SoC, a custom power management IC, and only a few additional components - allowing the realisation of sub-ccm devices. We detail the powerful yet efficient acquisition and parallel processing capabilities...

chapter

An Empirical Evaluation of Design Abstraction and Performance of Thrust Framework

Ajai V. George, Sankar Manoj, Sanket Rajan Gupte, Santonu Sarkar

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 233 - 242

High performance computing applications are far more difficult to write; therefore, practitioners expect well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architectures. Therefore, it is required to have a...

chapter

A Design Strategy for Digit Serial Multiplier Based Binary Edwards Curve Scalar Multiplier Architectures

Apostolos P. Fournaris, Charalambos Dimopoulos, Odysseas Koufopavlou

2017 Euromicro Conference on Digital System Design (DSD) > 221 - 228

Binary Edwards Curves (BEC) constitute an alternative to the standardized Weierstrass elliptic curve (EC) equations since the latter have intrinsic side channel attack vulnerabilities due to their lack of point operation uniformity. Thus, BECs have gained popularity over the past few years due to their uniformity, operation regularity, completeness and implementation attractiveness. However, BEC Scalar...

chapter

Gemini FPGA hardware platform for the SKA low correlator and beamformer

E. Kooistra, G. A. Hampson, A. W. Gunst, J. D. Bunton, et al.

2017 XXXIInd General Assembly and Scientific Symposium of the International Union of Radio Science (URSI GASS) > 1 - 4

In this paper, the hardware designed for the SKA Low (Square Kilometre Array) correlator and beamformer (CBF) is discussed. SKA-Low is a low frequency aperture array (LFAA) to be located in remote Western Australia. The array collects radio signals in the frequency range from 50 to 350 MHz. The large number of dual polarization antennas (131072) are distributed over a total of 512 stations with...

chapter

A mechanistic model of memory level parallelism fed with cache miss rates

Qin Wang, Kecheng Ji, Ming Ling, Longxing Shi

2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) > 1 - 6

Non-blocking caches, which are commonly utilized in modern out-of-order processors, can handle multiple outstanding memory requests simultaneously to reduce the penalties of long latency cache misses. Memory level parallelism (MLP), which refers to the number of memory requests concurrently held by Miss Status Handling Registers (MSHRs), is an indispensable factor in estimating cache performance....

chapter

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Lijuan Jiang, Chao Yang, Yulong Ao, Wanwang Yin, et al.

2017 46th International Conference on Parallel Processing (ICPP) > 422 - 431

The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general...

INFONA - science communication portal
