Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection...
With the advent of the big data era, the trend of unsustainable power consumption and large memory bandwidth demands in massively parallel multicore systems has brought about alternative computation paradigms that utilize heterogeneity, specialization, processor-in-memory and approximation. Approximate Computing is being touted as a viable solution for high performance computation by relaxing...
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD! that uses both asynchronous execution and low-precision...
This paper presents a novel radio-resource scheduler with a hardware accelerator for coordinated scheduling in 5G ultra-high-density distributed antenna systems. In 5G mobile communications systems, the transmission weight and the overall system throughputs for a huge number of possible combinations of antennas and user equipment have to be computed. To accelerate the scheduling, the new scheduler...
This paper presents the design and prototyping of hardware and software to address the problem of rapid and reliable 3D digitization of very large collections of pinned insects. Using the collection at the Field Museum of Natural History (FMNH) as a use case, a pipeline to ingest the entire collection of 4.5 million specimens in roughly 1-2 years imposes a limit of a few seconds on average processing time...
In this paper, we propose two different hardware structures for the SHA-3 hash algorithm, targeting different circuit interface widths. Both support the four SHA-3 functions SHA3-224/256/384/512. The padding unit of our design is also implemented in hardware rather than software. In addition, a 3-round-in-1 structure is proposed to increase the throughput of our circuit. We conduct an implementation...
In today's high performance computing (HPC) environments, analyzing and predicting the performance of multiple-processor systems (cluster cores) on critical workloads remains a challenge. This is a result of the key metrics that influence the system's behavior. Bursty arrivals in HPC systems demand either a shared-memory parallel architecture or a pipelined dataflow architecture. At present, a processor model...
High Performance Computing systems offer excellent metrics for speed and efficiency when using bare metal hardware, a high speed interconnect, and parallel applications. In contrast, cloud computing has provided management and implementation flexibility at the cost of performance. We therefore suggest two approaches to make HPC resources available in a dynamically reconfigurable hybrid HPC/Cloud architecture...
The Keyed-Hash Message Authentication Code (HMAC) is a useful mechanism for message authentication. In this paper, a high-performance HMAC/SHA-3 processor that can generate both HMAC message digests and hash message digests is presented. In addition to the standard message digest lengths (224, 256, 384, 512), a 64-bit message digest can also be generated. Due to the application of new generation...
Key-value stores (e.g., Memcached) and web servers (e.g., NGINX) are widely used by cloud providers. As interactive services, they have strict service-level objectives, with typical 99th-percentile tail latencies on the order of a few milliseconds. Unlike average latency, tail latency is more sensitive to changes in usage load and traffic patterns, system configurations, and resource availability...
Current networks are changing very fast. Network administrators need more flexible and powerful tools to support new protocols and services quickly. The P4 language provides a new level of abstraction for flexible packet processing. Therefore, we have designed a new architecture for memory-efficient mapping of P4 match/action tables to FPGAs. The architecture is based on the DCFL algorithm and...
Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform...
The HPC interconnect is a crucial component of any HPC machine. Interconnect performance is one of the contributing factors to the overall performance of an HPC system. The most popular interface for connecting a Network Interface Card (NIC) to the CPU is PCI Express (PCIe). With denser core counts in compute servers and increasingly maturing fabric interconnect speeds, there is a need to maximize the packet data movement...
Lightweight block ciphers are an important topic of research in the context of the Internet of Things (IoT). Current cryptographic contests and standardization efforts seek to benchmark lightweight ciphers in both hardware and software. Although there have been several benchmarking studies of both hardware and software implementations of lightweight ciphers, direct comparison of hardware and software...
Thanks to their excellent performance on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNN)...
A model of a Turbo-Product Code decoder architecture and a method for constructing such a decoder are proposed in this paper. The model describes the decoder's operation, taking into account the limitations of the hardware platform, and proposes re-use of components in the decoding process. The method provides a set of steps for decoder implementation. Field-Programmable Gate Array circuits are selected...
A deep learning processor with 8 gated recurrent neural network (RNN) accelerators is proposed in this paper. It features on-chip incremental learning by numerical and local gradient computation enhancement. Extra training precision is obtained without extending the bit-width. Tri-mode weight access (DMA/FIFO/RAM) improves the throughput during incremental learning. The number of multipliers and activation...
The degree to which Turbo-Code decoder architectures can be parallelized is constrained by requirements for flexibility with respect to code block sizes and code rates. At the same time throughput requirements are expected to increase by a factor of up to 20x for 5G networks, which are currently undergoing standardization. The limiting factors for the throughput of a Turbo-Code decoder are maximum...
Reducing the configuration time of portions of an FPGA at run time is crucial in contemporary FPGA-based accelerators. In this work, we propose a method to increase the throughput for FPGA dynamic partial reconfiguration by using standard IP blocks. The throughput is increased by over-clocking the configuration bitstream circuitry beyond the limits stated in the specifications of these standard blocks...
Laser triangulation applications are commonly used for industrial quality control. Such algorithms require real-time systems, often consisting of a computing unit connected to the image sensor through a short, fast link. Choosing a camera with an integrated Field Programmable Gate Array (FPGA) as the computing unit can provide highly pipelined and parallel computing suited to processing images in real time. Moreover,...