Owing to its low standby power and high scalability, ReRAM technology is considered a promising replacement for conventional DRAM in future manycore systems. To make ReRAM highly scalable, the memory array has to have a crossbar array structure, which needs a specific access mechanism for activating a row of memory when reading/writing a data block from/to it. This type of...
High-level synthesis (HLS) is well capable of generating control and computation circuits for FPGA accelerators, but still requires considerable human effort to tackle the challenge of memory and communication bottlenecks. One important approach for improving data locality is to apply loop tiling on memory-intensive loops. Loop tiling is a well-known compiler technique that partitions the iteration...
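As a rough illustration of the loop tiling idea named in the abstract above, the sketch below tiles a square matrix multiplication so that each block of work touches only a small sub-range of rows and columns. It is not the paper's HLS flow; the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are assumptions introduced for this example only.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix multiply with plain triple-nested loops."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation, with the iteration space split into tile x tile x tile blocks.

    The three outer loops walk over block origins; the three inner loops
    stay inside one block, so the working set per block is bounded by the
    tile size (hypothetical parameter, chosen by the caller).
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # min() handles the partial blocks when tile does not divide n
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same result; only the order in which iterations are visited differs, which is what makes tiling a locality optimization rather than an algorithmic change.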
Some modern high-level synthesis (HLS) tools [1] permit the synthesis of multi-threaded software into parallel hardware, where concurrent software threads are realized as concurrently operating hardware units. A common performance bottleneck in any parallel implementation (whether it be hardware or software) is memory bandwidth — parallel threads demand concurrent access to memory resulting in contention...
Content Addressable Memories (CAMs) have found widespread use in applications that require high-speed search capabilities. Each cell in the CAM array is associated with a storage unit and comparator logic. Due to the various customized features in CAM implementations, creating an automated BIST solution for testing them has presented unique challenges. This paper shows that, with suitable...
Clustering is a crucial tool for analyzing data in virtually every scientific and engineering discipline. The U.S. National Academy of Sciences (NAS) has recently announced "the seven giants of statistical data analysis" in which data clustering plays a central role [1]. This research also emphasizes that more scalable solutions are required to enable time and space clustering for the future...
For a growing pool of data-intensive applications, data transfer, rather than processing speed, has emerged as the major bottleneck to performance and energy scalability. In this paper, we propose a novel interleaved logic-in-memory architecture, referred to as MISK, which leverages fine-grained integration of logic functions within dense, 2-D static random-access memory (SRAM) arrays for in-situ...
With the coming of the ‘Big Data’ era, energy-efficient databases are in demand for Internet of Things (IoT) application scenarios. The emerging Resistive Random Access Memory (RRAM) has been considered an energy-efficient replacement for DRAM in next-generation main memory. In this paper, we propose an RRAM-based SQL query unit with a process-in-memory characteristic. A storage structure for...
A 4 kb fully differential 8-port SRAM bitcell array (6 read ports and 2 write ports) is presented in this paper. This 8-port SRAM provides simultaneous access, high system throughput and a large read static noise margin by isolating the read ports from the storage nodes. At 0.4 V supply voltage, the designed 8-port SRAM bitcell shows 123, 137 and 123 mV static noise margin during read, write and standby modes,...
Computing machine learning models in the cloud remains a central problem in big data analytics. In this work, we introduce a cloud analytic system exploiting a parallel array DBMS based on a classical shared-nothing architecture. Our approach combines in-DBMS data summarization with mathematical processing in an external program. We study how to summarize a data set in parallel assuming a large number...
Graph algorithms such as breadth-first search (BFS) have been gaining ever-increasing importance in the era of Big Data. However, the memory bandwidth remains the key performance bottleneck for graph processing. To address this problem, we utilize processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve the performance of both computation...
Memories are currently a real bottleneck in designing high-speed and energy-efficient systems-on-chip. A significant increase in the performance gap between processors and memories is observed. On the other hand, an important proportion of total power is spent on memory systems due to the increasing trend of embedding volatile memory into systems-on-chip. For these reasons, STT-MRAM (Spin-Transfer Torque...
In this paper, we architect large-scale SRAM arrays with monolithic 3D (M3D) integration technology. We introduce M3D-based SRAM arrays with three different ways of integration: M3D-R (vertical routing-only), M3D-VBL (vertical bitline), and M3D-VWL (vertical wordline). We also apply M3D-based SRAM arrays to last-level caches: tag arrays for eDRAM LLCs and data arrays for SRAM LLCs. The proposed LLCs...
An energy-efficient and high-speed stereo matching processor is proposed for smart mobile devices, using the proposed stereo SRAM (S-SRAM) and independent regional integral cost (IRIC). The cost generation unit (CGU) with the proposed S-SRAM reduces CGU power consumption by 63.2%. The proposed IRIC enables the cost aggregation unit (CAU) to achieve a 6.4× speedup and a 12.3% reduction in CAU power with pipelined...
Latent Dirichlet allocation (LDA) based topic inference is a data classification method that is used efficiently for extremely large data sets. However, the processing time is very long due to the serial computational behavior of the Markov Chain Monte Carlo method used for the topic inference. We propose a pipelined hardware architecture and memory allocation scheme to accelerate LDA using parallel...
The power consumed by the memory system in GPUs is a significant fraction of the total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy...
Processing In Memory (PIM), the concept of integrating processing directly with memory, has been attracting a lot of attention because it can help overcome the throughput limitation caused by data movement between CPU and memory. The challenge, however, is that it requires programmers to have a deep understanding of the PIM architecture to maximize benefits such as data locality and...
The purpose of this study is to quantitatively assess the performance of graph processing algorithms for large scale-free graphs residing in byte-addressable Non-Volatile Memory (NVM). Our study focuses on static and dynamic graph algorithms previously optimized for external memory in the form of locally attached NAND Flash arrays, with data structures tuned to maximize locality. The evaluation is...
Breadth-first search is a building block of many graph algorithms. Because BFS is memory-bound, parallelizing BFS on a multi-core computer must consider issues of data hazards, effects of atomic operations on memory throughput, and the size of the last level cache. Additionally, graph algorithms must cope with non-sequential memory access, which defeats cache prefetching and leads to a high cache...
Spin-Transfer Torque RAM (STT-RAM) is non-volatile and has a higher density than SRAM, and is expected to be used as the last-level cache (LLC) of a microprocessor. One technical issue is that, since the energy cost of write access requests for an STT-RAM LLC is high, the total energy consumption of the STT-RAM LLC may increase for some write-intensive applications. Therefore, this paper proposes...
Non-volatile Memories (NVMs), such as PCM and ReRAM, have been widely proposed for future main memory design because of their low standby power, high storage density, and fast access speed. However, these NVMs suffer from the write endurance problem. In order to prevent a malicious program from wearing out NVMs deliberately, researchers have proposed various wear-leveling methods, which remap logical...