The method of Graph Cuts converts a Maximum a Posteriori (MAP) inference problem on a Markov Random Field (MRF) into a network flow problem, which can be solved efficiently. Many computer vision problems can be conveniently cast as an inference task to find the most likely labels for pixels. The method is widely used, but computationally burdensome. Prior accelerator attempts have failed to exploit the problem's...
For a growing pool of data-intensive applications, data transfer, rather than processing speed, has emerged as the major bottleneck to performance and energy scalability. In this paper, we propose a novel interleaved logic-in-memory architecture, referred to as MISK, which leverages fine-grained integration of logic functions within dense, 2-D static random-access memory (SRAM) arrays for in-situ...
Customizing the precision of data can provide attractive trade-offs between accuracy and hardware resources. Custom hardware and FPGA designs allow bit-level control over precision, but software is typically limited by the range of types supported by the underlying processor. We propose a new form of vector computing aimed at arrays of custom-precision data on general-purpose processors with SIMD...
A hash function hashes a longer message of arbitrary length into a much shorter bit string of fixed length, called a hash. Inevitably, there will be a lot of different messages being hashed to the same or similar hash. We call this a hash collision or a partial hash collision. By utilizing multiple processors from the CUNY High Performance Computing Center's clusters, we can locate partial collisions...
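As a small illustration of the partial collisions described above, the following single-process sketch searches for two distinct messages whose digests agree on their leading bits. The choice of SHA-256, the `message-<counter>` message format, and the names `prefix_bits` and `find_partial_collision` are assumptions made for this example; the abstract does not say which hash function or search strategy the authors use.

```python
import hashlib


def prefix_bits(data: bytes, bits: int) -> int:
    """Return the first `bits` bits (1..256) of the SHA-256 digest as an integer."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_partial_collision(bits: int, max_tries: int = 1_000_000):
    """Search for two distinct messages whose digests share `bits` leading bits.

    Hypothetical helper: remembers every prefix seen so far and stops at the
    first repeat. Returns a pair of messages, or None if max_tries runs out.
    """
    seen = {}  # prefix value -> first message that produced it
    for counter in range(max_tries):
        message = f"message-{counter}".encode()
        key = prefix_bits(message, bits)
        if key in seen:
            return seen[key], message
        seen[key] = message
    return None


if __name__ == "__main__":
    pair = find_partial_collision(16)
    if pair is not None:
        first, second = pair
        print(first, hashlib.sha256(first).hexdigest()[:8])
        print(second, hashlib.sha256(second).hexdigest()[:8])
```

Each additional matching bit roughly multiplies the expected work by the square root of two, which is one reason longer partial collisions call for many processors.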
Convolutional Neural Network (CNN) is a deep learning algorithm extended from Artificial Neural Network (ANN) and widely used for image classification and recognition, thanks to its invariance to distortions. The recent rapid growth of applications based on deep learning algorithms, especially in the context of Big Data analytics, has dramatically improved both industrial and academic research and...
Given n horizontal segments, each associated with a color from [σ], the Categorical Segment Stabbing problem is to find the K distinct colors stabbed by a vertical line. When the end-points of the segments are distinct and lie in [1, 2n], we present a (2 + ε)n log σ + O(n)-bit index with O(K/ε) query time, where ε ∈ (0, 1]. When the end-points are arbitrary real numbers, a standard reduction to the...
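To make the problem statement concrete, here is a brute-force reference that scans every segment for each query. It is not the compact index from the abstract; the `Segment` type, the `stabbed_colors` name, and the treatment of end-points as closed are assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Segment(NamedTuple):
    left: float   # x-coordinate of the left end-point
    right: float  # x-coordinate of the right end-point
    color: int    # color identifier drawn from [sigma]


def stabbed_colors(segments: Iterable[Segment], x: float) -> Set[int]:
    """Return the distinct colors of segments crossed by the vertical line at x.

    A full vertical line crosses every height, so only the horizontal extent
    of each segment matters. End-points are assumed to be closed.
    """
    return {s.color for s in segments if s.left <= x <= s.right}


if __name__ == "__main__":
    segments = [Segment(1, 4, 0), Segment(2, 6, 1), Segment(3, 5, 0), Segment(7, 8, 2)]
    print(stabbed_colors(segments, 3.5))
```

This scan takes O(n) time per query regardless of how many colors are reported, which is the cost an index with O(K/ε) query time is designed to avoid.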
Open64 is an open-source compiler with powerful analysis capabilities that is widely used as a research and commercial development platform. However, it was not designed to perform MPI parallelization. This paper makes several contributions. First, the Open64 compiler infrastructure is presented. Second, the location of MPI code generation in the Open64 compiler architecture is analyzed. Third,...
Matrix multiplication is one of the most widely used computational kernels in scientific computing and machine learning. Using a dedicated circuit for matrix multiplication can reduce computation time and energy consumption. Traditional matrix multipliers typically adopt a linear array architecture, which works inefficiently when the size of the matrix sub-block is much smaller than the array length. Using...
Breadth-First Search (BFS) is one of the most fundamental graph algorithms and is used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS on a graph with one trillion vertices within half a second, using large supercomputers such as the K-Computer. Using our proposed algorithm, the K-Computer was ranked 1st in Graph500 using all the 82,944 nodes available...
We present an efficient parallel algorithm for the following problem: Given an input collection D of n sequences of total length N, a length threshold f and a mismatch threshold κ, report all κ-mismatch maximal common substrings of length at least f over all pairs of strings in D. This problem is motivated by clustering and assembly applications in computational biology, where D is a collection of...
In digital signal processors, set-associative caches achieve low miss rates for typical applications but result in significant power consumption. Set-associative caches decrease access time by probing all the data ways in parallel with the tag lookup, although the output of only the matching way is used. The power spent accessing the other ways is wasted. Eliminating the power consumption by performing...
Energy and power consumption are major limitations to continued scaling of computing systems. Inexactness, where the quality of the solution can be traded for energy savings, has been proposed as a counterintuitive approach to overcoming those limitations. However, in the past, inexactness has necessitated highly customized or specialized hardware. In order to move away from customization,...
Memory access latency continues to be a dominant bottleneck in a large class of applications on modern architectures. To optimize memory performance, it is important to utilize the locality in the memory hierarchy. Structure splitting can significantly improve memory locality. However, pinpointing inefficient code and providing insightful guidance for structure splitting is challenging. Existing tools...
Due to the concerns of two-dimensional layout and structural modularity, interprocessor data transfers for VLSI arrays, such as systolic/wavefront processors, are normally achieved by way of neighborhood communication. Although interconnection networks are designed to enhance global communication for non-systolic types of processing, it is not feasible to incorporate the processors and global interconnections...
With the advent of clustered systems, more and more parallel computing is required. However, considerable programming skill is needed to write parallel code, especially to benefit from the various parallel architectural resources, with heterogeneous units and complex memory organizations. In this paper we present a method that automatically generates, step by step, a task-parallel distributed...
A sorting algorithm puts the elements of a list in a certain order, which makes searching for and locating information easier. The most commonly used orders are numerical order and lexicographical order. An efficient sorting algorithm is one with low time and space complexity. In this paper I present a comparative analysis of bubble sort and merge sort and try to show why a new approach is required to get...
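The contrast between the two algorithms named above can be sketched by counting comparisons. These are textbook versions written for illustration, not the author's implementations, and the convention of returning a `(sorted_list, comparisons)` pair is an assumption of this example.

```python
def bubble_sort(items):
    """Return (sorted copy, number of comparisons) using bubble sort."""
    data = list(items)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:  # a pass without swaps means the list is sorted
            break
    return data, comparisons


def merge_sort(items):
    """Return (sorted copy, number of comparisons) using top-down merge sort."""
    data = list(items)
    if len(data) <= 1:
        return data, 0
    mid = len(data) // 2
    left, left_count = merge_sort(data[:mid])
    right, right_count = merge_sort(data[mid:])
    merged = []
    comparisons = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:  # <= keeps equal elements in original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    worst_case = list(range(64, 0, -1))  # reverse-sorted input
    print("bubble sort comparisons:", bubble_sort(worst_case)[1])
    print("merge sort comparisons:", merge_sort(worst_case)[1])
```

On reverse-sorted input, bubble sort performs n(n-1)/2 comparisons, while merge sort stays within O(n log n). The price is merge sort's O(n) auxiliary space, against bubble sort's in-place operation.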
This paper addresses the issue of designing control systems for parallel computing structures. The design methodology described is based on Petri nets, which are used to model computing systems of different dimensionality. Then a description of the Petri net model (PN-model) vertex projection procedure, which allows constructing new models with differing structural and dynamical properties, is presented. Afterwards...
Computer science students use data array processing in many courses. To exploit the full power of caches and obtain higher performance, they mostly use the textbook example of sequential access of data arrays. However, a lot of discrepancies occur and the expected performance is not obtained in real life program executions, mostly due to the existence of several cache levels, with various architectures...
In this paper we propose a new degree of flexibility for soft processor design in which only the instructions relevant to the task at hand are implemented as a subset of the Instruction Set Architecture (ISA). These customized processors execute software kernels in the usual way, yet can be implemented with a fraction of the hardware resources used by other full-ISA soft processor cores. We present...
SIMD (Single Instruction Multiple Data) extension units are ubiquitous in modern processors. Array indirections raise several challenges for SIMD vectorization, including disjoint memory accesses, unknown alignment, and dependence cycles. Existing automatic SIMD vectorization methods do not handle these challenges well. This paper presents a new method exploiting Pure SLP (Superword Level Parallelism)...