Search results

Items from 1 to 20 out of 725 results

chapter

Congestion-aware memory management on NUMA platforms: A VMware ESXi case study

Jagadish B. Kotra, Seongbeom Kim, Kamesh Madduri, Mahmut T. Kandemir

2017 IEEE International Symposium on Workload Characterization (IISWC) > 146 - 155

The VMware ESXi hypervisor attracts a wide range of customers and is deployed in domains ranging from desktop computing to server computing. While the software systems are increasingly moving towards consolidation, hardware has already transitioned into multi-socket Non-Uniform Memory Access (NUMA)-based systems. The marriage of increasing consolidation and the multi-socket based systems warrants low-overhead,...

chapter

A compilation method for zero overhead loop in DSPs with VLIW

Rui Chang, Jun Wu, Haoqi Ren

2017 9th International Conference on Wireless Communications and Signal Processing (WCSP) > 1 - 7

The increasing use of digital signal processors (DSPs) in wireless communications and signal processing necessitates the optimization of compilers to support special hardware features. In this paper, we propose a compiler transformation method for zero overhead loop (ZOL). It supports very long instruction word (VLIW), internal branches and the loops whose iterative times are known at runtime and...

chapter

Exploring thread-level parallelism based on cost-driven model for irregular programs

Yuancheng Li, Bin Liu

2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) > 1 - 6

Speculative Multithreading (SpMT) technology is an effective mechanism for automatic parallelization of irregular programs. A sequential program can be executed speculatively in parallel by speculating that many data dependences are unlikely during runtime. Although speculative parallelization can potentially deliver significant speedup, several overheads associated with this technique can...

chapter

Automatic, Abstracted and Portable Topology-Aware Thread Placement

Jens Gustedt, Emmanuel Jeannot, Farouk Mansouri

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 389 - 399

Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on the performance. However, optimizing such thread placement is difficult: architectures become increasingly complex and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work,...

chapter

A Software Solution for Hardware Vulnerabilities

Komail Dharsee, Ethan Johnson, John Criswell

2017 IEEE Cybersecurity Development (SecDev) > 27 - 33

Modern processors are becoming increasingly complex with features that improve performance and add new functionality. However, such improvements are a double-edged sword: they improve performance and functionality but also introduce security-critical bugs into the processor that attackers can leverage to bypass a system's security policies. Existing solutions require hardware extensions and often...

chapter

Exploiting half precision arithmetic in Nvidia GPUs

Nhut-Minh Ho, Weng-Fai Wong

2017 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 7

With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU that offered FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing...

chapter

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees

Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 1 - 13

In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to...

chapter

Implementing an ISR defense on a MIPS architecture

Loriana Sanabria Sancho, Elena Gabriela Barrantes

2017 XLIII Latin American Computer Conference (CLEI) > 1 - 7

Code injection attacks are an undeniable threat in today's cyberworld. Instruction Set Randomization (ISR) was initially proposed in 2003. This technique was designed to protect systems against code injection attacks by creating a unique instruction set for each machine, thanks to randomization. It is a promising technique in the growing embedded system and Internet of Things (IoT) devices ecosystem,...

chapter

FPGA-based CNN inference accelerator synthesized from multi-threaded C software

Jin Hee Kim, Brett Grady, Ruolong Lian, John Brothers, et al.

2017 30th IEEE International System-on-Chip Conference (SOCC) > 268 - 273

A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) [1] tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete...

chapter

A Software-Hardware Co-designed Methodology for Efficient Thread Level Speculation

Qiong Wang, Jialong Wang, Li Shen, Zhiying Wang

2017 IEEE International Conference on Computer and Information Technology (CIT) > 184 - 191

Thread-Level Speculation (TLS) mechanism has been extensively studied due to its capability of simplifying parallel programming and achieving effective performance speedup. In this paper, we investigate the study of improving current TLS models for high efficiency on present multi-core architectures. Particularly, we propose a new TLS model called Cache Copy-on-Write (CCoW). The main features of our...

chapter

Developing CPU-GPU Embedded Systems Using Platform-Agnostic Components

Gabriel Campeanu, Jan Carlson, Severine Sentilles

2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) > 176 - 180

Nowadays, there are many embedded systems with different architectures that have incorporated GPUs. However, it is difficult to develop CPU-GPU embedded systems using component-based development (CBD), since existing CBD approaches have no support for GPU development. In this context, when targeting a particular CPU-GPU platform, the component developer is forced to construct hardware-specific components,...

chapter

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Lijuan Jiang, Chao Yang, Yulong Ao, Wanwang Yin, et al.

2017 46th International Conference on Parallel Processing (ICPP) > 422 - 431

The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general...

chapter

Application-Aware Power Coordination on Power Bounded NUMA Multicore Systems

Rong Ge, Pengfei Zou, Xizhou Feng

2017 46th International Conference on Parallel Processing (ICPP) > 591 - 600

Power is a critical factor that limits the performance and scalability of modern high performance computer systems. Considering power as a first-order constraint and a scarce system resource, power-bounded computing represents a new perspective to address the power challenge in HPC. In this work we present an application-aware, multi-dimensional power allocation framework to support power-bounded parallel...

chapter

A Region-Aware Multi-Objective Auto-Tuner for Parallel Programs

Klaus Kofler, Juan J. Durillo, Philipp Gschwandtner, Thomas Fahringer

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 190 - 199

Auto-tuning has become increasingly popular for optimizing non-functional parameters of parallel programs. The typically large search space requires sophisticated techniques to find well performing parameter values in a reasonable amount of time. Different parts of a program often perform best with different parameter values. We therefore subdivide programs into several regions, and try to optimize...

chapter

A Novel Minimum Time Parallel 2-D Discrete Wavelet Transform Algorithm for General Purpose Processors

Eduardo Moscoso Rubino, Alberto Jose Alvares, Raul Marin Prades, Pedro Sanz Valero

2017 46th International Conference on Parallel Processing (ICPP) > 553 - 562

A novel efficient in-place, multithreaded, and cache-friendly parallel 2-D wavelet transform algorithm based on the lifting transform is introduced. In order to maximize the cache utilization and consequently minimize the memory bus bandwidth use, the threads compete to work on a small memory area maximizing the chance of finding it in the cache and their synchronization is done with very low overhead...

chapter

Autotuning GPU Kernels via Static and Predictive Analysis

Robert Lim, Boyana Norris, Allen Malony

2017 46th International Conference on Parallel Processing (ICPP) > 523 - 532

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...

chapter

Understanding the Impact of Fine-Grained Data Sharing and Thread Communication on Heterogeneous Workload Development

Tuan Ta, David Troendle, Xiaoqi Hu, Byunghyun Jang

2017 16th International Symposium on Parallel and Distributed Computing (ISPDC) > 132 - 139

The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data-sharing and communications are allowed during kernel execution. This model limits the number of applications that can harness the...

chapter

Invited paper: Secure swarm intelligence: A new approach to many-core power management

Augusto Vega, Alper Buyuktosunoglu, Pradip Bose

2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) > 1 - 6

This paper presents a visionary proposal for a distributed (or decentralized) power/thermal control mechanism that applies the bio-inspired artificial intelligence paradigm of swarm intelligence. The target use case is a future many-core processor. The paper reports a high-level concept-phase specification of the proposed solution approach in a research setting. The emphasis is on highlighting the...

chapter

OpenMP device offloading to FPGA accelerators

Lukas Sommer, Jens Korinth, Andreas Koch

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 201 - 205

Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.

chapter

Efficient Data-Driven Task Allocation for Future Many-Cluster On-chip Systems

Alberto Scionti, Somnath Mazumdar, Antoni Portero

2017 International Conference on High Performance Computing & Simulation (HPCS) > 503 - 510

Continuous demand for higher performance is adding more pressure on hardware designers to provide faster machines with low energy consumption. Recent technological advancements allow placing a group of silicon dies on top of a conventional interposer (silicon layer), which provides space to integrate logic and interconnection resources to manage active processing cores. However, such large resource...

Keywords:
HARDWARE
INSTRUCTION SETS

Content availability

Available (721)
None (4)

Keywords

COMPUTER ARCHITECTURE (176)
KERNEL (131)
REGISTERS (127)
PARALLEL PROCESSING (104)
GRAPHICS PROCESSING UNITS (97)
BENCHMARK TESTING (92)
SYNCHRONIZATION (81)
FIELD PROGRAMMABLE GATE ARRAYS (79)
MULTICORE PROCESSING (77)
GRAPHICS PROCESSING UNIT (64)
COMPUTATIONAL MODELING (61)
OPTIMIZATION (61)
MESSAGE SYSTEMS (55)
PIPELINES (53)
RUNTIME (53)
MEMORY MANAGEMENT (51)
MICROPROCESSOR CHIPS (51)
EMBEDDED SYSTEMS (50)
MULTIPROCESSING SYSTEMS (43)
PROGRAMMING (41)
MULTITHREADING (39)
CLOCKS (36)
GPU (36)
PROGRAM PROCESSORS (36)
SOFTWARE (36)
CONTEXT (34)
RADIATION DETECTORS (33)
RANDOM ACCESS MEMORY (33)
FPGA (31)
MULTI-THREADING (31)
REAL-TIME SYSTEMS (31)
BANDWIDTH (30)
COPROCESSORS (29)
ARRAYS (28)
OPERATING SYSTEMS (27)
PARALLEL PROGRAMMING (27)
THROUGHPUT (27)
SYSTEM-ON-CHIP (26)
LINUX (24)
VLIW (24)
ALGORITHM DESIGN AND ANALYSIS (23)
PERFORMANCE EVALUATION (23)
RECONFIGURABLE ARCHITECTURES (23)
LIBRARIES (22)
MONITORING (22)
PROGRAM COMPILERS (22)
GPGPU (21)
INSTRUMENTS (21)
PARALLEL ARCHITECTURES (21)
PERFORMANCE (21)
RESOURCE MANAGEMENT (21)
ACCELERATION (20)
CACHE STORAGE (20)
ANALYTICAL MODELS (19)
CUDA (19)
DATA STRUCTURES (18)
PROTOCOLS (18)
SCALABILITY (18)
DATA MINING (17)
DELAY (17)
PROCESSOR SCHEDULING (17)
SWITCHES (17)
TIMING (17)
CONCURRENT COMPUTING (16)
HARDWARE-SOFTWARE CODESIGN (16)
JAVA (16)
SCHEDULING (16)
COMPUTER GRAPHIC EQUIPMENT (15)
DECODING (15)
FAULT TOLERANCE (15)
PIPELINE PROCESSING (15)
APPLICATION SPECIFIC INTEGRATED CIRCUITS (14)
COMPILER (14)
ENERGY CONSUMPTION (14)
REAL TIME SYSTEMS (14)
ASIP (13)
COHERENCE (13)
COMPUTERS (13)
ENERGY EFFICIENCY (13)
HARDWARE TRANSACTIONAL MEMORY (13)
LOGIC DESIGN (13)
MATHEMATICAL MODEL (13)
RECONFIGURABLE COMPUTING (13)
SERVERS (13)
COMPUTER BUGS (12)
FAULT TOLERANT SYSTEMS (12)
MICROARCHITECTURE (12)
MULTICORE (12)
PROTOTYPES (12)
SHARED MEMORY SYSTEMS (12)
TRANSACTIONAL MEMORY (12)
DIGITAL SIGNAL PROCESSING (11)
DYNAMIC SCHEDULING (11)
ENGINES (11)
INDEXES (11)
OPENMP (11)
POWER DEMAND (11)
PROCESS CONTROL (11)

INFONA - science communication portal
