A Tuning Framework for Software-Managed Memory Hierarchies (2009)
Manman Ren, Alex Aiken, Ji Young Park, William J. Dally, Mike Houston
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the...
PulseNet – A Parallel Flash Sampler and Digital Processor IC for Optical SETI (2008)
Andrew W. Howard, Gu-yeon Wei, William J. Dally, Paul Horowitz
PulseNet is a full-custom IC with parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers...
Register Pointer Architecture for Efficient Embedded Processors (2008)
Jongsoo Park, Sung-boem Park, James D. Balfour, David Black-schaffer, Christos Kozyrakis, William J. Dally
Conventional register file architectures cannot optimally exploit temporal locality in data references due to their limited capacity and static encoding of register addresses in instructions. In...
Sequoia: Programming the Memory Hierarchy (2008)
Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, ...
We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory...
Patrick Chiang, William J. Dally, Ramesh Senthinathan, Yangjin Oh, Mark Horowitz
A 20Gb/s transmitter is implemented in 0.13um CMOS technology. Eight 2.5Gb/s data streams are 4:1 multiplexed, sampled, and retimed into two 10Gb/s data streams. A final 20Gb/s 2:1 output...
William J. Dally, Hiromichi Aoki
Abstract: The use of adaptive routing in a multicomputer interconnection network improves network performance by making use of all available paths and provides fault tolerance by allowing messages to...
Sun Microsystems D.N. (Jay) (2008)
John D. Owens, William J. Dally, Ron Ho, Stephen W. Keckler, Li-shiuan Peh
VLSI technology’s increased capability is yielding a more powerful, more capable, and more flexible computing system on single processor die. The microprocessor industry is moving from...
A Portable Runtime Interface For Multi-Level Memory Hierarchies (2008)
Mike Houston, Ji-young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian, Alex Aiken, ...
We present a platform independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies. We show that this interface can be used as a...
Steve Scott, Dennis Abts, John Kim, William J. Dally
This paper describes the radix-64 folded-Clos network of the Cray BlackWidow scalable vector multiprocessor. We describe the BlackWidow network which scales to 32K processors with a worst-case...
Globally Adaptive Load-Balanced Routing on k-ary n-cubes (2008)
Arjun Singh, William J Dally, Brian Towles, Amit K Gupta
We introduce a new method of adaptive routing on k-ary n-cubes that we refer to as Globally Adaptive Load-balance (GAL). Unlike previous adaptive routing algorithms that make routing decisions based...
Brucek Khailany, William J. Dally, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, ...
Media-processing applications, such as signal processing, 2D- and 3D-graphics rendering, and image and audio compression and decompression, are the dominant workloads in many systems today. The...
CMOS High-Speed I/Os — Present and Future (2008)
William J. Dally, Ramin Farjad-rad, Hiok-tiaq Ng, Ramesh Senthinathan, John Edmondson, ...
High-speed I/O circuits, once used only for PHYs, are now widely used for intra-system signaling as well because of their bandwidth, power, area, and cost advantages. This technology enables chips...
Mattan Erez, Jung Ho Ahn, Jayanth Gummaraju, Mendel Rosenblum, William J. Dally (2008)
The recent emergence of compute-intensive stream processors such as the Cell Broadband Engine, Stanford’s Merrimac, and Clear-...
Communication Scheduling (2008)
Peter Mattson, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens
The high arithmetic rates of media processing applications require architectures with tens to hundreds of functional units, multiple register files, and explicit interconnect between functional units...
Architectural Support for the Stream Execution Model on General-Purpose Processors (2008)
Jayanth Gummaraju, Mattan Erez, Joel Coburn, Mendel Rosenblum, William J. Dally
There has recently been much interest in stream processing, both in industry (e.g., Cell, NVIDIA G80, ATI R580) and academia (e.g., Stanford Merrimac, MIT RAW), with stream programs becoming...
Smart Memories: A Modular Reconfigurable Architecture (2008)
Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, Mark Horowitz
Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely...
How to Choose the Grain Size of a Parallel Computer (2008)
Donald Yeung, William J. Dally, Anant Agarwal
Designers of parallel computers have to decide how to apportion a machine's resources between processing, memory, and communication. How these resources are apportioned determine the...
Li-shiuan Peh, William J. Dally
Buses and crossbars have traditionally served as the communication fabrics of computer systems, network switches and routers, and other digital systems. However, demand for communication bandwidth is...
How to Choose the Grain Size of a Parallel Computer (2008)
Donald Yeung, William J. Dally, Anant Agarwal
Designers of parallel computers have to decide how to apportion a machine's resources between processing, memory, and communication. How these resources are apportioned determine the grain and...
Throughput-Centric Routing Algorithm Design (2008)
Brian Towles, William J. Dally, Stephen Boyd
The increasing application space of interconnection networks now encompasses several applications, such as packet routing and I/O interconnect, where the throughput of a routing algorithm, not just...
A Data-Driven IDCT Architecture for Low Power Video Applications (2007)
Thucydides Xanthopoulos, Anantha P. Chandrakasan, Charles G. Sodini, William J. Dally
Analysis of transform coded (MPEG) video data streams reveals a large percentage of zero-valued Discrete Cosine Transform (DCT) coefficients. A Data-Driven 2D IDCT architecture...
M-Machine Microarchitecture v1.1 (2007)
William J. Dally, Stephen W. Keckler, Nick Carter, Andrew Chang, Marco Fillo, Whay S. Lee
This document describes the microarchitecture of the MIT M-Machine. It details the machine's organization in terms of arithmetic units, switches, buses, memories, and control units. The...
A Bandwidth-efficient Architecture for a Streaming Media Processor (2007)
William J. Dally, Scott Rixner
Experiences Implementing Dataflow on a General-Purpose Parallel Computer (2007)
Ellen Spertus, William J. Dally
The MIT J-Machine [3], a massively-parallel computer, is an experiment in providing general-purpose mechanisms for communication, synchronization, and naming that will support a wide...
Transmitter Equalization for 4Gb/s Signalling (2007)
William J. Dally, John Poulton
To operate a serial channel over copper wires at 4Gb/s, we incorporate a 4GHz FIR equalizing filter into a differential transmitter. The equalizer cancels the frequency-dependent attenuation caused...
A Data-Driven IDCT Architecture for Low Power Video Applications (2007)
Thucydides Xanthopoulos, Anantha P. Chandrakasan, Charles G. Sodini, William J. Dally
Analysis of transform coded (MPEG) video data streams reveals a large percentage of zero-valued Discrete Cosine Transform (DCT) coefficients. A Data-Driven 2D IDCT architecture (DDIDCT) is proposed...
Comparing Reyes and OpenGL on a Stream Architecture (2007)
Thomas Ertl, Wolfgang Heidrich, Michael Doggett (editors), John D. Owens, Brucek Khailany, Brian Towles, ...
The OpenGL and Reyes rendering pipelines each render complex scenes from similar scene descriptions but differ in their internal pipeline organizations. While the OpenGL organization has dominated...
Appears in the Proceedings of MICRO-28. The M-Machine Multicomputer (2007)
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, ...
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems....
Nicholas P. Carter, Stephen W. Keckler, William J. Dally
Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of...
Performance Analysis of k-ary n-cube Interconnection Networks (2007)
Abstract: VLSI communication networks are wire-limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct...
Michael D. Noakes, Deborah A. Wallach, William J. Dally
The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an...
Memory Access Scheduling (2007)
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens
The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary...
Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor (2007)
Stephen W. Keckler, William J. Dally, Daniel Maskit, Nicholas P. Carter, Andrew Chang, ...
Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been...
Li-shiuan Peh, William J. Dally (Best Student Paper)
This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method employed, the delay of the flow-control credit path, and the...
Compilation for explicitly managed memory hierarchies (2007)
Timothy J. Knight, Ji Young Park, Manman Ren, Mike Houston, Mattan Erez, Kayvon Fatahalian, ...
We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk...
Flattened butterfly: A cost-efficient topology for high-radix networks (2007)
Increasing integrated-circuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly,...
Fault tolerance techniques for the Merrimac streaming supercomputer (2005)
Mattan Erez, Nuwan Jayasena, Timothy J. Knight, William J. Dally
As device scales shrink, higher transistor counts are available while soft-errors, even in logic, become a major concern. A new class of architectures, such as Merrimac and the IBM Cell, take...
Microarchitecture of a high-radix router (2005)
John Kim, William J. Dally, Brian Towles, Amit K. Gupta
Evolving semiconductor and circuit technology has greatly increased the pin bandwidth available to a router chip. In the early 90s, routers were limited to 10Gb/s of pin bandwidth. Today 1Tb/s is...
RESOURCE MANAGEMENT IN SINGLE-CHIP MULTIPROCESSORS (2005)
A. Shaw, William J. Dally, Oyekunle A. Olukotun
Evaluating the Imagine stream architecture (2004)
Jung Ho Ahn, William J. Dally, Brucek Khailany, Ujval J. Kapasi, Abhishek Das
This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine [8] is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local...
Globally adaptive load-balanced routing on tori (2004)
Arjun Singh, William J Dally, Brian Towles, Amit K Gupta
We introduce a new method of adaptive routing...
Adaptive channel queue routing on k-ary n-cubes (2004)
Arjun Singh, William J Dally, Amit K Gupta, Brian Towles
This paper introduces a new adaptive method, Channel Queue Routing (CQR), for load-balanced routing on k-ary n-cube interconnection networks. CQR estimates global congestion in the network from its...
Stream Register Files with Indexed Access (2004)
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with...
Programmable stream processors (2003)
Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, Brucek Khailany
The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32...
Exploring the VLSI scalability of stream processors (2003)
Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens, Brian Towles
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional...
GOAL: A load-balanced adaptive routing algorithm for torus networks (2003)
Arjun Singh, William J Dally, Amit K Gupta, Brian Towles
We introduce a load-balanced adaptive routing algorithm for torus networks, GOAL (Globally Oblivious Adaptive Locally), that provides high throughput on adversarial traffic patterns, matching or...
Guaranteed scheduling for switches with configuration overhead (2003)
Brian Towles, William J. Dally
In this paper, we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an...
A Stream Processor Development Platform (2002)
Ben Serebrin, John D. Owens, Chen H. Chen, Stephen P. Crago, Ujval J. Kapasi, Brucek Khailany, ...
We describe a hardware and software platform for developing streaming applications. Programmers write stream programs in high-level languages, and a set of software tools maps these programs to code...
VLSI design and verification of the Imagine processor (2002)
Brucek Khailany, William J. Dally, Andrew Chang, Ujval J. Kapasi, Jinyung Namkoong, Brian Towles
The Imagine stream processor is a 21 million transistor chip implemented by a collaboration between Stanford University and Texas Instruments in a 1.5V 0.15 µm process with five layers of aluminum...
Media processing applications on the Imagine stream processor (2002)
John D. Owens, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Brian Towles, Ben Serebrin, ...
Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to...
Migration in Single Chip Multiprocessors (2002)
Kelly A. Shaw, William J. Dally
Global communication costs in future single-chip multiprocessors will increase linearly with distance. In this paper, we revisit the issues of locality and load balance in order to take...
Scalable Opto-Electronic Network (SOENet) (2002)
Amit K. Gupta, William J. Dally, Arjun Singh, Brian Towles
In applications such as processor-memory interconnect, I/O networks, and router switch fabrics, an interconnection network must be scalable to thousands of high-bandwidth terminals while at the same...
Guaranteed Scheduling for Switches with Configuration Overhead (2002)
Brian Towles, William J. Dally
In this paper we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an unconstrained...
Locality-preserving randomized oblivious routing on torus networks (2002)
Arjun Singh, William J. Dally, Brian Towles, Amit K. Gupta
We introduce Randomized Local Balance (RLB), a routing algorithm that strikes a balance between locality and load balance in torus networks, and analyze RLB’s performance for benign and adversarial...
Stream Processor Architecture / S. Rixner; foreword by W.J. Dally (2002)
Rixner, Scott, Dally, William J. (foreword)
Contents: 1) Introduction; 2) Background; 3) Media Processing Applications; 4) The Imagine Stream Processor; 5) Data Bandwidth Hierarchy; 6) Scheduling of memory...
Ujval J. Kapasi, Peter Mattson, William J. Dally, John D. Owens, Brian Towles
Media applications can be cast as a set of computation kernels operating on sequential data streams. This representation exposes the structure of an application to the hardware as well as to the...
An 84-mW 4Gb/s clock and data recovery circuit for serial link applications (2001)
William J. Dally, John W. Poulton, Patrick Chiang, Stephen F. Greenwood
A 4Gb/s serial link tracking clock and data recovery (CDR) circuit fabricated in 0.24μm CMOS technology dissipates 84mW and occupies 0.3mm². The input signal is 2× oversampled by 8 offset-cancelled...
A delay model and speculative architecture for pipelined routers (2001)
Li-shiuan Peh, William J. Dally
This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method...
A delay model for router microarchitectures (2001)
Li-shiuan Peh, William J. Dally
Current router models [2, 3, 5, 6] assume that clock cycle time depends solely on router latency. However, in practice, routers are heavily pipelined, making cycle time largely independent...
References Vectors and Streams (2000)
Lecturer Melvyn Lim, Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo Lopez, ...
Up to now, we have only looked at techniques that exploit instruction level parallelism to increase single-thread processor performance. The instruction level parallelism dictates how many...
Low-power area-efficient high-speed I/O circuit techniques (2000)
William J. Dally, Patrick Chiang
We present a 4-Gb/s I/O circuit that fits in 0.1 mm² of die area, dissipates 90 mW of power, and operates over 1 m...
Efficient conditional operations for data-parallel architectures (2000)
Ujval J. Kapasi, William J. Dally, Scott Rixner, Peter R. Mattson, John D. Owens, Brucek Khailany
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel...
Register Organization for Media Processing (2000)
Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, John D. Owens
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image...
Memory Access Scheduling (2000)
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens
The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of...
The Role of Custom Design in ASIC Chips (2000)
William J. Dally, Andrew Chang
Custom design, in which the designer controls the physical structure of the chip, can greatly improve the speed, power, and delay of an ASIC chip without affecting design time. Through floorplanning...
Flit-Reservation Flow Control (2000)
Li-shiuan Peh, William J. Dally
This paper presents flit-reservation flow control, in which control flits traverse the network in advance of data flits, reserving buffers and channel bandwidth. Flit-reservation flow control...
Concurrent Event Handling through Multithreading (1999)
Stephen W. Keckler, Andrew Chang, Whay S. Lee, Sandeep Chatterjee, William J. Dally
Exceptions have traditionally been used to handle infrequently occurring and unpredictable events during normal program execution. Current trends in microprocessor and operating systems...
Brian Towles, William J. Dally, Stephen Boyd
The increasing application space of interconnection networks now encompasses several applications, such as packet routing and I/O interconnect, where the throughput of a routing algorithm, not just...
Scalable Switching Fabrics for Internet Routers (1999)
The exponential growth of the Internet is driving a demand for routers that operate at increasing bit-rates (OC48 to OC192 to OC768) and that have a very large number of ports (10s to 100s to 1000s)....
VLSI architecture: past, present, and future (1999)
This paper examines the impact of VLSI technology on the evolution of computer architecture and projects the future of this evolution. We see that over the past 20 years, the increased density of...
The effects of explicitly parallel mechanisms on the Multi-ALU processor cluster pipeline (1998)
Andrew Chang, William J. Dally, Stephen W. Keckler, Nicholas P. Carter, Whay S. Lee
Continuing reductions in on-chip geometries yield increasing numbers of transistors per chip and fundamentally faster devices but also result in effectively slower wires. This combination presents...
Exploiting fine-grain thread level parallelism on the MIT Multi-ALU processor (1998)
Stephen W. Keckler, William J. Dally, Daniel Maskit, Nicholas P. Carter, Andrew Chang, Whay S. Lee
Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been...
A bandwidth-efficient architecture for media processing (1998)
Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, ...
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to...
Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor (1998)
Stephen Keckler, William J. Dally, Daniel Maskit, Nicholas P. Carter, Andrew Chang, Whay S. Lee
Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been...
A Bandwidth-Efficient Architecture for Media Processing (1998)
Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, ...
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to...
The Effects of Explicitly Parallel Mechanisms on the Multi-ALU Processor Cluster Pipeline (1998)
Andrew Chang, William J. Dally, Stephen W. Keckler, Nicholas P. Carter, Whay S. Lee
Continuing reductions in on-chip geometries yield increasing numbers of transistors per chip and fundamentally faster devices but also result in effectively slower wires. This combination presents...
J. P. Grossman, William J. Dally
We present an algorithm suitable for real-time, high quality rendering of complex objects. Objects are represented as a dense set of surface point samples which contain colour, depth and normal...
J. P. Grossman, William J. Dally, Seth Teller, ...
We present an algorithm suitable for real-time, high quality rendering of complex objects. Objects are represented as a dense set of surface point samples which contain colour, depth and normal...
Retrospective: The J-Machine (1998)
William J. Dally, Andrew Chang, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, ...
Eleven years ago, at ISCA 14, we published a paper titled “Architecture of a Message-Driven Processor” [1], marking the start of our J-Machine project at MIT. The project culminated with the...
A Tracking Clock Recovery Receiver for 4Gb/s Signaling (1997)
John Poulton, William J. Dally, Steve Tell
We have previously described a design for a 4Gb/s signaling system that uses transmitter equalization to overcome the frequency-dependent attenuation due to skin effect in transmission...
Transmitter Equalization for 4Gb/s Signalling (1997)
William J. Dally, John Poulton
To operate a serial channel over copper wires at 4Gb/s, we incorporate a 4GHz FIR equalizing filter into a differential transmitter. The equalizer cancels the frequency-dependent attenuation caused...
The Delta Tree: An Object-Centered Approach to Image-Based Rendering (1996)
William J. Dally, Leonard McMillan, Gary Bishop, Henry Fuchs
This paper introduces the delta tree, a data structure that represents an object using a set of reference images. It also describes an algorithm for generating arbitrary re-projections of an object...
The M-Machine Multicomputer (1995)
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, ...
This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the...
The M-Machine Multicomputer (1995)
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny Gurevich, ...
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems....
Low-Latency Plesiochronous Data Retiming (1995)
Larry R. Dennison, William J. Dally, Duke Xanthopoulos
A new method of retiming plesiochronous data is described. This method features latency of less than a cell-time and requires only minimal support circuitry. No flow control or handshaking signals...
Fault Tolerant Adaptive Routing in Multicomputer Networks (1995)
William J. Dally, Frederic R. Morgenthaler, Thucydides Xanthopoulos
Interconnection networks play a major role in the performance and reliability of massively parallel processors (MPPs). This work is concerned with the design and implementation of a wormhole...
Evaluating the Locality Benefits of Active Messages (1995)
Ellen Spertus, William J. Dally
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley...
Thread Prioritization: A Thread Scheduling Mechanism for Multiple-Context Parallel Processors (1995)
Stuart Fiske, William J. Dally
Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When...
The Subspace Model: A Theory of Shapes for Parallel Systems (1995)
Kathleen Knobe, William J. Dally
This paper presents a shape based abstraction for compiling to parallel systems. Data layout is often the subject of direct analysis while shape is addressed in ad hoc ways at best. However, a...
The Named-State Register File: Implementation and Performance (1995)
Peter R. Nuth, William J. Dally
Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper...
The Subspace Model: A Theory of Shapes for Parallel Systems (1995)
Kathleen Knobe, William J. Dally
This paper presents a shape based abstraction. Data layout is often the subject of direct analysis while shape is addressed in ad hoc ways at best. However, a suboptimal shape can be more costly than...
William J. Dally, Larry R. Dennison, David Harris, Kinhong Kan, Thucydides Xanthopoulos
The Reliable Router (RR) is a network switching element targeted to two-dimensional mesh interconnection network topologies. It is designed to run at 100 MHz and reach a useful link bandwidth of...
Hardware Support for Fast Capability-based Addressing (1994)
Nicholas Carter, Stephen W. Keckler, William J. Dally
Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of...
High-Performance Bidirectional Signalling in VLSI Systems (1993)
Larry Dennison, Whay S. Lee, William J. Dally
Interchip I/O bandwidth is a critical bottleneck in VLSI systems. To make the best use of this resource the conventions and circuits used for inter-chip signaling must be optimized to achieve the...
Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5 (1993)
Ellen Spertus, Seth Copen Goldstein, Klaus Erik Schauser, Thorsten Von Eicken, David E. Culler, William J. Dally
This paper uses an abstract machine approach to compare the mechanisms of two parallel machines: the J-Machine and the CM-5. High-level parallel programs are translated by a single optimizing...
The J-Machine Multicomputer: An Architectural Evaluation (1993)
Michael Noakes, Deborah A. Wallach, William J. Dally
The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an...
Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5 (1993)
Ellen Spertus, Seth Copen Goldstein, William J. Dally, Klaus Erik Schauser, Thorsten Von Eicken, David E. Culler
Experimental and commercial parallel machines have matured to a point where it is possible to quantify the performance enhancement due to the novel mechanisms supporting fine-grain parallel programs...
Evaluation of mechanisms for fine-grained parallel programs (1993)
Ellen Spertus, Seth Copen Goldstein, William J. Dally, Klaus Erik Schauser, Thorsten Von Eicken, David E. Culler
Experimental and commercial parallel machines have matured to a point where it is possible to quantify the performance enhancement due to the novel mechanisms supporting fine-grain parallel programs...
Processor coupling: Integrating compile time and runtime scheduling for parallelism (1992)
Stephen W. Keckler, William J. Dally
The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread...
Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism (1992)
Stephen Keckler, William J. Dally
The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling...
Peter Nuth, William J. Dally
The J-Machine network is a 3-D mesh employing wormhole routing and virtual channels to provide two network priorities. Each network channel is 9 bits wide and operates at 32MHz. Each J-Machine node...
Experiments with Dataflow on a General-Purpose Parallel Computer (1991)
Ellen Spertus, William J. Dally
The MIT J-Machine [2], a massively-parallel computer, is an experiment in providing general-purpose mechanisms for communication, synchronization, and naming that will support a wide variety of...
A Mechanism for Efficient Context Switching (1991)
Peter R. Nuth, William J. Dally
Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the restored state is not used before the next context switch. This...
Performance Analysis of k-ary n-cube Interconnection Networks (1990)
VLSI communication networks are wire-limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct...
Deadlock-Free Message Routing in Multiprocessor Interconnection Networks (1988)
Dally, William J., Seitz, Charles L.
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the...
Performance Analysis of k-ary n-cube Interconnection Networks (1988)
VLSI communication networks are wire limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the...
On the Performance of k-ary n-cube Interconnection Networks (1986)
The performance of k-ary n-cube interconnection networks is analyzed under the assumption of constant wire bisection. It is shown that low-dimensional k-ary n-cube networks (e.g., tori) have lower...
Deadlock Free Message Routing in Multiprocessor Interconnection Networks (1986)
Dally, William J., Seitz, Charles L.
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the...
Dally, William J., Seitz, Charles L.
The torus routing chip (TRC) is a self-timed chip that performs deadlock-free cut-through routing in k-ary n-cube multiprocessor interconnection networks using a new method of deadlock avoidance...
A VLSI Architecture for Concurrent Data Structures (1986)
Concurrent data structures simplify the development of concurrent programs by encapsulating commonly used mechanisms for synchronization and communication into data structures. This thesis develops a...
The Balanced Cube: A Concurrent Data Structure (1985)
Dally, William J., Seitz, Charles L.
This paper describes the balanced cube, a new data structure for implementing ordered sets. Conventional data structures such as heaps, balanced trees and B-trees have root bottlenecks which limit...
An Object Oriented Architecture (1984)
Dally, William J., Kajiya, James T.
We propose a new machine architecture for high performance execution of late binding object oriented languages. The two principal mechanisms for attaining this goal are a fast context...
The MOSSIM Simulation Engine Architecture and Design (1984)
As the complexity of VLSI circuits approaches 10 to the power of 6 devices, the computational requirements of design verification are exceeding the capacity of general purpose computers. To provide...
Efficient, Protected Message Interface in the MIT M-Machine
Whay Sing Lee, William J. Dally, Stephen W. Keckler, Nicholas P. Carter, Andrew Chang
We present a user-level message interface that provides high performance and very low processor overhead. In this system, messages are launched from within the user's general register file, and...