Dealing with Memory Latency
Supercomputers are built to solve large, computationally intensive problems.
That means the primary storage subsystem must be both very large and very
high-bandwidth. A fast absolute access time is also highly desirable, but it
has proven easier to engineer around long latencies than around a lack of
storage bandwidth.
- Storage Hierarchies, in which one or more levels of cache memory are
associated with each processor, reduce average memory latency considerably
by drastically cutting the latency of cache "hits" while adding little or
nothing to the latency of cache misses. Cache effects on performance are
difficult to predict and occasionally perverse, and for some time high
performance computer designers declined to use caches, particularly for data
memory. Their effectiveness on general computing problems and their ubiquity
in server and workstation processors have motivated software designers to cope
with the additional complexity of generating software that takes cache
organization into consideration (a loop-tiling sketch follows this list).
- Vector Computation may initially have been motivated by limitations
in instruction bandwidth, but vector loads and stores have a regularity that
can be exploited by multi-banked memory systems (see the unit-stride loop
sketch after this list).
- Distributed Processing uses parallel programming techniques to
decompose the problem so that each of many nodes needs access to only a small
fraction of the total data and the total bandwidth. Depending on the
problem to be solved, this decomposition can be trivial or provably
impossible, but the leverage is so huge that a large body of algorithm and
tool work now exists to exploit it (see the decomposition sketch after this
list).
- Multithreading is a processor architecture technique, originally
deployed in I/O channel processors but directed at high performance computing
by Burton Smith at Denelcor and Tera. If a problem can be expressed as a
parallel program, its parallel elements can be run concurrently on the
same processor, with each thread getting some subset of processor cycles. If
each of T threads gets to issue one instruction every N cycles, the absolute
time between a thread's issuing a memory reference and the point at which the
processor needs the value to avoid stalling is multiplied by N, reducing
sensitivity to memory latency (illustrated in the last sketch after this
list).
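
To make the cache point concrete, here is a minimal sketch of loop tiling
(cache blocking) for a dense matrix multiply, one common way software is
written with the cache organization in mind. The routine name, the row-major
layout, and the tile size of 64 are illustrative assumptions, not tuned
values.

    #include <stddef.h>

    /* Illustrative loop tiling (cache blocking) for C += A * B on square
     * n x n row-major matrices. BLOCK = 64 is an assumed, untuned tile size;
     * C is assumed to be zero-initialized by the caller. */
    #define BLOCK 64

    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    /* The inner loops reuse one tile of A, B, and C many
                     * times while it is still resident in cache, so most
                     * references become hits instead of trips to memory. */
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }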
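
The regularity that vector loads and stores exploit is easiest to see in a
unit-stride loop like the one below; the function name and the banking
comment are illustrative, not specific to any particular machine.

    #include <stddef.h>

    /* A unit-stride "saxpy"-style loop: y = a*x + y. A vector machine turns
     * this into vector loads and stores over consecutive addresses, so on an
     * interleaved (multi-banked) memory successive elements fall on
     * successive banks and the transfer can run at full memory bandwidth. */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }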
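
The distributed-processing bullet is sketched below with MPI, assuming a
trivially decomposable problem (a global sum) and a problem size that divides
evenly among the ranks; both are illustrative choices, not a prescription.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative decomposition: each rank owns only its slice of a
     * notional global array, so it needs only 1/size of the data and of the
     * memory bandwidth. The problem size is arbitrary and assumed to be
     * divisible by the number of ranks. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long global_n = 1L << 24;
        long local_n = global_n / size;      /* this rank's share of the data */
        long first   = (long)rank * local_n;

        double local_sum = 0.0;
        for (long i = 0; i < local_n; i++)   /* touches only local data */
            local_sum += (double)(first + i);

        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f\n", global_sum);

        MPI_Finalize();
        return 0;
    }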
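
Finally, the latency-hiding arithmetic behind the multithreading bullet: the
sketch below assumes a barrel-style processor that issues one instruction per
cycle, rotating round-robin among the threads (so N equals the thread count),
and a made-up 300-cycle memory latency.

    #include <stdio.h>

    /* Round-robin ("barrel") issue in the spirit of the Denelcor HEP and the
     * Tera MTA. With T interleaved threads, a thread gets an issue slot only
     * every T cycles, so up to T cycles of memory latency are hidden behind
     * other threads' instructions. The 300-cycle latency is illustrative. */
    int main(void)
    {
        const int memory_latency = 300;  /* assumed cycles until a load returns */

        for (int threads = 1; threads <= 512; threads *= 2) {
            int exposed = memory_latency - threads;  /* stall the thread still sees */
            if (exposed < 0)
                exposed = 0;
            printf("threads = %3d  exposed stall per load = %3d cycles\n",
                   threads, exposed);
        }
        return 0;
    }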
My favorite quote about computer architecture comes not from a mathematician or
an engineer, but from General Omar Bradley, who is cited as having said "Amateurs
talk strategy. Professionals talk logistics". To my mind,
computer architecture, at least in its turn-of-the-21st-century form, is no
longer about instruction execution strategy, but about instruction and operand
logistics.