Dealing with Memory Latency
Supercomputers are built to solve large, computationally intensive problems.
That means the primary storage subsystem must be both very large and very
high-bandwidth. A fast absolute access time is also highly desirable, but it
has proven easier to engineer around long latencies than around a lack of
storage bandwidth.
- Storage Hierarchies, in which one or more levels of cache memory are
associated with each processor, reduce average memory latency considerably
by drastically cutting the latency of cache "hits" while adding little or
nothing to the latency of cache misses. Cache effects on performance are
difficult to predict and occasionally perverse, and for some time high
performance computer designers declined to use caches, particularly for data
memory. Their effectiveness on general computing problems and their ubiquity
in server and workstation processors have motivated software designers to cope
with the additional complexity of generating software that takes cache
organization into consideration (a loop-tiling sketch follows this list).
- Vector Computation may initially have been motivated by limitations
in instruction bandwidth, but vector loads and stores have a regularity that
can be exploited by multi-banked memory systems (see the unit-stride loop
sketch after this list).
- Distributed Processing uses parallel programming techniques to
decompose the problem so that each of many nodes needs access to only a small
fraction of the total data and the total bandwidth. Depending on the
problem to be solved, this decomposition can be trivial or provably
impossible, but the leverage is so huge that a large body of algorithm and
tool work now exists to exploit it (see the decomposition sketch after this
list).
- Multithreading is a processor architecture technique, originally
deployed in I/O channel processors but directed at high performance computing
by Burton Smith at Denelcor and Tera. If a problem can be expressed as a
parallel program, its parallel elements can be run concurrently on the
same processor, with each thread getting some subset of processor cycles. If
each of T threads gets to issue one instruction every N cycles, the absolute
time between a thread's issuing a memory reference and the point at which the
processor needs the value to avoid stalling is multiplied by N, reducing
sensitivity to memory latency (illustrated in the last sketch after this
list).
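
To make the cache point concrete, here is a minimal sketch of loop tiling
(cache blocking) for a dense matrix multiply, one common way software is
written with the cache organization in mind. The routine name, the row-major
layout, and the tile size of 64 are illustrative assumptions, not tuned
values.

    #include <stddef.h>

    /* Illustrative loop tiling (cache blocking) for C += A * B on square
     * n x n row-major matrices. BLOCK = 64 is an assumed, untuned tile size;
     * C is assumed to be zero-initialized by the caller. */
    #define BLOCK 64

    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    /* The inner loops reuse one tile of A, B, and C many
                     * times while it is still resident in cache, so most
                     * references become hits instead of trips to memory. */
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }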
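
The regularity that vector loads and stores exploit is easiest to see in a
unit-stride loop like the one below; the function name and the banking
comment are illustrative, not specific to any particular machine.

    #include <stddef.h>

    /* A unit-stride "saxpy"-style loop: y = a*x + y. A vector machine turns
     * this into vector loads and stores over consecutive addresses, so on an
     * interleaved (multi-banked) memory successive elements fall on
     * successive banks and the transfer can run at full memory bandwidth. */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }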
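
The distributed-processing bullet is sketched below with MPI, assuming a
trivially decomposable problem (a global sum) and a problem size that divides
evenly among the ranks; both are illustrative choices, not a prescription.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative decomposition: each rank owns only its slice of a
     * notional global array, so it needs only 1/size of the data and of the
     * memory bandwidth. The problem size is arbitrary and assumed to be
     * divisible by the number of ranks. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long global_n = 1L << 24;
        long local_n = global_n / size;      /* this rank's share of the data */
        long first   = (long)rank * local_n;

        double local_sum = 0.0;
        for (long i = 0; i < local_n; i++)   /* touches only local data */
            local_sum += (double)(first + i);

        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f\n", global_sum);

        MPI_Finalize();
        return 0;
    }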
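
Finally, the latency-hiding arithmetic behind the multithreading bullet: the
sketch below assumes a barrel-style processor that issues one instruction per
cycle, rotating round-robin among the threads (so N equals the thread count),
and a made-up 300-cycle memory latency.

    #include <stdio.h>

    /* Round-robin ("barrel") issue in the spirit of the Denelcor HEP and the
     * Tera MTA. With T interleaved threads, a thread gets an issue slot only
     * every T cycles, so up to T cycles of memory latency are hidden behind
     * other threads' instructions. The 300-cycle latency is illustrative. */
    int main(void)
    {
        const int memory_latency = 300;  /* assumed cycles until a load returns */

        for (int threads = 1; threads <= 512; threads *= 2) {
            int exposed = memory_latency - threads;  /* stall the thread still sees */
            if (exposed < 0)
                exposed = 0;
            printf("threads = %3d  exposed stall per load = %3d cycles\n",
                   threads, exposed);
        }
        return 0;
    }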
My favorite quote about computer architecture comes not from a mathematician or
an engineer, but from General Omar Bradley, who is cited as having said "Amateurs
talk strategy. Professionals talk logistics". To my mind,
computer architecture, at least in its turn-of-the-21st-century form, is no
longer about instruction execution strategy, but about instruction and operand
logistics.