Gould NPL

The following submission was kindly made by Jeff Lohman, jefflo@cyrix.com.

In 1982-83, shortly after being acquired by Gould Inc., the Computer Systems Division (formerly Systems Engineering Laboratories - SEL) began designing a new product line of superminicomputers, called, very originally, NPL. These were actually known as minisupercomputers, for they included a vector instruction set, and were intended to be "affordable supercomputers". Similar projects were already underway, or soon to be underway, at Convex and Elxsi, and were rumored at DEC.

Unlike Gould's existing product line, the Concept and PowerNode 32/67 and 32/87 machines, NPL was to have only a Unix-derived operating system, UTX/32, and was strictly intended for timesharing installations, not being targeted at the real-time marketplace, SEL's traditional niche. The real-time operating system, MPX/32, was not supposed to be ported to it; there were vague notions of developing a completely new RTOS or real-time UNIX in the future, which never came to fruition, at least for NPL.

NPL was composed of two successive but partially overlapped development projects. The NP1 project consisted of the first CPU design and the complete system design, which inherited no components from the existing product line. This began in 1983 and first shipped in 1987. The NP2 project consisted of the second CPU and memory module design, but otherwise utilized the NP1 system design and components. It began in 1986 when the NP1 project proceeded into the design verification phase, and had completed design verification in 1989 when it was canceled shortly after CSD was acquired by Encore, Inc.


The CPU was connected to the system bus via a single 128KB 2-way set associative primary cache with LRU replacement. This was split into completely separate instruction and operand caches with dedicated address and 128-bit data ports sharing the common system bus interface, and 64-bit data read and write interfaces to the remainder of the CPU. This 64-bit execution unit data interface allowed a double precision vector load/execute/store chain to run at 1 element per clock by performing write combining, in the 4 by 128-bit non-forwarding write buffer, and 128-bit operand cache reads and writes on alternate cycles. Each cache was self-blocking and virtually addressed for reads and fills, while the write-through, non-write-allocating operand cache was physically addressed for writes and invalidates, which were also handled by snooping the system bus with a dedicated port on physically indexed directories; virtual addressing was selected in spite of its complexity for expandability and to keep the pipeline short. Synonyms were not permitted in the operand cache, being detected and invalidated when a line was allocated; the virtual indexes supported background invalidation upon a context switch, which could invalidate user space only if desired. The line size was 128 bytes divided into two sectors; in general a miss resulted in filling the missing sector only, unless it was a vector access to the first sector.
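The sectored miss handling described above can be sketched in a few lines. The following is a minimal model, not the actual NP1 cache logic; the 2-way/LRU/128-byte-line parameters come from the text, while the tag/index arithmetic and data structures are illustrative assumptions:

```python
# Sketch of a 2-way set-associative, LRU-replaced cache with 128-byte
# lines split into two 64-byte sectors, filling only the missing sector
# on a miss (both sectors for a vector access to the first sector).

LINE = 128                            # bytes per line
SECTOR = 64                           # bytes per sector, two per line
SETS = (128 * 1024) // (LINE * 2)     # 512 sets for a 2-way, 128 KB cache

class SectoredCache:
    def __init__(self):
        # each way maps set index -> (tag, [sector0_valid, sector1_valid])
        self.ways = [dict(), dict()]
        self.lru = {}                 # set index -> least recently used way

    def access(self, addr, is_vector_first_sector=False):
        """Return True on a sector hit; on a miss, fill only the missing
        sector (or both, for a vector access starting in sector 0)."""
        tag = addr // (LINE * SETS)
        index = (addr // LINE) % SETS
        sector = (addr % LINE) // SECTOR
        for w, way in enumerate(self.ways):
            line = way.get(index)
            if line and line[0] == tag:
                self.lru[index] = 1 - w       # other way becomes LRU
                if line[1][sector]:
                    return True               # sector present: hit
                line[1][sector] = True        # line present, sector missing
                return False
        # line miss: replace the LRU way, filling the missing sector only
        victim = self.lru.get(index, 0)
        valid = [False, False]
        valid[sector] = True
        if is_vector_first_sector and sector == 0:
            valid[1] = True                   # vector access fills both
        self.ways[victim][index] = (tag, valid)
        self.lru[index] = 1 - victim
        return False
```

A scalar access that misses brings in one sector, so a later touch to the other half of the same line still misses once; the vector case prefills the whole line.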

Instructions were prefetched from the instruction cache into a ping-pong 3 deep 64-bit instruction FIFO and a 1 deep 64-bit branch target buffer, where one side contained the currently executing stream, while the other could hold the beginning of a predicted stream. Some predecoding was performed within the instruction FIFO. Branch prediction was static, based upon instruction type; regardless of the prediction, the branch target was always fetched into a BTB if it wasn't already present in one. Decoding, but not execution, then proceeded down the predicted path until the branch was resolved in the Execute stage. The PC was maintained as two fetch pointers, one for each side of the instruction FIFO, plus two subtractive displacement values from these pointers back to the PC of the instruction in the Access stage.
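The fetch behavior around branches can be sketched as follows. The per-type prediction policy and the fixed 4-byte sequential advance below are illustrative assumptions, not the NP1 tables; what the text does state is that prediction was static by instruction type and that the target was always fetched into a BTB:

```python
# Sketch of static, instruction-type-based branch prediction with a
# branch target buffer that is filled regardless of the prediction.

STATIC_PREDICTION = {                 # assumed per-opcode-class policy
    "branch_unconditional": True,
    "branch_on_count": True,          # loop-closing branches usually taken
    "branch_conditional": False,
}

class FrontEnd:
    def __init__(self):
        self.btb = {}                 # 1-deep per stream in NP1; a dict here

    def fetch_branch(self, pc, opcode, target):
        """Predict by instruction type; fetch the target stream into the
        BTB regardless of the prediction, unless already present."""
        taken = STATIC_PREDICTION.get(opcode, False)
        if target not in self.btb:
            self.btb[target] = f"instructions@{target}"  # fetch target words
        # assumed fixed 4-byte sequential advance, for illustration only
        next_pc = target if taken else pc + 4
        return taken, next_pc
```

Because the target is fetched either way, a mispredicted-not-taken branch can restart from the BTB without waiting on the instruction cache.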

From this instruction FIFO, instructions proceeded into the first stage of a five stage pipeline: Decode 1, Decode 2, Access, Execute, and Writeback. Decode 1 was unique in that it could contain 2 instructions in parallel.

Consequently, the Decode 1 stage operated in one of two modes: sequential mode, containing 1 to 2 instructions, or predicted mode, containing 1 instruction each from the original and predicted streams. Instruction decode and address calculation, with RAW forwarding, occurred during Decode 1 for these 1-2 instructions. Normally, 1 instruction advanced from this stage into Decode 2, where operand cache access, address translation, and generation of the microcode starting address occurred. The contents of Decode 2 then advanced to the Access stage, where operands were returned from cache or the instruction stream to the execution unit, which read its file operands, while the first microinstruction was read from microstore. A cache miss on an operand fetch resulted in the extension of the Access stage via a pipeline stall until the operand was returned to the execution unit.

In addition to normal scalar issue, 2-way superscalar issue could be achieved in some cases when Decode 1 advanced into Decode 2, if one of the instructions was a scalar store which was compatible with the second instruction in Decode 1. This was possible because scalar stores were executed by the instruction decode unit, which maintained a coherent copy of the scalar registers supplying the data for the store, and because this unit collected results from the execution units before supplying memory write data to the write buffers in cache. Vector chaining of 3 or 4 instructions within Decode 1 and 2 could also effectively result in superscalar issue.

The non-blocking TLB consisted of a 256 location demand loaded buffer for STEs and a 2K location, split instruction and operand, 2-way set associative cache for PTEs with LRU replacement and a 4 PTE line size. Both STEs and PTEs had firmware initiated background invalidates, with the STEs and PTEs being filled by firmware-driven and hardware-driven tablewalks, respectively.

The microstore was an 8Kx120 RAM addressed by the instruction derived initial address, the next address field of the microinstruction, an 8 deep stack, or an exception/reset address, each capable of conditionally branching up to 4 ways, including calls and returns. The integer execution unit performed all integer instructions except multiplies, plus all divide instructions. It consisted of a single cycle 64-bit general purpose ALU and shifter augmented with quotient digit generation for integer and floating-point division; it was also largely responsible for addressing both the integer and floating-point unit register files, maintaining the condition codes, and the portion of VR1 which functioned as VC.
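The four-source next-address selection described for the microstore can be sketched as follows. The field names, encodings, and exception vector below are assumptions for illustration; only the four sources, the 8-deep stack, and the up-to-4-way conditional branch come from the text:

```python
# Sketch of a microsequencer choosing among an instruction-derived entry
# point, the microword's next-address field, an 8-deep call/return stack,
# and an exception/reset vector.

class MicroSequencer:
    STACK_DEPTH = 8

    def __init__(self, exception_vector=0x1FF0):   # vector value assumed
        self.stack = []
        self.exception_vector = exception_vector

    def next_address(self, microword, entry_point=None, exception=False,
                     condition_bits=0):
        """Select the next microstore address; condition_bits (0-3) models
        the up-to-4-way conditional branch in the low address bits."""
        if exception:                       # exception/reset overrides all
            return self.exception_vector
        op = microword["seq_op"]
        if op == "entry":                   # dispatch a new macroinstruction
            return entry_point | condition_bits
        if op == "call":                    # push return point, then branch
            assert len(self.stack) < self.STACK_DEPTH, "microstack overflow"
            self.stack.append(microword["return_addr"])
            return microword["next"] | condition_bits
        if op == "return":                  # resume the pushed address
            return self.stack.pop() | condition_bits
        return microword["next"] | condition_bits   # plain next-address
```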

The floating-point execution unit performed all floating-point instructions except division, plus integer multiplies. It provided hardwired and pipelined add/subtract/convert and multiply/reciprocal functional units with complete 64-bit data paths, and a hardware implementation of reciprocal approximation using Newton-Raphson iteration. The functional units had latencies of 2 and 3/4 (single/double) clocks respectively, with throughputs of 2 elements per clock for single precision and 1 for double precision, except multiply and reciprocal, which had a throughput of 1/2 for double. These two pipelined functional units could be chained together when their vector instructions had a common destination vector register.
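Newton-Raphson reciprocal approximation, which the text says this unit implemented in hardware, uses only multiplies and subtracts, so it maps naturally onto a multiply pipeline. The sketch below shows the technique in software; the linear seed and iteration count are generic textbook choices, not the NP1/NP2 values:

```python
import math

def reciprocal(a, iterations=4):
    """Approximate 1/a for a > 0 by Newton-Raphson iteration; each
    iteration roughly doubles the number of correct bits."""
    m, exp = math.frexp(a)           # a = m * 2**exp, with 0.5 <= m < 1
    x = 48/17 - (32/17) * m          # linear seed, max relative error 1/17
    for _ in range(iterations):
        x = x * (2.0 - m * x)        # Newton step for f(x) = 1/x - m
    return x * 2.0 ** (-exp)         # undo scaling: 1/a = (1/m) * 2**(-exp)
```

The error e = 1 - m*x squares on each step, so a seed good to 1/17 reaches full double precision in four iterations; a hardware seed table would trade ROM bits for fewer multiply passes.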

Vector instructions were implemented using microprogrammed iteration over the elements, in a pipelined manner if the element operation was pipelined, which was the case for nearly everything. Both execution units maintained a 3 deep by 64-bit operand prefetch FIFO to capture memory operands and immediates, as well as register file copies: dual read ported copies of the GPRs, BRs, and VRs in the integer unit, and quad read ported copies in the floating-point execution unit. Likewise, the instruction decoder maintained a quad read ported copy of GPR1-7 and BR1-7 to perform the address calculations of two instructions. File coherence was maintained by snooping the 64-bit execution unit result buses over which both register and memory write data were transferred. RAW forwarding was supported by all of the files for the GPRs and BRs, but not for the VRs; the need for VR forwarding was avoided by special handling of the degenerate short vector cases in firmware. However, the floating-point execution unit incurred a 1 clock register reuse penalty for RAW conflicts between adjacent instructions.

NP2 not only supported the synthesis of vector register/memory instructions performed in NP1, but also could chain a vector store to just about any matching precision vector instruction with an identical register operand, as well as nearly all combinations of two vector register/register instructions utilizing different floating-point unit functional units and dependent destination register operands.

Some additional aspects differentiated NP2 as well.


The NP2 CPU was also completely ECL, but otherwise utilized completely different technology than the NP1 system components, and it was predominantly implemented with ASICs, unlike the NP1 CPU. These were Motorola MCA2500 (about 2.5K gates) and MCA1500M (1.5K gates plus 1Kb of RAM). The static ECL RAMs and the limited amount of SSI were Fairchild 100K surface mounted onto small PCBs called "modules", which were thru-hole mounted onto the main PCBs just like the sockets for the ASICs. The PCBs remained 17" by 20", but with 20 layers, of which 12 were 50 ohm signal layers.

An NP2 CPU consisted of 4 PCBs: I, E, F, and C, consuming 4 physical slots in the backplane occupying the same space as the 7 physical slots for NP1, and utilizing 6 row connectors instead of 4 row, as well as more interconnect over the foreplane connectors. The E board contained the microstore, sequencer, integer execution unit, control unit serial interface, and on board instrumentation similar to NP1. The F board contained the floating-point execution unit, while the I board contained the instruction FIFO, decode and address calculation logic, and TLB. The C board contained the cache and system bus interface. Again, LSSD was not provided, and the same technique of providing visibility of major state items, while clocks were stopped, back to the control unit over the serial bus was utilized. Cooling remained forced air, and the NP2 CPU had an average power dissipation of 2500W; that's not a misprint either.

CAD/DV Software

While the Daisy schematic editor was still used to enter ASIC schematics, simulation, as opposed to prototype checkout, was the primary design verification vehicle, since the extensive use of ASICs made the latter approach impractical. A proprietary RTL simulator was used for interactive simulation and debug of a full CPU model, achieving 0.5 to 1 cycle/second; a Zycad simulation accelerator was used for regression. A proprietary technology mapper was also used to some extent to generate ASIC schematics from equations. These were augmented with a proprietary static timing analyzer. The physical CAD, prototype, and production verification processes were similar to those used for NP1, with the significant difference that design changes usually involved ASIC changes.

Strong Points

NPL was a relatively clean vector ISA with robust multiprocessing support, particularly with the NP2 extensions. NP1 was a solid implementation of it, accelerating vector operations somewhat over what would have been possible with the scalar ISA alone; NP2 shared these characteristics, but was a much better implementation of a vector machine with a single main memory access port, which was ultimately the performance limiting factor, along with the predominantly scalar instruction execution model. The vector ISA was obtained at comparatively small cost, comprising mainly the larger register files, microstore, and supporting addressing structures; since these were for the most part scalar issue machines implemented with discrete SRAMs, this did not have a large adverse cycle time impact.

NP1 offered performance similar to that of the Convex machines with the Arithmetic Accelerator, having a higher clock rate due to the technology, but being less aggressive architecturally, as it could not operate on two single precision numbers per clock, which Convex could. NP2 addressed that weakness and was even more ambitious architecturally and electrically; unfortunately, it never shipped in volume. Neither could be considered a very successful product from a business standpoint, as the total installed base of NP1 was on the order of 100s of machines. However, SEL and Gould never received credit, at the time or subsequently, for repeatedly and successfully utilizing ECL to achieve significantly higher frequencies, or for implementing multiprocessing machines long before their primary competition in this area, Convex and DEC.

Weak Points

NPL's four primary weaknesses were its floating-point arithmetic, the I/O system, the interrupt system, and the compiler. The proprietary Gould floating-point format became a growing liability as IEEE-754 gained adoption concurrently with the development of NP1 and NP2, to the point where future NPL machines would have had an IEEE-754 operating mode. The I/O system was elegant, but also slow due to the substantial overhead involved in moving data between semaphore protected lists; this sometimes caused I/O throughput to dip below the capabilities of the device I/O buses. Similarly, the interrupt system had even more semaphore protected hot spots and was by design incapable of ensuring real time response. Finally, the quality of compiler generated code consistently lagged behind the capabilities of the CPUs, which eroded NPL's sole competitive advantage with respect to Convex.

Strategically, the NP1 CPU was not ambitious enough given the amount of time and money required to undertake it. When it arrived in the marketplace, its advantages were insufficiently compelling for it to overcome its already established nearest competitor, Convex. It was said that Bill Ward really wanted to build NP2 first, but it was deemed too risky, as the required technology, MCA2500, was very immature in the 1982-83 time frame, and SEL/Gould had no ASIC expertise. If that was the case, it would be a tough decision to make even with the benefit of hindsight.

Lessons Learned

NPL's fundamental failing was that it was not the first or a compelling entry into a new company market and simultaneously was not a desirable product in the company's existing real-time market. Furthermore, it was by design not adaptable to the existing market, a situation which was allowed to occur by the lack of an ongoing RTOS development effort for NPL. Consequently, the small installed base and difficulties experienced by the sales force are not surprising.


If anyone has one, please contact KevinK@acm.org.


The author gratefully acknowledges the following individuals, listed in alphabetical order, who participated in the NPL project and contributed to this synopsis or reviewed it:

Doug Beard

Greg Brinson

Bryan Hornung

Mark Lipford

Jeff Lohman

Keith Shaw

Mark Shaw

Alex Silbey

Mike Wastlick
