The thrust of the analysis was to provide credible evidence to support architectural and implementation trade-offs. The major areas of focus were
• Memory system organization
• Printer interface performance
• Main memory bandwidth
• Overall system performance
Memory System Organization
The statistical analysis of the trace information provided many clues to direct our investigation toward the optimum memory system architecture. The overall read-to-write ratio for the observed benchmarks ranged from as low as 4.5:1 up to 5.5:1, which means that for a write-through cache system with a theoretical 100 percent read hit rate, the writes would degrade
the overall hit rate to approximately 81 to 84 percent. As the analysis of the data progressed, it was understood that the write data must be studied very closely since it could have a dramatic impact on the overall cache miss rate.
During the cache model simulations, the hit rates of the I-stream were between 85 and 90 percent. However, the D-stream hit rates were between 35 and 45 percent, with writes accounting for 60 to 90 percent of the total D-stream misses. Improving write-miss performance therefore offered the greatest positive effect on the hit rate of the system. The two options were to implement a write-back cache or to add a write buffer to the system. Further cache simulations showed that a write buffer would provide an 8 to 16 percent overall system performance improvement, which was equal to that of a write-back cache. The write buffer, however, was the more straightforward solution to implement.
Cache analysis revealed that the processors required different memory architectures. The CVAX had an internal 1KB, two-way set associative cache. This was to be configured as a mixed I- and D-stream cache. An additional 32KB to 64KB, two-cycle write-through cache was to be added externally. This would also be configured as a mixed I- and D-stream cache. A single-longword, two-cycle write buffer would provide enough buffering to reduce the dramatic impact of write misses.
The SOC was proposed to have an internal write-back cache between 5KB and 8KB, with each 1KB region making up a single set. Cache simulations indicated that with a minimum internal mixed I- and D-stream cache of 5KB, five-way set associative, an external data cache would have to be over 64KB to have even a negligible effect on overall system performance. Therefore no external cache was recommended. To mitigate the write-miss penalty, a two-cycle write buffer of 4 to 6 longwords was recommended.
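The 81 to 84 percent ceiling quoted at the start of this section follows from the read-to-write ratio alone. As a minimal sketch, assuming every read hits and every write counts as a miss in a write-through cache, the best overall hit rate for a ratio of R reads per write is R / (R + 1). The function name is illustrative and is not taken from the original analysis.

```python
def write_through_hit_ceiling(reads_per_write: float) -> float:
    """Best-case overall hit rate when all reads hit and all writes miss."""
    return reads_per_write / (reads_per_write + 1.0)


# Read-to-write ratios observed across the benchmarks.
for ratio in (4.5, 5.5):
    print(f"{ratio}:1 -> {write_through_hit_ceiling(ratio):.1%}")
```

The two ratios give about 81.8 and 84.6 percent, which matches the range in the text and shows why the write traffic deserved close study.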
As an acceleration technique, the original PrintServer 20 controller contained a memory access capability that allowed data written to memory to be logically ORed with data that was already stored. This technique was particularly useful when the software system was writing the image that was ultimately printed. As part of the process of generating an image to print, the individual characters appearing on a page must be copied from a region of memory called the font cache to another region called the frame buffer. The frame buffer contains the actual data that is sent to the print engine. To complicate things, the data written to the frame buffer must be able to overlay data that may already be there, thus requiring a logical OR function.
When a document was printing at or near the maximum engine speed of 20 pages per minute, analysis showed this low-level copying function consumed approximately 20 percent of the total system time allotted to generate and print one page. Thus a logical OR function in the memory system would reduce the number of memory data cycles from "2 reads 1 write" to "1 read 1 write," and reduce the impact from a second read occupying a useful cache location. Without this capability, the degradation would be between 5 and 10 percent of overall system performance when printing at or near 20 pages per minute. Therefore memory capability with a logical OR function was recommended.
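The difference between "2 reads 1 write" and "1 read 1 write" can be illustrated with a toy memory model. This is a sketch under stated assumptions, not the controller's implementation: the `Memory` class, its `write_or` method, and the word-level addressing are all hypothetical, and cycles are counted only as the processor sees them.

```python
class Memory:
    """Hypothetical word-addressable memory that counts processor data cycles."""

    def __init__(self, size: int) -> None:
        self.words = [0] * size
        self.reads = 0
        self.writes = 0

    def read(self, addr: int) -> int:
        self.reads += 1
        return self.words[addr]

    def write(self, addr: int, value: int) -> None:
        self.writes += 1
        self.words[addr] = value

    def write_or(self, addr: int, value: int) -> None:
        # OR-mode access: the memory system merges the new data with the
        # stored word, so the processor issues only one write cycle.
        self.writes += 1
        self.words[addr] |= value


def copy_glyph_software(mem, font_addr, frame_addr, length):
    """Overlay a glyph by reading both words and ORing in the processor."""
    for i in range(length):
        glyph = mem.read(font_addr + i)
        existing = mem.read(frame_addr + i)
        mem.write(frame_addr + i, existing | glyph)


def copy_glyph_or_mode(mem, font_addr, frame_addr, length):
    """Overlay a glyph using the OR-mode write."""
    for i in range(length):
        mem.write_or(frame_addr + i, mem.read(font_addr + i))


def demo():
    results = []
    for copy in (copy_glyph_software, copy_glyph_or_mode):
        mem = Memory(16)
        mem.words[0:4] = [0b1010, 0b0001, 0b0000, 0b1111]   # font cache
        mem.words[8:12] = [0b0101, 0b0001, 0b1000, 0b0000]  # frame buffer
        copy(mem, 0, 8, 4)
        results.append((mem.words[8:12], mem.reads, mem.writes))
    return results


for frame, reads, writes in demo():
    print(frame, "reads:", reads, "writes:", writes)
```

Both paths leave the same frame buffer contents, but the OR-mode path issues half as many reads, and the frame buffer word never has to be brought through the cache.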
Printer Interface Performance
When a PrintServer 20 is printing, every page that exits the printer requires the 1MB frame buffer to be copied from memory to the print engine interface. Changing a program-controlled printer interface to one driven by a DMA device provided two significant advantages. The first was to reduce the real-time requirements on the PrintServer software system, and the second was to allow for a limited degree of parallelism on the controller. The parallelism was due to the ability of the processor to continue to execute from its cache memory system while the DMA device accessed memory. The processor only stops executing when a cache miss occurs.
Main Memory Bandwidth
With a CVAX processor configured as recommended in the section Memory System Organization, the main memory system bandwidth requirement of the processor was 60 percent. For the SOC, it was 70 percent when an existing DRAM controller was used. A DMA-driven printer interface required 15 percent, and an Ethernet interface required nominally 4 percent with bursts up to 20 percent. Each subsystem was scrutinized to reduce its required memory bandwidth. The resulting recommendation was to add a 32-bit bus to the memory subsystem to provide a dedicated channel for all data being sent to the printer interface. This provision would reduce required memory bandwidth for the printer interface from 15 percent to about 7 percent. The system would then have a nominal memory bandwidth requirement of 71 percent for a CVAX system and 81 percent for an SOC.
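The nominal totals above can be reproduced by adding the individual demands. The sketch below assumes the loads simply add, which is consistent with the 71 and 81 percent figures in the text; the constant and function names are illustrative only.

```python
# Percent of main memory bandwidth, as quoted in the text.
PROCESSOR_LOAD = {"CVAX": 60, "SOC": 70}
ETHERNET_NOMINAL = 4
PRINTER_SHARED = 15      # printer DMA over the main memory bus
PRINTER_DEDICATED = 7    # "about 7 percent" with the dedicated 32-bit bus


def nominal_load(processor: str, printer_load: int) -> int:
    """Sum the nominal bandwidth demands for one configuration."""
    return PROCESSOR_LOAD[processor] + printer_load + ETHERNET_NOMINAL


for cpu in PROCESSOR_LOAD:
    before = nominal_load(cpu, PRINTER_SHARED)
    after = nominal_load(cpu, PRINTER_DEDICATED)
    print(f"{cpu}: {before}% shared bus, {after}% with dedicated printer bus")
```

Under the same additive assumption, the configurations without the dedicated bus come to 79 and 89 percent, which leaves less headroom for Ethernet bursts.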
Digital Technical Journal Vol. 3 No. 4 Fall 1991
Design of the Turbo PrintServer 20 Controller
Overall System Performance
The execution characteristics of the original PrintServer 20 provided some interesting surprises. Most floating-point calculations were performed in double precision; and even more interesting, for each floating-point operation, there was a floating-point conversion from single to double precision, and then back again. Since the higher-precision operations were not required, a simple compiler switch removed the conversions and provided a 3 percent overall system performance improvement for floating-point intensive PostScript documents. A second surprise came from the results of the BM3 benchmark, which indicated a translation buffer hit rate of 85 percent. At the time of the discovery, the PrintServer 20 was configured with a standard MicroVAX processor; however, by substituting an rtVAX, which uses one fewer memory access to reference its page tables, an 11 percent system performance improvement was achieved. With this improvement, the rtVAX processor provided enough power to allow the original PrintServer 20 to ship with its 20-page-per-minute designation. This information led the turbo controller designers to determine that the translation buffer of the SOC would be large enough for all the entries required.
Results
The final analysis revealed that the expected performance of a CVAX or SOC processor would place either design on the low side of the performance requirement. Therefore close attention to detail would be required during the implementation phase of the project, as every ounce of performance mattered. The expectation was to have a choice between an SOC processor with a 40-ns cycle time and a CVAX processor with a 60-ns cycle time. The performance improvements of the two processors are compared in Table 1.
Table 1 Performance Improvement Relative to Original PrintServer 20 Controller
Benchmark   SOC Processor   CVAX Processor
BM1         4.7             3.7
BM2         4.9             4.0
BM3         4.3             3.3
HOUSE       4.9             4.2
SCHEM       4.7             3.7
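To compare the two candidate processors directly, the columns of Table 1 can be divided to give the projected SOC advantage over the CVAX for each benchmark. This is a small sketch using only the table values; the variable and function names are illustrative.

```python
# (SOC, CVAX) improvement factors relative to the original controller.
TABLE_1 = {
    "BM1": (4.7, 3.7),
    "BM2": (4.9, 4.0),
    "BM3": (4.3, 3.3),
    "HOUSE": (4.9, 4.2),
    "SCHEM": (4.7, 3.7),
}


def soc_over_cvax(table):
    """Return the SOC-to-CVAX performance ratio for each benchmark."""
    return {name: soc / cvax for name, (soc, cvax) in table.items()}


ratios = soc_over_cvax(TABLE_1)
for name, ratio in ratios.items():
    print(f"{name:6s} {ratio:.2f}")
```

The ratios fall between roughly 1.17 (HOUSE) and 1.30 (BM3), so the SOC design was projected to be faster on every benchmark listed.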
Image Processing, Video Terminals, and Printer Technologies
As the project schedule progressed, the risk associated with the new SOC processor decreased. As this risk window collapsed, it was understood that a turbo controller based on the SOC processor would not only perform better, but would also cost less as it would not require an external cache.
Turbo Controller Hardware Design
The turbo controller was destined for a relatively high-end printer. Therefore the hardware architecture had to provide maximum performance, even though this implementation would increase costs. Based on the results obtained during RETrACE analysis, the hardware design had the following implementation goals:
• The SOC would provide the CPU, the floating-point accelerator (FPA), and the cache subsystem. No second-level cache would be implemented.
• A four- to six-entry write buffer would be implemented.
• The transfer of bit-map data to the print engine would require a 32-bit DMA subsystem with scan-erase capability.
• The memory subsystem would support OR-mode memory access by the CPU and scan-erase access by the DMA controller.
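The benefit of a multi-entry write buffer can be seen in a toy cycle model. This is a sketch under stated assumptions rather than the controller's design: the processor attempts at most one write per step, memory retires one buffered write every `drain_cycles` cycles (set to 2 here as one reading of "two-cycle"), and the processor stalls only when the buffer is full. All names are hypothetical.

```python
def _tick(pending, countdown, drain_cycles):
    """Advance memory by one cycle; return updated (pending, countdown)."""
    if pending:
        countdown -= 1
        if countdown == 0:
            pending -= 1
            countdown = drain_cycles if pending else 0
    return pending, countdown


def count_write_stalls(write_pattern, depth, drain_cycles=2):
    """Count processor stall cycles for a FIFO write buffer of a given depth.

    write_pattern is a sequence of booleans, True where the processor wants
    to issue a write on that step.
    """
    pending = 0      # writes currently held in the buffer
    countdown = 0    # cycles until the oldest buffered write retires
    stalls = 0
    for wants_write in write_pattern:
        # Stall one cycle at a time while the buffer is full.
        while wants_write and pending == depth:
            stalls += 1
            pending, countdown = _tick(pending, countdown, drain_cycles)
        # Normal cycle: memory makes progress, then the new write is queued.
        pending, countdown = _tick(pending, countdown, drain_cycles)
        if wants_write:
            pending += 1
            if pending == 1:
                countdown = drain_cycles
    return stalls


burst = [True] * 4 + [False] * 8   # four back-to-back writes, then idle
for depth in (1, 4, 6):
    print(f"depth {depth}: {count_write_stalls(burst, depth)} stall cycles")
```

In this model a single-entry buffer stalls for 6 cycles on the four-write burst, while buffers of four or six entries absorb it without stalling.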
Although both the SOC and rtVAX chips comply with the VAX architecture standard and both are conceptually very similar, they have significant differences in the bus interface. For example, the SOC uses a quadword cycle (one 32-bit address followed by two 32-bit data reads) to fill one internal cache block, while the rtVAX processor, which does not support caching, does not use this type of cycle. Also, the clocking system on the SOC was enhanced, and the timing relationships between signals were modified to improve performance.
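The quadword cycle can be pictured as a single address phase that returns two adjacent longwords. The sketch below is a hypothetical model, not the SOC bus protocol itself: it assumes the quadword is naturally aligned and that the lower-addressed longword is returned first, neither of which is stated in the text.

```python
LONGWORD_BYTES = 4
QUADWORD_BYTES = 8


def quadword_fill(longwords, byte_address):
    """Model one quadword cycle: one address, two 32-bit data reads.

    longwords is main memory as a list of 32-bit values. Alignment to an
    8-byte boundary is an assumption made for this sketch.
    """
    base = (byte_address // QUADWORD_BYTES) * (QUADWORD_BYTES // LONGWORD_BYTES)
    return longwords[base], longwords[base + 1]


memory = [0x11111111 * n for n in range(8)]   # eight longwords of sample data
block = quadword_fill(memory, byte_address=12)
print([hex(word) for word in block])
```

A miss on byte address 12 fetches the two longwords covering bytes 8 to 15, which together fill one 8-byte cache block.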
The changes to the SOC bus interface, plus the required functional changes revealed by RETrACE analysis, meant that very little of the original PrintServer 20 controller design could be applied to the new controller. One of the first questions to be answered before the design of the turbo controller could begin was whether one or more ASICs would be required for the design. This question had to be answered for three subsystems: