Operational experiences with the TI Advanced Scientific Computer

by W. J. WATSON and H. M. CARR
Texas Instruments Incorporated
Austin, Texas

From the collection of the Computer History Museum (www.computerhistory.org)

INTRODUCTION

Since 1966 a large computer development program has been conducted by Texas Instruments. The goal for this effort was to provide needed capacity for supporting seismic processing, plus offering a general purpose capability for large scientific problems. This development has resulted in the Advanced Scientific Computer (ASC), a highly modular system offering a wide spectrum of processor power, memory sizes, and I/O capability.

The ASC is a high-speed, large-scale processing system featuring extensive use of pipelining, multiple arithmetic units, separate control processors, large and fast central memory, and extensive user software aids. The central processor has both scalar and vector instruction capabilities. First delivered in 1972 and placed into operational status during 1973, several operational ASC systems now offer extremely high processing rates for particular classes of problems.

OVERVIEW OF THE SYSTEM

The major subsystems of a typical configuration are shown in Figure 1: the central memory, the central processor, the peripheral processor, on-line bulk storage, a digital communications interface, plus a selection of standard peripherals.

The peripheral processor has been designed for executing the operating system. The central processor has been designed expressly to provide high computing speeds when operating upon large arrays of data. The central processor operates as a slave to the peripheral processor. This design approach was chosen to maximize the overlapping of system overhead tasks with the execution of user programs.

In operation the job stream is analyzed by the peripheral processor. The language processors, plus user object code, are executed by the central processor. System control and I/O tasks are processed by the peripheral processor. I/O is routed through high-speed, head-per-track disc storage. A data communications interface for the common carriers is provided for the support of remote batch and interactive terminals. Standard types of peripherals are also provided. The central memory serves as the common communications and access storage medium for these subsystems.

[Figure 1-Major ASC subsystems. Blocks: central processor (CP), peripheral processor (PP), central memory, disc storage, data communications (to common carriers), peripherals.]

CENTRAL MEMORY

The ASC central memory consists of a memory control unit (MCU) and appropriately sized modules of high-speed or medium-speed central memory. Optionally, a medium-speed central memory extension can be used in conjunction with a high-speed memory.

The MCU is organized as a two-way, 256-bit/channel (8-word) parallel access traffic net between eight independent processor ports and nine memory buses, with each processor port having full accessibility to all memories. The nine memory buses are organized to provide eight-way interleaving for the first eight buses, with the ninth bus used for the central memory extension. The MCU provides the facilities for controlling access from the eight processor ports to a central memory (CM) having a 24-bit address space (16 million words). A port expander can be utilized to expand the number of processor ports. Figure 2 illustrates this structure.

The semiconductor high-speed central memory modules have a cycle time of 160 ns and a read time of 140 ns. Additionally, all transfers are 256 bits (eight 32-bit words) with a Hamming code providing single-bit error correction and double-bit error detection for each 32-bit word. High-speed central memory is typically divided into eight equal-sized modules which allow for eight-way interleaving.
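The eight-way interleaving described above can be illustrated with a small sketch. The text gives the 24-bit word address space, the 8-word (octet) transfer unit, and the eight interleaved modules; it does not give the bit assignment, so the layout below (consecutive octets rotating across modules) is an assumption made only for illustration, and the function names are hypothetical.

```python
WORDS_PER_OCTET = 8        # from the text: every transfer is 8 words (256 bits)
INTERLEAVED_MODULES = 8    # from the text: eight-way interleaving
ADDRESS_BITS = 24          # from the text: 24-bit word address space


def decode_address(word_address):
    """Split a word address into (module, octet_within_module, word_in_octet).

    Assumed layout (not stated in the paper): consecutive octets are
    assigned to consecutive modules, wrapping around after the eighth.
    """
    if not 0 <= word_address < 2 ** ADDRESS_BITS:
        raise ValueError("address outside the 24-bit word address space")
    octet_index, word_in_octet = divmod(word_address, WORDS_PER_OCTET)
    octet_within_module, module = divmod(octet_index, INTERLEAVED_MODULES)
    return module, octet_within_module, word_in_octet


def modules_touched(start_address, word_count):
    """List the module used by each successive octet of a sequential block."""
    first_octet = start_address // WORDS_PER_OCTET
    last_octet = (start_address + word_count - 1) // WORDS_PER_OCTET
    return [octet % INTERLEAVED_MODULES
            for octet in range(first_octet, last_octet + 1)]


if __name__ == "__main__":
    # A sequential 64-word block touches each of the eight modules once.
    print(modules_touched(0, 64))
```

Under this assumed layout a long sequential array reference never returns to the same module until seven other modules have been visited, which is the property that lets a 160 ns module cycle time keep pace with a faster stream of octet requests.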
The optional central memory extension allows large amounts of medium-speed memory (1 µs semiconductor technology) to be used in the normal address space of central memory. Block transfer between memory extension and high-speed memory is controlled by the peripheral processor and will transfer at a rate of 40 M words per second. Memory mapping registers and protection registers are used to facilitate central memory management and access control of the ports.

[Figure 2-Modular structure of the ASC central memory. Labels: primary memory access ports, secondary memory access ports, memory control unit (MCU), interleaved high-speed or medium-speed memory modules, memory extension (optional) of interleaved medium-speed memory modules.]

CENTRAL PROCESSOR

The central processor provides both scalar (single operand) and vector (array) instructions at the machine level.

The CP has 48 program-addressable registers. This group of 32-bit registers consists of sixteen base address registers, sixteen arithmetic registers, eight index registers, and eight vector parameter registers. This last group is used to extend the instruction format for the complete specification of vector instructions. The basic instruction size is 32 bits, with 16-, 32-, or 64-bit operands.

The CP scalar instruction repertoire includes an extensive set of load and store instructions: halfword, fullword, and doubleword instructions, with immediate, magnitude, and negative operand capabilities. Ability to load and store register files and to load effective addresses is also available. Arithmetic scalars include various adds, subtract, multiply, and divide for halfword (16-bit) and fullword (32-bit) fixed point numbers and fullword and doubleword (64-bit) floating point numbers. Scalar logical instructions are provided, as are arithmetic, logical, and circular shifts. Various comparison instructions and combination comparison-logical instructions are provided for halfword, fullword, and doublewords. Many combinations of test and branching instructions with incrementing or decrementing capability are also available. Stacking and modifying arithmetic registers can be done with single instructions. Subroutine linkage is accomplished through branch and load instructions. Format conversion for single and doublewords, as well as normalize instructions, are available.

The vector capabilities of the CP are made available through the use of VECTL (vector after loading vector parameter file) and VECT (assumes parameter file is already loaded) instructions. The vector repertoire includes such arithmetic operations as add, subtract, multiply, divide, vector dot product, matrix multiplication, and others for both fixed point and floating point representations. Vector instructions are also available for shifting, logical operations, comparisons, format conversions, normalization, and special operations such as Merge, Order, Search, Peak Pick, Select and Replace, among others. One important characteristic of the vector instruction capability is the ability to encompass three dimensions of addressability within a single vector instruction. This is equivalent to a nest of three indexing loops in a conventional machine.

The basic structure of the CP, shown in Figure 3, has three major components: the instruction processing unit (IPU) for non-arithmetic stages of instruction processing for the CP instruction stream, the memory buffer unit (MBU) to provide operand interfacing with the central memory, and an arithmetic unit (AU) to perform the specified arithmetic or logical operations.

The central processor design is such that one, two, three, or four execution units or "pipes" can be provided. These units employ the pipeline concept in both scalar and vector modes. From one to four vector results can be produced every 60 ns, depending on the number of execution units provided. A single execution unit can have up to twelve scalar instructions in process at one time. The single instruction stream, which contains a mixture of scalar and vector instructions, is preprocessed by the instruction processing unit.

Figure 3 shows a CP diagram for 2- or 4-pipeline CP's, each with a corresponding number of MBU-AU pairs. Note that a memory port is required for the IPU and, in addition, one memory port for each pipeline (MBU-AU pair) in a CP.

[Figure 3-Basic structure of the CP. Panels: two-pipeline CP and four-pipeline CP, each showing the IPU, the MBU-AU pairs, and the primary memory ports.]

A significant feature of the CP hardware is an operand look-ahead capability which causes memory references to be requested prior to the time of actual need. Double buffering in multiple 8-word (octet) buffers for each pipeline provides a smooth data flow to and from each arithmetic unit.

Instruction processing unit

The primary function of the instruction processing unit (IPU) is to supply a continuous stream of instructions for execution by the other parts of the CP. One central memory port is required to provide the instruction stream. Instructions are transferred from memory in octets, as are all other references to memory for fetching or storing of information. Two 8-word (octet) buffers are utilized to achieve a balanced stream of instructions from memory to the IPU.

The IPU performs routing of instructions to the MBU-AU pairs based on an optimum use of arithmetic unit capability. Vector processing is altered by software in order to distribute segments of the vector for multiple pipe systems.

Up to 36 instructions in various stages of execution can be overlapped within the 4-pipe CP. There are twenty positions for instructions in the 2-pipe CP and twelve positions for instructions in the 1-pipe CP. Four levels are contained within the IPU, and eight levels are contained in each arithmetic pipeline (MBU-AU pair). Several features are provided to alleviate the potential problems of branches and instruction dependencies in the instruction pipeline.

Memory buffer unit

The memory buffer unit (MBU) provides an interface between central memory and the arithmetic unit. Its primary function is to supply the arithmetic unit with a continuous stream of operands from memory and to provide for the storing of the results back to memory. All references to memory, whether for fetching or storing, are made in 8-word increments (octets). The MBU has three double buffers, one octet per buffer, called the "X" and "Y" buffers for input and the "Z" buffers for output. This double buffering is provided so that pipeline processing can be sustained at a high rate with minimal memory access conflicts.

Arithmetic unit

The primary function of a CP arithmetic unit (AU) is to perform the arithmetic operations specified by the operation code of the instruction currently at the AU level. There is one AU per pipeline in the CP, each having a 60 ns basic cycle time.

A distinguishing feature of an AU is the pipeline structure which allows efficient execution of the arithmetic part of all instructions. There are eight exclusive partitions of the AU pipeline involved, each of which can provide an output every 60 ns. These eight sections are (1) receiver register, (2) exponent subtract, (3) align, (4) add, (5) normalize, (6) multiply, (7) accumulate, and (8) output. Figure 4 shows how different sections of the AU are utilized for execution of particular instructions, i.e., floating point addition and fixed point multiplication.

[Figure 4-Arithmetic unit pipeline. Two paths, FLOATING ADD and FIXED MULT, through the sections: receiver register, exponent subtract, align, add, normalize, multiply, accumulate, output.]

The pipelined AU achieves its highest sustained flow rate in the vector mode, typically a result each 60 ns per AU, or an average of 15 ns per result for a 4-pipe central processor.

An AU is a 64-bit parallel operating unit for most scalar and vector instructions. Exceptions are double length multiply and all types of division. In these circumstances various combinations of the components of the AU are utilized; therefore, more than one clock cycle is required to complete these arithmetic operations.

THE PERIPHERAL PROCESSOR

The peripheral processor (PP) is a powerful multiprocessor designed to perform the control and data management functions of the ASC. The PP is a collection of eight individual processors called virtual processors (VP's). Each VP has its own program counter along with arithmetic, index, base, and instruction registers. The eight VP's share a read only memory, an arithmetic unit, an instruction processing unit, and a central memory buffer.

Use of the common units is distributed among the VP's using sixteen single 85 ns cycles. When an equally distributed sequence of time units is used, each of the eight VP's receives two 85 ns cycles every 1.4 µs. The distribution of available time units can be dynamically varied to suit particular processing requirements. The typical PP instruction requires two 85 ns cycles for completion.

The instruction format is similar to that of the central processor, using a 32-bit word for each instruction. Instructions are provided for bit (1 bit), byte (8 bits), halfword (16 bits), and fullword (32 bits) operations. Because the PP is intended to perform control functions rather than execute mathematical algorithms, the instruction set is oriented toward control operations and does not require multiplication, division, or floating point operations.

Several aspects of the implementation of the peripheral processor concept greatly increase the effectiveness of the ASC system. Each VP has direct access to the entire central memory for program execution and data storage. Therefore, a single copy of reentrant code can be executed simultaneously by more than one VP. The 4K 32-bit words of read only memory within the PP are utilized for program storage and execution of those short routines which are highly utilized by the VP's, such as polling loops.

The communications register (CR) file contains sixty-four 32-bit word registers which are program addressable by the VP's. The CR file serves as the principal storage medium for control information necessary for the coordination of all parts of the ASC system. A subset of the system's peripherals can also be interfaced via the communications register file.

DISC STORAGE

Disc storage is the principal secondary storage system for the ASC system. Disc storage consists of head-per-track (H/T) disc systems supplemented by positioning-arm disc (PAD) systems. Each H/T disc module has a capacity of 25 million 32-bit words with a transfer rate of approximately 500K words per second.

The H/T disc system is a high-performance device whose effective performance is further enhanced because the operating system utilizes a shortest-access-time-first (SATF) algorithm for data transfers. Using the shortest-access-time-first algorithm, access time will average approximately 5 ms, which results in an exceptionally fast "effective" transfer rate. This combination of hardware and software provides a very high effective transfer rate.

DATA COMMUNICATIONS

The data communication system is very modular and, thus, externally flexible in the various devices which may be utilized for communication with the ASC. Data communications are controlled by a data concentrator which, in turn, interfaces to the MCU through a channel control device. The data concentrator is a TI-980A minicomputer equipped with special-purpose hardware communication interface units on its direct memory access ports.

The data communications system presently supports communication with three types of stations: high-performance user terminals, other large computers, and remote concentrators. These stations may be either remote or local. Remote links are presently implemented with nonswitched, full duplex common carrier data transmission facilities. Data is transferred over these links synchronously at rates determined by the modems and common carrier bandwidths. The data communication system supports transfer rates up to a maximum of 240,000 bits per second. The system can be easily extended to support smaller terminals down to the teletype level.

PERIPHERALS

Standard types of magnetic tape drives, card equipment, and printers have been interfaced with the ASC. These interfaces attach to primary or secondary memory ports through a variety of standard selected and multiplexed data channels.

SYSTEM SOFTWARE

Software design and development for the ASC system has progressed in parallel with development of the hardware. This was accomplished through the use of simulators, metaassemblers, and higher level programming languages implemented on the systems supporting Texas Instruments' Corporate Information Center. Thus, the first version of this software was placed into operational status with the ASC prototype machine. The major software capabilities are discussed in the next few paragraphs with emphasis being given to those attributes which provide comprehensive and flexible programming facilities for the user.

ASC Fortran language

The most obvious interface between the ASC system and a user is with the translation of the user-written program into machine level instructions that efficiently utilize the special hardware features in the system. Texas Instruments has attempted to make this interface a smooth one by effort invested in compiler techniques. The result of this effort is the ASC NX Compiler, a highly optimizing, user oriented software package that will produce code acceptable to a central processor with one, two, three or four pipelines (arithmetic units). Its primary function is to translate Fortran code into object code which will execute the program in the shortest possible time.

The ASC Fortran compiler was designed to meet the demands of the professional programmer. The ASC Fortran compiler produces highly optimized object code with complete diagnostic analysis and messages. In general, the optimizing task is accomplished by performing optimization on the source program logic and on the object code instructions produced. The optimizing algorithms encompass such areas as conventional optimization, instruction scheduling, and vector generation with optimization. Scalar operations are reordered wherever possible without affecting results, so as to minimize both pipeline and memory reference delays. Vector instructions are used where feasible; however, this action may be overridden by the use of a Fortran compiler specification option. In addition, the compiler provides a complete set of informative messages regarding applied optimization procedures and where source program logic prevents optimization.

The ASC's Fortran language is an extension of ANS Fortran. Because the ASC has both scalar and vector instructions, the compiler has the capability to recognize array-oriented operations specified in standard Fortran and to generate the equivalent vector instructions to perform the required operations. To provide the programmer direct access to the specialized vector instructions, array intrinsic and array generation intrinsic functions are provided. The added language features permit the ASC Fortran programmer to define and use subarrays, cross-sections of arrays or subarrays, array assignment statements, and array intrinsic functions. This is not to provide unique access to hardware features, but to simplify the programming required for complex problems.

Mathematical library

The ASC Mathematical Subprogram Library is unique in that it uses both scalar and vector capabilities. The scalar function subprograms include all of the single and double precision functions traditionally provided in Fortran libraries. In particular, the library contains all of the ANS Fortran mathematical functions and all of the IBM S/360 Fortran mathematical functions.

The vectorized math function subprograms exploit the vector instruction set of the ASC. A single call to a vectorized math function subprogram causes that function to be evaluated for the entire vector of arguments. The evaluation is effected by a sequence of vector instruction executions. The Fortran compiler employs the vectorized math subprograms to replace multiple calls to a scalar subprogram when possible. Both the scalar and the vectorized math function subprograms can be used by the Fortran and assembly language programmer.

Assembler

The ASC Assembler is a meta-assembler or translator which facilitates symbolic coding of the ASC Central or Peripheral Processors at the instruction level.

Linkage editor

The ASC Linkage Editor creates a load module for execution by linking separately assembled or compiled object modules obtained from the job input stream, user libraries, or system libraries. Linking is accomplished by relocation, by resolving external references, and by allocating virtual memory.

Job specification language

The Job Specification Language (JSL) is a user-oriented language. It is an extendible (macro facility), programmable specification language rather than a set of control cards. It allows the user to specify the programs to be executed, the data files to be made available, the dependencies, if any, between individual programs of a job, and various cataloging and data management functions. The user may specify and control a job without detailed knowledge of the Operating System.

The Job Specification Language is composed of job definition statements, program processing statements, file processing statements, cataloging statements, and macro definition statements. The philosophy has been to provide many explicit statements with relatively few parameters for each, rather than a few statements with many operand fields that provide all functions. Wherever possible, default conditions have been built into this language so that only a minimum specification need be given by the user.

The language provides JSL variables which allow the programmer to pass control information to and among CP programs at execution time. JSL control statements can be used to test these variables to determine the programs to be executed next. An executing job can initiate a deferred job; the decision to do so could be based on the value of a JSL variable within the executing job.

Operating system

The ASC General Purpose Operating System (GPOS) schedules and allocates system resources in response to user service requests in a multiprogramming environment. GPOS provides input/output service, data transfer within the system, file management services, and other system services in a straightforward manner. The utility and accessibility of the Central Processor to user programs is increased by GPOS performing all overhead functions in the Peripheral Processor.

The design of GPOS exploits hardware features unique to the ASC. Most important of these features is complete access to central memory by the PP. Thus, a single reentrant copy of code is available to all processors, and only a branch instruction is needed to switch a Virtual Processor from one function to another. The Communications Register (CR) file is used to allow one VP to control the other seven, while common access to the rest of this file supports communication between the processors and other system components.

The overall system architecture is maintained to accommodate hardware and software system growth and flexibility. The operating system isolates the control, scheduling, and resource allocation algorithms for ease in "tuning" the system to match the specific requirements of each installation. GPOS, by its simplicity and modular design, minimizes the system use of central memory with a small resident system and the remainder of the system non-resident.

OPERATIONAL HISTORY

The prototype ASC initially completed its checkout during the Spring of 1971. The system (Serial #1) was available for use as a software development tool and for customer demonstrations for the remainder of 1971. In 1972 the prototype was moved to a permanent location at the TI facility in Austin. During the period of downtime, a retrofit of the hardware was carried out to incorporate the latest version of circuits and boards and to support a production environment. System 1 was operational early in 1973 and is currently being devoted to software development and support of application program conversion to the ASC. ASC #1 is configured with a one-pipe central processor, 128K words of high-speed central memory, 128K words of memory extension, a complement of head-per-track disc storage, a data communications interface, plus standard tape and paper devices.

Experience with an ASC operating in a center devoted to seismic production work is currently being gained in the TI facility at Amstelveen, Holland. This system (Serial #2) was delivered early in 1973 and essentially duplicates the capabilities described for the prototype machine. Additionally, several seismic interactive terminals are interfaced both locally and remotely to this system.

Seismic operational requirements are characterized by large data bases, much magnetic tape input and output, many job steps composed of long computational sequences, and the need to precisely control a complicated series of such jobs. In addition to the high computational speeds available on the ASC, the seismic center experience is showing that other ASC features are valuable when applied to this application. Head-per-track disc storage, management of the data bases and scheduling by the dedicated virtual processors, and job control available via the JSL language appear to match the environment of seismic work. Improved productivity of geophysicists and geologists through real-time interactive sessions is being achieved. The system is well supporting the requirements by generating significant improvements in unit processing costs and by permitting new processing technologies to be economically feasible. It is expected that the use of ASC for seismic processing capacity will continue to grow at a rapid rate.

Operational experience has also been gained from the application of the ASC to the U.S. Government data-processing problem of ballistic missile defense. Serial #3, a one-pipe ASC with a configuration similar to the previously described systems, was delivered to the U.S. Army in the Summer of 1973. It is to be used for research into processing techniques employed in ballistic missile defense. Results obtained from the system while undergoing final checkout at TI's facility showed the speeds available to be several times faster than other current computer systems.

Application to long-range prediction of the earth's weather is the intended use of the largest and fastest ASC to be built to date. The National Oceanic and Atmospheric Administration (NOAA) has contracted for an ASC (Serial #4) for its Geophysical Fluid Dynamics Laboratory at Princeton University. Delivery is scheduled for early in 1974. The ASC is configured with a four-pipe central processor, one million words of high-speed central memory, two channels of high density secondary storage devices, head-per-track disc, text editing terminals, and standard magnetic tape and paper devices. This configuration is illustrated in Figure 5.

[Figure 5-GFDL ASC configuration. Labels: memory port expander; H/T disc controller and disc interface units with H/T discs (25M words, 500K words/sec each); two 1500 card/min card readers; three 1200 line/min line printers; two 100 card/min punches; operator communication (two CRTs); text editing CRTs (two); tape controller and tape switching unit with 6 dual density 9-track 800/1600 BPI tape drives and 3 dual density 7-track 556/800 BPI tape drives; secondary storage channel number 1 and channel number 2.]

Much experience has been gained using benchmark programs derived from weather models and the actual weather prediction codes themselves. Emphasis has been upon Fortran code generated by analysts and weather scientists instead of hand-optimized machine language. For weather codes characterized by large data bases that are updated frequently, sequences of heavy computational work using the data, and mathematical operations performed on long arrays of data, the ASC is proving to be a valuable asset. The large central memory enables one to maintain ample data so that the central processor is utilized to a very high degree. The I/O and multiprogramming capabilities managed by the operating system resident in the peripheral processor also support high CP workloads.

Applications programs are written in standard Fortran, and no need has been found to supplement the available compiler optimization by additional hand coding. American National Standard Institute (ANS) Fortran is completely sufficient, and vector instructions are readily produced from this Fortran. ASC extensions to the Fortran are sometimes found to be useful, not to provide unique access to some hardware feature but to simplify notation involved in writing the program so that the programmer can deal more directly with the mathematics of the application.

MAXIMIZING PERFORMANCE

Experience thus far has shown that, for the applications that have been considered by ASC users, the most cost-effective performance is realizable when the capabilities of ASC Fortran and the optimizing compiler are used. Although particular sequences of code can be found wherein hand coding will improve the speed of execution, for the broad range of programs where much applications code is involved, compiler-generated object code is the best choice.

The ASC system design allows easy user access to performance enhancement through the use of additional central processor "pipes." Compiler software is responsible for both the generation of vector instructions and the partitioning of these vector operations over multiple pipes. Partitioning of scalar instructions for multiple pipes is carried out by the CP hardware. Thus, the burden is on the system instead of the user. Programs compiled for one-pipe ASC's will execute correctly on multiple-pipe systems. Performance will be increased via a recompilation for the multiple-pipe machine.

Protection of the user from vector hazard conditions is carried out by the compiler. For mixtures of vector instructions and for mixtures of scalars and vectors, the compiler prevents illegal conditions by the use of directive instructions for the CP to operate in either parallel mode (FORK) or sequential mode (JOIN). Extensive checks are made by hardware to protect the user from illegal scalar conditions that might occur.

Some typical examples of efficient code produced from present applications will illustrate the optimization level provided by the system. Table I shows the type of instruction generated by the compiler from a typical triple-nested DO LOOP. (1) gives the Fortran source with three levels of indexing, (2) is an alternate notation that could be used, and (3) is the single vector instruction produced. It is a floating vector multiply instruction preceded by the loading of the vector parameter registers. Three levels of addressing changes are implied in this case, and the hardware is designed to comprehend this level of indexing complexity.

TABLE I-Simple Examples of Vectors

(1)       DO 10 K=1,50
          DO 10 J=1,50
          DO 10 I=1,50
      10  Z(I,J,K) = X(I,J,K) * Y(I,J,K)

(2)       Z = X*Y

(3)       VECTL (#460, B2)
          VMF

Table II gives some typical code found in weather models. A double-nested DO LOOP with typical indexing conventions is shown in (1). All instructions are vectors, and the necessary indexing information for addressing purposes is contained in each vector parameter file. No scalar instructions are necessary in this example.

TABLE II-Vector Instructions Produced from Weather Code

(1)       DO 100 K=1,10
          DO 100 I=1,144
          TBXY(I,K) = (T(I+1,K,J) + T(I,K,J)) * 0.5
          PXY(I,K)  = (PS(I+1,J) - PS(I,J)) * RDX(JC)
          PBXY(I,K) = (PS(I+1,J) + PS(I,J)) * 0.5
     100  TXY(I,K)  = (T(I+1,K,J) - T(I,K,J)) * RDX(JC)

(2)       VECTL (#3B8, B2)    VAF
          VECTL (#3C0, B2)    VMF
          VECTL (#3C8, B2)    VSF
          VECTL (#3D0, B2)    VMF
          VECTL (#3D8, B2)    VAF
          VECTL (#3E0, B2)    VMF
          VECTL (#3E8, B2)    VSF
          VECTL (#3F0, B2)    VMF

A powerful example of vector instruction capabilities is found in the use of the hardware-implemented dot-product operation. This operation consists of the multiplication of appropriate elements of two arrays followed by the sum of the products. To implement a matrix multiply operation from Fortran, the ASC compiler uses a single dot-product instruction and the complex indexing capability of the hardware to carry out the full matrix multiply. The compiler partitions the total matrix multiply across the appropriate number of pipes. The execution rate for the elementary operations of matrix multiply is one result per clock cycle for a one-pipe CP, or a rate of four results per clock cycle for a four-pipe CP. Therefore, for matrices of order N, a four-pipe CP will require approximately N^3/4 times the clock period, in seconds. This does not include the startup overhead necessary to fill the pipelines with operands.

Consider the performance of an ASC producing "results per second." In this context "results per second" is the rate at which data fetched from central memory can be operated upon and the results stored back into central memory. Table III shows the maximum performance rates for one- and four-pipe ASC systems performing typical arithmetic operations.

TABLE III-ASC Maximum Performance Rate

                 ASC 1X (ONE AU)                ASC 4X (FOUR AU'S)
                 32-BIT         64-BIT          32-BIT         64-BIT
                 RESULTS/SEC    RESULTS/SEC     RESULTS/SEC    RESULTS/SEC
ADD              16 X 10^6      9.3 X 10^6      64 X 10^6      37 X 10^6
MULTIPLY         16 X 10^6      5.2 X 10^6      64 X 10^6      21 X 10^6
DOT PRODUCT      16 X 10^6      4.0 X 10^6      64 X 10^6      16 X 10^6

It is the authors' opinion that performance indices for array-oriented architectures are not meaningful when only the Millions of Instructions Per Second (MIPS) factor is used. Since a single vector instruction is equivalent to several scalar instructions (typically Load, Operation, Increment and Test Branch), and the number of data values used determines the number of executions of these scalar instructions, MIP ratings are ambiguous at best. The ASC is oriented toward matching the real world mix of instructions encountered in typical applications instead of sacrificing scalar capability to provide vector capability.

Another performance measure can be determined from the present performance of ASC System #4 executing a particular weather benchmark. Although the benchmark is not a full weather prediction code, it does have the characteristic source code sequences and reflects the ability of the Fortran compiler to produce efficient code from a large applications package. It reflects a mix of both scalar and vector instructions as well as I/O and other system services. [...] and present ASC timing with checkout not finalized has already demonstrated approximately 30 minutes. This ratio of 8.2 is a measure of the total system performance upon this program.

In order to compare the observed ASC performance on the Weather Benchmark, data found in the Bulletin of the American Meteorological Society (1) is given in Table IV. Using the IBM S/360 Model 65 as the basis of reference, each of the systems listed is compared as to relative speed. The present ASC speed would be 41 in the table.

TABLE IV-Relative Computer Capacity*
Third Generation Systems

MFR        MODEL              RELATIVE SPEED
IBM        S/360 MODEL 65     1
IBM        S/360 MODEL 75     1.5
CDC        6500               1.5
CDC        6600               2.5
IBM        S/370 MODEL 165    3.5
IBM        S/360 MODEL 91     5
HITACHI    HITAC 8800         5
IBM        S/360 MODEL 95     7
CDC        7600               8
IBM        S/360 MODEL 195    8

* Data taken from Table E.2, page 546, Bulletin of the American Meteorological Society, Vol. 54, No. 6, June 1973.

ACKNOWLEDGMENTS

It would not be possible to acknowledge all the contributors to the development of the ASC, but particular recognition ...

REFERENCE

1. "... on the Modeling Aspects of ...," Bulletin of the American Meteorological Society, Vol. 54, No. 6, June 1973.

National Computer Conference, 1974
Vector dot product is a special case in the sense that the results per second rate pertains to the elementary operations. Execution speed of the benchmark on the IBM Model 91 is approximately 246 minutes.6. Program for the study conference . to complete a matrix multiply of two N by N matrices. (~) gives the sequence of instructions produced by the ASC compiler. The design of the ASC has been directed t. Assumptions are that the clock cycle is 60 nanoseconds and that the pipelines are already filled with operands. Using the observed ASC/M91 ratio of 8. W. A. Little. REFERENCES 1. Dean. No. From the collection of the Computer History Museum (www. Galindo. D. Kastner. C. C. A. table E. M.org) . L. Garth. Nolte. Vol.. June 1973. G. Winkelman. A. \V. Many other members of the Texas Instruments staff have 397 also contributed i. R. L. Bulletin of the American Meteorological Society. G. 54. C. Riccomi.YJlIIleasurably in the development of the ASC. F. E. F. page 546. T. E. Stephenson.computerhistory. Boswell. E. Cragon. D.6. H. W. Hall. Software concepts are due in large part to the efforts of Messrs. M. Best. R. C. and N. D. and S.Operational Experiences with the TI Advanced Scientific Computer should be given to lVlessrs. Husband. Chandler who contributed significantly to the development of the hardware. H. Program for the study conference on the Modeling Aspects of Gate. Cohagan. computerhistory.org) .From the collection of the Computer History Museum (www.
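The timing statements under MAXIMIZING PERFORMANCE reduce to simple arithmetic, and a short sketch may help a reader check them. The Python below is only an illustration under the paper's stated assumptions (60-nanosecond clock cycle, pipelines already full, startup overhead ignored). The function names are hypothetical and are not part of any ASC software, and the Model 91 rating of 5 is an assumption inferred from the quoted figures (41 divided by 8.2), not a value read directly from Table IV.

```python
# Back-of-envelope timing model. All names are illustrative, not ASC software.

CLOCK_PERIOD_S = 60e-9  # 60 ns clock cycle, the assumption stated for Table III


def peak_results_per_second(pipes: int, clock_period_s: float = CLOCK_PERIOD_S) -> float:
    """Peak rate with full pipelines: one result per pipe per clock cycle."""
    return pipes / clock_period_s


def matrix_multiply_seconds(n: int, pipes: int, clock_period_s: float = CLOCK_PERIOD_S) -> float:
    """Time for an N x N matrix multiply: N**3 elementary results, startup ignored."""
    return n ** 3 * clock_period_s / pipes


def relative_speed(reference_minutes: float, asc_minutes: float, reference_rating: float) -> float:
    """Scale a reference machine's rating by the observed run-time ratio."""
    return reference_rating * (reference_minutes / asc_minutes)


if __name__ == "__main__":
    print(f"1 pipe : {peak_results_per_second(1):.3e} results/s")
    print(f"4 pipes: {peak_results_per_second(4):.3e} results/s")
    print(f"100 x 100 multiply, 4 pipes: {matrix_multiply_seconds(100, 4) * 1e3:.1f} ms")
    print(f"benchmark ratio (Model 91 / ASC): {246 / 30:.1f}")
    # Rating of 5 for the Model 91 is an assumption inferred from 41 / 8.2.
    print(f"implied ASC relative speed: {relative_speed(246, 30, 5):.0f}")
```

Under these assumptions a one-pipe CP peaks near 16.7 million results per second, which Table III appears to list as 16 x 10^6, and a 100 by 100 matrix multiply on four pipes takes about 15 milliseconds before startup overhead is counted.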
