Lec08-IRAM

Vector IRAM: A Microprocessor Architecture for Media Processing
Christoforos E. Kozyrakis, U.C. Berkeley
CS252 Graduate Computer Architecture
February 10, 2000

Outline
• Motivation for IRAM
  – technology trends
  – design trends
  – application trends
• Vector IRAM
  – instruction set
  – prototype architecture
  – performance

Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. time, 1980 to 2000. CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-memory performance gap grows 50%/year.]

Processor-DRAM Tax
[Figure: bar chart of transistor counts (millions), split into logic and memory, for Intel PIII Xeon, MIPS R12000, HP PA-8500, Sun Ultra-2, PowerPC G4, IBM Power3, AMD Athlon, and Alpha 21264.]

Power Consumption
[Figure: performance (Spec95FP) vs. power (W) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]

Other Design Challenges
• Interconnect scaling problems
  – multiple cycles to go across the chip
  – difficult to achieve single cycle result forwarding
  – need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency
• Design complexity of high-end CPUs
  – 4 to 5 years from scratch to chips for new superscalar architectures
  – >100 engineers
  – >50% of resources to design verification

Complexity vs. Performance Gains

                           R5000      R10000         R10K/R5K
  Clock Rate               200 MHz    195 MHz        1.0x
  On-Chip Caches           32K/32K    32K/32K        1.0x
  Instructions/Cycle       1 (+ FP)   4              4.0x
  Pipe stages              5          5-7            1.2x
  Model                    In-order   Out-of-order   ---
  Die Size (mm2)           84         298            3.5x
    – wo cache, TLB        32         205            6.3x
  Development (man years)  60         300            5.0x
  SPECint_base95           5.7        8.8            1.6x

Future microprocessor applications
• Multimedia applications
  – image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption etc.
  – narrow data types, streaming data, real-time requirements
• Mobile and embedded environments
  – notebooks, PDAs, cellular phones, pagers, game consoles, digital cameras, cars etc.
  – small devices, limited chip-count, limited power/energy budget
• Significantly different environment from the desktop/workstation model

Requirements on microprocessors (1)
• High performance for multimedia:
  – real-time performance guarantees
  – support for continuous media data-types
  – exploit fine-grain parallelism
  – exploit coarse-grain parallelism
  – exploit high instruction reference locality
  – code density
  – high memory bandwidth

Average vs. real time performance
[Figure: distribution of inputs (0% to 45%) over performance, from best case to worst case, for three designs A, B, and C, with the average marked.]
Which one is the best?
• Statistical ⇒ Average ⇒ C
• Real time ⇒ Worst ⇒ A

Requirements on microprocessors (2)
• Low power and energy consumption
  – energy efficiency for long battery life
  – power efficiency for system cost reduction (cooling system, packaging etc.)
• Design scalability
  – performance scalability
  – physical design scalability
    • design complexity, verification complexity
  – immunity to interconnect scaling problems
    • locality of interconnect, tolerance to latency
• System-on-a-chip (SoC)
  – highly integrated system
  – low system chip-count

The IRAM vision statement
Microprocessor & DRAM on a single chip:
  – on-chip memory latency 5-10X, bandwidth 50-100X
  – improve energy efficiency 2X-4X (no off-chip bus)
  – serial I/O 5-10X v. buses
  – smaller board area/volume
  – adjustable memory size/width
[Figure: a conventional system (Proc with caches and L2$ built in a logic fab, connected over buses to DRAM and I/O) next to an IRAM chip (Proc and DRAM together, built in a DRAM fab, with I/O).]

Vector IRAM
• Vector processing
  – high-performance for media processing
  – low power/energy for processor control
  – modularity, low complexity
  – scalability
  – well understood software development
• Embedded DRAM
  – high bandwidth for vector processing
  – low power/energy for memory accesses
  – modularity, scalability
  – small system size

IRAM ISA summary
• Full vector instruction set with
  – 32 vector registers, 32 vector flag registers
  – support for multiple data types (64b, 32b, 16b, 8b)
  – support for strided and indexed memory accesses
  – support for auto-increment addressing
  – support for DSP operations (multiply-add, saturation etc)
  – support for conditional execution
  – support for software speculation
  – support for fast reductions and butterfly permutations
  – support for virtual memory
  – restartable arithmetic (FP & integer) exceptions
• Implemented as a coprocessor extension to MIPS64 ISA (coprocessor 2)

Vector architectural state
[Figure: virtual processors VP0, VP1, ..., VP$vlr-1 ($vlr of them), each $vpw wide. General purpose registers vr0, vr1, ..., vr31 (32). Flag registers vf0, vf1, ..., vf31 (32), 1b each. Scalar registers vs0, vs1, ..., vs31, 64b. Control registers vcr0, vcr1, ..., vcr31.]

Fixed-point Multiply-add
[Figure: x and y (n/2 bits each) are multiplied into an n-bit product, which is shifted right and rounded to give z (n bits); z is then added to w (n bits) with saturation to give a (n bits). The two steps are labelled "Mul & Shift Right & Round" and "Add & Sat".]
• Multiply halves & shift instruction provides support for any fixed-point format
• Precision is equal to the datatype width, multiplier's inputs have half the width
• Uniform, simple support for all datatypes
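The data flow on the Fixed-point Multiply-add slide (multiply halves, shift right and round, then add and saturate) can be sketched in Python. This is a minimal illustration rather than the VIRAM instruction semantics: signed two's-complement operands, round-half-up on the shift, and the function names `saturate`, `mul_shift_round`, and `fixed_point_madd` are all assumptions made for the example.

```python
def saturate(value, n):
    """Clamp value to the signed n-bit range (assumed two's complement)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, value))


def mul_shift_round(x, y, shift, n):
    """Multiply two signed n/2-bit inputs, then shift right with rounding.

    The product of two n/2-bit values always fits in n bits, which is why
    the slide can keep the precision equal to the datatype width.
    Rounding is round-half-up here; the real rounding mode is an assumption.
    """
    half = n // 2
    lo, hi = -(1 << (half - 1)), (1 << (half - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("inputs must fit in n/2 signed bits")
    product = x * y
    if shift > 0:
        # Add half of the discarded range before the arithmetic shift.
        product = (product + (1 << (shift - 1))) >> shift
    return product


def fixed_point_madd(x, y, w, shift, n=16):
    """a = sat(w + round((x * y) >> shift)), all in n-bit precision."""
    z = mul_shift_round(x, y, shift, n)
    return saturate(w + z, n)


if __name__ == "__main__":
    # Two values with 7 fractional bits (0.5 * 0.5), accumulated onto w.
    print(fixed_point_madd(64, 64, 100, shift=7, n=16))
    # An accumulation that would overflow 16 bits is clamped instead.
    print(fixed_point_madd(127, 127, 32767, shift=0, n=16))
```

Choosing the shift amount per instruction is what lets one operation serve any fixed-point format: the shift simply selects where the binary point of the product lands.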
VIRAM-1 prototype

Design Overview
• 64b MIPS scalar core
  – coprocessor interface
  – 16KB I/D caches
• Memory system
  – 8 2MByte eDRAM banks
  – single sub-bank per bank
  – 256-bit synchronous interface, separate I/O signals
  – 20ns cycle time, 6.6ns column access
  – crossbar interconnect for 12.8 GB/sec per direction
  – no caches
• Vector unit
  – 8KByte vector register file
  – support for 64b, 32b, and 16b data-types
  – 2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  – 4 64-bit datapaths per unit
  – DRAM latency included in vector pipeline
  – 4 addresses/cycle for strided/indexed accesses
  – 2-level TLB
• Network interface
  – user-level message passing
  – dedicated DMA engines
  – 4 100MByte/s links

Vector Unit Pipeline Structure
• Single-issue, in-order pipeline
  – each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
• DRAM latency is included in the execution pipeline (delayed pipeline)
  – deep pipeline design, but no caches needed to avoid stalls
  – worst case DRAM latency does not cause pipeline stalls
• Address decoupling buffer
  – buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  – memory conflicts do not stall pipeline

Non-Delayed Pipeline
[Figure: scalar stages F D X M W feeding the VLOAD (A T VW), VALU (VR X1 X2 ... XN VW), and VSTORE (A T VR) pipelines for the sequence vld, vadd, vst. DRAM latency: >=20ns. The load → ALU RAW hazard is long.]
Load → ALU exposes full DRAM latency (long)

Tolerating Memory Latency: Delayed Pipeline
[Figure: the same VLOAD, VALU, and VSTORE pipelines for the sequence vld, vadd, vst, with DELAY stages inserted ahead of VR in the VALU pipeline. DRAM latency: >20ns.]
Load → ALU sees functional unit latency (short)

Clustered VLSI Design
[Figure: control logic plus four identical 64b lanes, 256b in total. Each lane contains Xbar I/F, Integer Datapath 0, Vector Registers, Flag Regs. & Datapath, FP Datapaths, Integer Datapath 1, and Xbar I/F.]

VIRAM-1 Floorplan
[Figure: DRAM banks 0, 2, 4, 6 along one edge and DRAM banks 1, 3, 5, 7 along the opposite edge. Between them: NI, MIPS, CTL, Vector Lanes 0 to 3, and IO.]

Prototype Summary
• Technology:
  – 0.18um eDRAM CMOS process (IBM)
  – 6 layers of copper interconnect
  – 1.2V and 1.8V power supply
• Memory: 16 MBytes
• Clock frequency: 200MHz
• Power: 2 W for vector unit and memory
• Transistor count: ~140 million
• Peak performance:
  – GOPS with multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  – GOPS without multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  – GFLOPS: 1.6 (32b)

Kernels Performance

                         Peak Perf.    Sustained Perf.   % of Peak
  Image Composition      6.4 GOPS      6.40 GOPS         100.0%
  iDCT                   6.4 GOPS      1.97 GOPS         30.7%
  Color Conversion       3.2 GOPS      3.07 GOPS         96.0%
  Image Convolution      3.2 GOPS      3.16 GOPS         98.7%
  Integer MV Multiply    3.2 GOPS      2.77 GOPS         86.5%
  Integer VM Multiply    3.2 GOPS      3.00 GOPS         93.7%
  FP MV Multiply         1.6 GFLOPS    1.40 GFLOPS       87.5%
  FP VM Multiply         1.6 GFLOPS    1.59 GFLOPS       99.6%
  AVERAGE                                                86.6%

• Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths

Comparisons

                       VIRAM   MMX            VIS            TMS320C82
  Image Composition    0.13    -              2.22 (17.0x)   -
  iDCT                 1.18    3.75 (3.2x)    -              -
  Color Conversion     0.78    8.00 (10.2x)   -              5.70 (7.6x)
  Image Convolution    …       5.49 (4.5x)    6.19 (5.1x)    6.50 (5.3x)

• All numbers in cycles/pixel
• MMX, VIS, and TMS results assume all data in L1 cache
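The peak numbers on the Prototype Summary slide, and the "128 operations, 8 cycles" figure on the Vector Unit Pipeline Structure slide, can be cross-checked with a little arithmetic. The sketch below assumes that each 64-bit datapath processes 64/width subword elements per cycle, that both arithmetic units can issue integer operations, that the register file is split evenly over the 32 vector registers, and that a multiply-add counts as two operations. The slides give the parameters but not this formula, so treat it as one reading that happens to reproduce the quoted figures; the function names are made up for the example.

```python
CLOCK_MHZ = 200            # clock frequency from the Prototype Summary
DATAPATHS_PER_UNIT = 4     # "4 64-bit datapaths per unit"
DATAPATH_BITS = 64
ARITHMETIC_UNITS = 2       # "2 arithmetic (1 FP)"
VREG_FILE_BYTES = 8 * 1024 # "8KByte vector register file"
NUM_VECTOR_REGS = 32       # "32 vector registers"


def elements_per_register(width_bits):
    """Elements of a given width that fit in one vector register (assumed even split)."""
    bits_per_register = VREG_FILE_BYTES * 8 // NUM_VECTOR_REGS
    return bits_per_register // width_bits


def ops_per_cycle(width_bits):
    """Element operations one functional unit completes per cycle (assumed subword packing)."""
    return DATAPATHS_PER_UNIT * (DATAPATH_BITS // width_bits)


def occupancy_cycles(width_bits):
    """Cycles a full-length vector instruction occupies one functional unit."""
    return elements_per_register(width_bits) // ops_per_cycle(width_bits)


def peak_gops(width_bits, units=ARITHMETIC_UNITS, ops_per_element=1):
    """Peak giga-operations per second; ops_per_element=2 models multiply-add."""
    return CLOCK_MHZ * units * ops_per_cycle(width_bits) * ops_per_element / 1000


if __name__ == "__main__":
    for width in (64, 32, 16):
        print(width,
              elements_per_register(width),
              occupancy_cycles(width),
              peak_gops(width),
              peak_gops(width, ops_per_element=2))
    # One FP unit on 32-bit data for the GFLOPS line.
    print(peak_gops(32, units=1))
```

Under these assumptions a register holds 128 16-bit elements and every data width occupies a unit for the same 8 cycles, which is consistent with the pipeline slide.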
FFT Performance
[Figure: time (microseconds, 0 to 200) vs. size (#points in FFT: 128, 256, 512, 1024), with curves for fixed point (16 bit) and floating point (32 bit). Labelled points: Pentium/200: 151 us; TMS320C67x: 124 us; PPC604e: 87 us; TigerSHARC: 41 us; VIRAM: 37 us; CRI Pulsar: 27.9 us; Wildstar: 25 us; CRI Pathfinder-1: 22.3 us.]
• Note: simulations performed with unscheduled fixed-point code

Motion Estimation Performance

  Size              VIRAM-1 (cycles)    MMX (cycles)
  QCIF (176x144)    7.1x10^6 (4.6x)     3.3x10^7
  CIF (352x288)     2.8x10^7 (5.0x)     1.4x10^8

• Note: MMX results assume all data in L1 cache

Overall Performance of H.263
[Table: encoding speed for the test sequences Akiyo (about 12 kbit/s), Mom (about 16 kbit/s), Hall (about 20 kbit/s), and Foreman (about 65 kbit/s). The reported speeds lie between 20 and 24 fps.]
• Average encoding speed for H.263 on VIRAM, standard mpeg test sequences, using exhaustive search for motion estimation and LLM for DCT.
• Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths

Summary

Class Project Suggestions
• Architecture comparisons & applications
  – information retrieval
  – signal processing apps
  – neural nets training
• Multimedia application analysis
  – operand reuse patterns
  – branch behavior
  – data/value locality and memory access patterns
• Low power/energy architectures
  – energy-exposed ISA design
  – compilation for low energy
  – speculation use for power reduction
