CS6303 COMPUTER ARCHITECTURE Notes for 2013 Regulation

Kingston Engineering College
Chittoor Main Road, Katpadi, Vellore 632 059.
Approved by AICTE, New Delhi & affiliated to Anna University, Chennai

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THIRD SEMESTER
CS6303 COMPUTER ARCHITECTURE NOTES
Prepared by Mr. M. AZHAGIRI, AP/CSE

CS6303 COMPUTER ARCHITECTURE  L T P C  3 0 0 3

OBJECTIVES:
- To make students understand the basic structure and operation of a digital computer.
- To understand the hardware-software interface.
- To familiarize the students with the arithmetic and logic unit and the implementation of fixed-point and floating-point arithmetic operations.
- To expose the students to the concept of pipelining.
- To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
- To expose the students to different ways of communicating with I/O devices and standard I/O interfaces.

UNIT I OVERVIEW & INSTRUCTIONS (9)
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.

UNIT II ARITHMETIC OPERATIONS (7)
ALU – Addition and subtraction – Multiplication – Division – Floating point operations – Subword parallelism.

UNIT III PROCESSOR AND CONTROL UNIT (11)
Basic MIPS implementation – Building datapath – Control implementation scheme – Pipelining – Pipelined datapath and control – Handling data hazards & control hazards – Exceptions.

UNIT IV PARALLELISM (9)
Instruction-level parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors.

UNIT V MEMORY AND I/O SYSTEMS (9)
Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors.

TOTAL: 45 PERIODS

OUTCOMES: At the end of the course, the student should be able to:
- Design an arithmetic and logic unit.
- Design and analyse pipelined control units.
- Evaluate the performance of memory systems.
- Understand parallel processing architectures.

TEXT BOOK:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth Edition, Morgan Kaufmann / Elsevier, 2014.

REFERENCES:
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", Sixth Edition, McGraw-Hill, 2012.
2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.
3. Vincent P. Heuring and Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.
4. Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", First Edition, Tata McGraw-Hill, New Delhi, 2005.
5. John P. Hayes, "Computer Architecture and Organization", Third Edition, Tata McGraw-Hill, 1998.
6. http://nptel.ac.in/

UNIT I OVERVIEW & INSTRUCTIONS
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.
1.1 Eight ideas
1.2 Components of a computer system
1.3 Technology and performance
1.4 Power wall
1.5 Uniprocessors to multiprocessors
1.6 Instructions – operations and operands
1.7 Representing instructions
1.8 Logical operations
1.9 Control operations
1.10 Addressing and addressing modes
1.1 EIGHT IDEAS

These ideas are so powerful that they have lasted long after the first computers that used them:
1. Design for Moore's Law
2. Use Abstraction to Simplify Design
3. Make the Common Case Fast
4. Performance via Parallelism
5. Performance via Pipelining
6. Performance via Prediction
7. Hierarchy of Memories
8. Dependability via Redundancy

Design for Moore's Law: Moore's Law states that integrated circuit resources double every 18-24 months. Computer architects must therefore anticipate where the technology will be when the design finishes rather than design for where it starts, since the resources available per chip can easily double or quadruple between the start and finish of a project.

Use Abstraction to Simplify Design: A major productivity technique for hardware and software is to use abstractions to represent the design at different levels of representation; lower-level details are hidden to offer a simpler model at higher levels.

Make the Common Case Fast: Making the common case fast will tend to enhance performance better than optimizing the rare case. Ironically, the common case is often simpler than the rare case and hence is often easier to enhance.

Performance via Parallelism: Computer architects have offered designs that get more performance by performing operations in parallel.

Performance via Pipelining: A particular pattern of parallelism is so prevalent in computer architecture that it merits its own name: pipelining.

Performance via Prediction: In some cases it can be faster on average to guess and start working rather than wait until you know for sure, assuming that the mechanism to recover from a misprediction is not too expensive and the prediction is relatively accurate.

Hierarchy of Memories: Programmers want memory to be fast, large, and cheap, as memory speed often shapes performance, capacity limits the size of problems that can be solved, and the cost of memory today is often the majority of computer cost. A hierarchy of memories places the fastest, smallest, and most expensive memory per bit at the top and the slowest, largest, and cheapest per bit at the bottom. Caches give the programmer the illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.

Dependability via Redundancy: Computers not only need to be fast; they need to be dependable. Since any physical device can fail, we make systems dependable by including redundant components that can take over when a failure occurs and that help detect failures.

1.2 COMPONENTS OF A COMPUTER SYSTEM

Software is organized primarily in a hierarchical fashion, with applications being the outermost ring and a variety of systems software sitting between the hardware and the applications software. There are many types of systems software, but two are central to every computer system today: an operating system and a compiler.

An operating system interfaces between a user's program and the hardware and provides a variety of services and supervisory functions. Among the most important functions are:
- Handling basic input and output operations
- Allocating storage and memory
- Providing for protected sharing of the computer among multiple applications using it simultaneously

Examples of operating systems in use today are Linux, iOS, and Windows.

FIGURE: A simplified view of hardware and software as hierarchical layers.
Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute.

From a High-Level Language to the Language of Hardware

Assembler: this program translates a symbolic version of an instruction into the binary version. For example, the programmer would write add A,B and the assembler would translate this notation into 1000110010100000. The binary language that the machine understands is the machine language. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer. Later, high-level programming languages and compilers were introduced to translate high-level language programs into instructions.

Example:
High-level language:        a = a + b;
Assembly language:          add A,B
Binary / machine language:  1000110010100000

High-level programming languages offer several important benefits:
- They allow the programmer to think in a more natural language, using English words and algebraic notation. Fortran was designed for scientific computation, Cobol for business data processing, and Lisp for symbol manipulation.
- They improve programmer productivity.
- They allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer.

FIVE CLASSIC COMPONENTS OF A COMPUTER

The five classic components of a computer are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor.

I/O EQUIPMENT:
The most fascinating I/O device is probably the graphics display. Liquid crystal displays (LCDs) are used to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light. A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or, less often, from reflected light. The rods straighten out when a current is applied and no longer bend the light. Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCD displays use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three colour components in the final image; in a colour active matrix LCD, there are three transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. A colour display might use 8 bits for each of the three colours (red, blue, and green). The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate.

The processor is the active part of the computer, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. The processor logically comprises two main components: datapath and control, the respective brawn and brain of the processor.
The datapath performs the arithmetic operations, and control tells the datapath, memory, and I/O devices what to do according to the wishes of the instructions of the program.

The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.

1.3.1 TECHNOLOGY: CHIP MANUFACTURING PROCESS

The manufacture of a chip begins with silicon, a substance found in sand. Because silicon does not conduct electricity well, it is called a semiconductor. With a special chemical process, it is possible to add materials to silicon that allow tiny areas to transform into one of three devices:
- Excellent conductors of electricity (using either microscopic copper or aluminium wire)
- Excellent insulators from electricity (like plastic sheathing or glass)
- Areas that can conduct or insulate under special conditions (as a switch)

Transistors fall in the last category. A VLSI circuit, then, is just billions of combinations of conductors, insulators, and switches manufactured in a single small package.

The figure shows the process for integrated chip manufacturing. The process starts with a silicon crystal ingot, which looks like a giant sausage. Today, ingots are 8-12 inches in diameter and about 12-24 inches long. An ingot is finely sliced into wafers no more than 0.1 inches thick. These wafers then go through a series of processing steps, during which patterns of chemicals are placed on each wafer, creating the transistors, conductors, and insulators. The simplest way to cope with imperfection is to place many independent components on a single wafer. The patterned wafer is then chopped up, or diced, into these components, called dies and more informally known as chips. To reduce cost, the next-generation process shrinks a large die as it uses smaller sizes for both transistors and wires; this improves the yield and the die count per wafer. Once good dies are found, they are connected to the input/output pins of a package, using a process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging, and then they are shipped to customers.

1.3.2 PERFORMANCE

Running a program on two different desktop computers, you would say that the faster one is the desktop computer that gets the job done first. If you were running a data centre that had several servers running jobs submitted by many users, you would say that the faster computer was the one that completed the most jobs during a day. As an individual computer user, you are interested in reducing response time (the time between the start and completion of a task), also referred to as execution time. Data centre managers are often interested in increasing throughput or bandwidth, the total amount of work done in a given time.

Measuring Performance: The computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. CPU execution time, or simply CPU time, is the time the CPU spends computing for this task and does not include time spent waiting for I/O or running other programs.
CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on behalf of the program, called system CPU time. The term system performance refers to elapsed time on an unloaded system, and CPU performance refers to user CPU time.

CPU Performance and Its Factors:

CPU execution time for a program = CPU clock cycles for a program x Clock cycle time

Alternatively, because clock rate and clock cycle time are inverses,

CPU execution time for a program = CPU clock cycles for a program / Clock rate

This formula makes it clear that the hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle.

Instruction Performance: The performance equations above did not include any reference to the number of instructions needed for the program. The execution time must depend on the number of instructions in a program: it equals the number of instructions executed multiplied by the average time per instruction. The clock cycles required for a program can therefore be written as

CPU clock cycles = Instructions for a program x Average clock cycles per instruction

The term clock cycles per instruction, the average number of clock cycles each instruction takes to execute, is abbreviated as CPI. CPI provides one way of comparing two different implementations of the same instruction set architecture, since the number of instructions executed for a program will be the same.

The Classic CPU Performance Equation: The basic performance equation in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time is

CPU time = Instruction count x CPI x Clock cycle time

or, since the clock rate is the inverse of clock cycle time,

CPU time = (Instruction count x CPI) / Clock rate

These formulas are particularly useful because they separate the three key factors that affect performance.

Components of performance              Units of measure
CPU execution time for a program       Seconds for the program
Instruction count                      Instructions executed for the program
Clock cycles per instruction (CPI)     Average number of clock cycles per instruction
Clock cycle time                       Seconds per clock cycle

We can measure the CPU execution time by running the program, and the clock cycle time is usually published as part of the documentation for a computer. The instruction count and CPI can be more difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the instruction count or the CPI to determine the other.
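As a quick illustration of the classic performance equation, the small C sketch below computes CPU time from an instruction count, CPI, and clock rate. The numbers are invented values chosen only for illustration; they are not taken from the prescribed text.

    #include <stdio.h>

    int main(void) {
        /* Assumed example values, not measurements: */
        double instruction_count = 10e9;   /* 10 billion instructions executed      */
        double cpi               = 2.0;    /* average clock cycles per instruction  */
        double clock_rate        = 4e9;    /* 4 GHz clock, i.e. cycles per second   */

        /* CPU time = (Instruction count x CPI) / Clock rate */
        double cpu_clock_cycles = instruction_count * cpi;
        double cpu_time         = cpu_clock_cycles / clock_rate;

        printf("CPU clock cycles = %.0f\n", cpu_clock_cycles); /* 20000000000 cycles */
        printf("CPU time = %.2f seconds\n", cpu_time);         /* 5.00 seconds       */
        return 0;
    }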
1.4 POWER WALL

Both clock rate and power increased rapidly for decades, and then flattened off recently. The energy metric joules is a better measure than a power rate like watts, which is just joules per second. The dominant technology for integrated circuits is called CMOS (complementary metal oxide semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy, that is, energy consumed when transistors switch states from 0 to 1 and vice versa.

Energy is proportional to Capacitive load x (Voltage)^2

This equation gives the energy of a pulse during the logic transition 0 -> 1 -> 0 or 1 -> 0 -> 1. The energy of a single transition is then

Energy is proportional to 1/2 x Capacitive load x (Voltage)^2

The power required per transistor is just the product of the energy of a transition and the frequency of transitions:

Power is proportional to Energy x Frequency switched
Power is proportional to 1/2 x Capacitive load x (Voltage)^2 x Frequency switched

Frequency switched is a function of the clock rate. The capacitive load per transistor is a function of both the number of transistors connected to an output (called the fanout) and the technology, which determines the capacitance of both wires and transistors. Energy, and thus power, can be reduced by lowering the voltage, which occurred with each new generation of technology, since power is a function of the voltage squared. In 20 years, voltages have gone from 5 V to 1 V, which is why the increase in power is only 30 times. To try to address the power problem, designers have already attached large devices to increase cooling, and they turn off parts of the chip that are not used in a given clock cycle.

1.5 UNIPROCESSORS TO MULTIPROCESSORS

The power limit has forced a dramatic change in the design of microprocessors. The improvement in response time of programs for desktop microprocessors has slowed: since 2002, the rate has fallen from a factor of 1.5 per year to a factor of 1.2 per year. Rather than continuing to decrease the response time of a single program running on a single processor, as of 2006 all desktop and server companies are shipping microprocessors with multiple processors per chip, where the benefit is often more on throughput than on response time. To reduce confusion between the words processor and microprocessor, companies refer to processors as "cores", and such microprocessors are generically called multicore microprocessors. Hence, a "quad-core" microprocessor is a chip that contains four processors, or four cores.

In the past, programmers could rely on innovations in hardware, architecture, and compilers to double the performance of their programs every 18 months without having to change a line of code. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors. Moreover, to get the historic benefit of running faster on new microprocessors, programmers will have to continue to improve the performance of their code as the number of cores increases. To reinforce how the software and hardware systems work hand in hand, the textbook uses special Hardware/Software Interface sections throughout; these elements summarize important insights at this critical interface.

Why the shift from uniprocessors to multiprocessors:
1. Increasing the clock speed of a uniprocessor has reached saturation and cannot be increased beyond a certain limit because of power consumption and heat dissipation issues.
2. As the physical size of the chip decreased while the number of transistors per chip increased, clock speed increased, which boosted the heat dissipation across the chip to a dangerous level; cooling and heat-sink requirements became an issue.
3. There were limitations in the use of silicon surface area.
4. There were limitations in reducing the size of individual gates further.
5. To gain performance within a single core, many techniques such as pipelining, superpipelined and superscalar architectures are used.
6. Most of the early dual-core processors ran at lower clock speeds; the rationale is that a dual-core processor with each core running at 1 GHz should be equivalent to a single-core processor running at 2 GHz.
7. The problem is that this does not work in practice when the applications are not written to take advantage of the multiple processors. Until the software is written this way, unthreaded applications will run faster on a single processor than on a dual-core CPU.
8. In multicore processors, the benefit is more on throughput than on response time.
9. In the past, programmers could rely on innovations in hardware, architecture, and compilers to double the performance of their programs every 18 months without having to change a line of code.
10. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors, and they also have to improve the performance of their code as the number of cores increases.

The need of the hour is:
11. The ability to write parallel programs.
12. Care must be taken to reduce communication and synchronization overhead; challenges in scheduling and load balancing have to be addressed.

1.6 INSTRUCTIONS – OPERATIONS AND OPERANDS

Operations in MIPS: Every computer must be able to perform arithmetic. The MIPS assembly language notation

add a, b, c

instructs a computer to add the two variables b and c and to put their sum in a. This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always have exactly three variables.

EXAMPLE: To add the four variables b, c, d, e and store the result in a:

add a, b, c   # The sum of b and c is placed in a
add a, a, d   # The sum of b, c, and d is now in a
add a, a, e   # The sum of b, c, d, and e is now in a

Thus, it takes three instructions to sum the four variables.

Design Principle 1: Simplicity favours regularity.

EXAMPLE: Compiling two C assignment statements into MIPS. This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this example and the next few work for either high-level programming language:

a = b + c;
d = a - e;

The translation from C to MIPS assembly language instructions is performed by the compiler. A MIPS instruction operates on two source operands and places the result in one destination operand. Hence, the two simple statements above compile directly into these two MIPS assembly language instructions:

add a, b, c
sub d, a, e

Operands in MIPS: The operands of arithmetic instructions are restricted; they must come from a limited number of special locations built directly in hardware called registers. The size of a register in the MIPS architecture is 32 bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.

Design Principle 2: Smaller is faster. A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther, so 32 registers are used in the MIPS architecture. The MIPS convention is to use two-character names following a dollar sign to represent a register, e.g. $s0, $s1.

Example: f = (g + h) - (i + j); using registers:
add $t0,$s1,$s2   # register $t0 contains g + h
add $t1,$s3,$s4   # register $t1 contains i + j
sub $s0,$t0,$t1   # f gets $t0 - $t1, which is (g + h) - (i + j)

Memory Operands: Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures such as arrays and structures. These complex data structures can contain many more data elements than there are registers in a computer. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements, so MIPS must include instructions that transfer data between memory and registers. Such instructions are called data transfer instructions.

To access a word in memory, the instruction must supply the memory address. The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and a register used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The actual MIPS name for this instruction is lw, standing for load word.

EXAMPLE: g = h + A[8];
To get A[8] from memory, use lw:
lw $t0,8($s3)     # Temporary reg $t0 gets A[8]
Then use the result of A[8] stored in $t0:
add $s1,$s2,$t0   # g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, and the register added to form the address ($s3) is called the base register.

In MIPS, words must start at addresses that are multiples of 4. This requirement is called an alignment restriction, and many architectures have it (since in MIPS each 32 bits form a word in memory, the address of one word to the next jumps in multiples of 4). Byte addressing also affects the array index: to get the proper byte address in the code above, the offset to be added to the base register $s3 must be 4 x 8, or 32.

EXAMPLE: g = h + A[8]; (implemented with byte addresses)
To get A[8] from memory, use lw with offset (8 x 4) = 32:
lw $t0,32($s3)    # Temporary reg $t0 gets A[8]
add $s1,$s2,$t0   # g = h + A[8]

The instruction complementary to load is traditionally called store; it copies data from a register to memory. The format of a store is similar to that of a load: the name of the operation, followed by the register to be stored, then the offset to select the array element, and finally the base register. Once again, the MIPS address is specified in part by a constant and in part by the contents of a register. The actual MIPS name is sw, standing for store word.

EXAMPLE: A[12] = h + A[8];
lw $t0,32($s3)    # Temporary reg $t0 gets A[8], note (8 x 4) used
add $t0,$s2,$t0   # Temporary reg $t0 gets h + A[8]
sw $t0,48($s3)    # Stores h + A[8] back into A[12], note (12 x 4) used

Constant or Immediate Operands: To add the constant 4 to register $s3, we could use the code

lw $t0, AddrConstant4($s1)   # $t0 = constant 4
add $s3,$s3,$t0              # $s3 = $s3 + $t0 ($t0 == 4)

An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one operand is a constant, such as add immediate (addi):

addi $s3,$s3,4               # $s3 = $s3 + 4

Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.
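The offset arithmetic above (array index x 4 for 32-bit words) is exactly what a compiler does behind the scenes. The C sketch below, purely illustrative and with invented variable names, makes the byte-address calculation of lw/sw explicit.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int32_t A[16] = {0};      /* array of 32-bit words, like A[] in the examples */
        A[8] = 7;
        int32_t h = 10;

        /* What "lw $t0,32($s3)" expresses: base address of A plus byte offset 8*4 = 32. */
        uint8_t *base = (uint8_t *)A;                 /* base register, like $s3        */
        int32_t *word = (int32_t *)(base + 8 * 4);    /* byte offset 32 selects A[8]    */

        int32_t g = h + *word;                        /* add $s1,$s2,$t0                */
        A[12] = h + *word;                            /* sw $t0,48($s3): offset 12*4    */

        printf("g = %d, A[12] = %d\n", g, A[12]);     /* prints g = 17, A[12] = 17      */
        return 0;
    }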
1.7 REPRESENTING INSTRUCTIONS: INSTRUCTION TYPES USED IN MIPS

Since registers are referred to in instructions, there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, ..., $t0 means register 8, $t1 means register 9, and so on.

MIPS fields of an instruction:
- op: basic operation of the instruction, traditionally called the opcode.
- rs: the first register source operand.
- rt: the second register source operand.
- rd: the register destination operand; it gets the result of the operation.
- shamt: shift amount (used by shift instructions; the field contains zero otherwise).
- funct: function. This field, often called the function code, selects the specific variant of the operation in the op field.

A problem occurs when an instruction needs longer fields than those shown above. The MIPS designers kept all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. The format above is called R-type (for register) or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used by the immediate and data transfer instructions; its fields are op, rs, rt and a 16-bit constant/address. Multiple formats complicate the hardware; we can reduce the complexity by keeping the formats similar. For example, the first three fields of the R-type and I-type formats are the same size and have the same names; the length of the fourth field in I-type (16 bits) is equal to the sum of the lengths of the last three fields of R-type. Note that the meaning of the rt field has changed for this format: in an I-type instruction, the rt field specifies the destination register.

FIG: MIPS instruction encoding.

Example: MIPS instruction encoding in computer hardware. Consider A[300] = h + A[300]; the MIPS instructions for the operation are:

lw $t0,1200($t1)    # Temporary reg $t0 gets A[300]
add $t0,$s2,$t0     # Temporary reg $t0 gets h + A[300]
sw $t0,1200($t1)    # Stores h + A[300] back into A[300]

The table below shows how the hardware decodes the three machine language instructions (field values in decimal):

Instr   op   rs   rt   rd   address/shamt   funct
lw      35    9    8            1200
add      0   18    8    8          0           32
sw      43    9    8            1200

The lw instruction is identified by 35 in the op field, the add instruction by 0 in the op field (with funct 32), and the sw instruction by 43 in the op field.

Binary version of the above table:

lw    100011 01001 01000 0000010010110000
add   000000 10010 01000 01000 00000 100000
sw    101011 01001 01000 0000010010110000

1.8 LOGICAL OPERATIONS

MIPS (like other instruction sets and high-level languages) provides logical operators for shifts, AND, OR and NOT/NOR, each with its own symbolic notation.

SHIFT LEFT (sll): The first class of such operations is called shifts. They move all the bits in a word to the left or right, filling the emptied bits with 0s. For example, if register $s0 contained

0000 0000 0000 0000 0000 0000 0000 1001 (binary) = 9 (decimal)

and the instruction to shift left by 4 was executed, the new value would be

0000 0000 0000 0000 0000 0000 1001 0000 (binary) = 144 (decimal)

The dual of a shift left is a shift right. The actual names of the two MIPS shift instructions are shift left logical (sll) and shift right logical (srl).

sll $t2,$s0,4     # reg $t2 = reg $s0 << 4 bits (shifted 4 places)

The shamt field in the R-format is used in shift instructions; it stands for shift amount.
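To make the field packing concrete, here is a small hedged C sketch (not from the textbook) that assembles 32-bit R-format words for the add and sll instructions above, using the standard MIPS field widths op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).

    #include <stdio.h>
    #include <stdint.h>

    /* Pack the six R-format fields into one 32-bit instruction word. */
    static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                             uint32_t rd, uint32_t shamt, uint32_t funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        /* add $t0,$s2,$t0 : op=0, rs=18 ($s2), rt=8 ($t0), rd=8 ($t0), shamt=0, funct=32 */
        uint32_t add_word = encode_r(0, 18, 8, 8, 0, 32);

        /* sll $t2,$s0,4   : op=0, rs=0, rt=16 ($s0), rd=10 ($t2), shamt=4, funct=0 */
        uint32_t sll_word = encode_r(0, 0, 16, 10, 4, 0);

        printf("add $t0,$s2,$t0 -> 0x%08X\n", add_word);
        printf("sll $t2,$s0,4   -> 0x%08X\n", sll_word);
        return 0;
    }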
The encoded version of the above shift instruction is shown in the figure. Shifting left by i bits gives the same result as multiplying by 2^i (refer to the representations of 9 and 144 above): 9 x 2^4 = 9 x 16 = 144, where i = 4, since the shift left was done 4 places.

LOGICAL AND (and): AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1. For example, if register $t2 contains

0000 0000 0000 0000 0000 1101 1100 0000 (binary)

and register $t1 contains

0000 0000 0000 0000 0011 1100 0000 0000 (binary)

then, after executing the MIPS instruction

and $t0,$t1,$t2   # reg $t0 = reg $t1 & reg $t2

the value of register $t0 would be

0000 0000 0000 0000 0000 1100 0000 0000 (binary)

(Bit-wise example; note that we do not add:
    00101
    10111
Bit-wise AND -> 00101  (apply the AND truth table to each bit pair))

AND is traditionally called a mask, since the mask "conceals" some bits.

LOGICAL OR (or): OR is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. If registers $t1 and $t2 are unchanged from the preceding example, then after executing the MIPS instruction

or $t0,$t1,$t2    # reg $t0 = reg $t1 | reg $t2

the value in register $t0 would be

0000 0000 0000 0000 0011 1101 1100 0000 (binary)

(Bit-wise example:
    00101
    10111
Bit-wise OR -> 10111  (apply the OR truth table to each bit pair))

LOGICAL NOT (nor): The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand bit is a 0, and vice versa. Since MIPS needs a three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT.

Step 1: Perform a bit-wise OR with a dummy operand register filled with zeros:
    00101
    00000
    -> 00101
Step 2: Invert the result, giving 11010.

Instruction: nor $t0,$t1,$t3   # reg $t0 = ~(reg $t1 | reg $t3)

Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS also provides the instructions and immediate (andi) and or immediate (ori).

1.9 CONTROL OPERATIONS

Branch and conditional branches: Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements and labels. MIPS assembly language includes two decision-making instructions, similar to an if statement with a go to. The first instruction is

beq register1, register2, L1

This instruction means go to the statement labelled L1 if the value in register1 equals the value in register2. The mnemonic beq stands for branch if equal. The second instruction is

bne register1, register2, L1

It means go to the statement labelled L1 if the value in register1 does not equal the value in register2. The mnemonic bne stands for branch if not equal. These two instructions are traditionally called conditional branches.

EXAMPLE:
if (i == j) f = g + h; else f = g - h;

The MIPS version of the given statements is:

      bne $s3,$s4,Else    # go to Else if i != j
      add $s0,$s1,$s2     # f = g + h   (skipped if i != j)
      j   Exit            # go to Exit
Else: sub $s0,$s1,$s2     # f = g - h   (skipped if i == j)
Exit:

Here bne is used instead of beq because the branch-if-not-equal form of the test is more efficient. This example introduces another kind of branch, often called an unconditional branch: an instruction that says the processor always follows the branch.
To distinguish between conditional and unconditional branches, the MIPS name for this type of instruction is jump, abbreviated as j. (In the example, f, g, h, i, and j are variables mapped to five registers $s0 through $s4.)

Loops: Decisions are important both for choosing between two alternatives (found in if statements) and for iterating a computation (found in loops). The same assembly instructions are the basic building blocks for both cases.

EXAMPLE:
while (save[i] == k) i += 1;

Assume that i and k correspond to registers $s3 and $s5 and the base of the array save is in $s6. The first instruction multiplies i by 4 to get the byte offset; to get the address of save[i], we add $t1 and the base of save in $s6; that address is used to load save[i] into a temporary register; the next instruction performs the loop test, exiting if save[i] != k; the next instruction adds 1 to i; and the end of the loop branches back to the while test at the top. We add the Exit label after it, and we are done:

Loop: sll  $t1,$s3,2     # Temp reg $t1 = i * 4
      add  $t1,$t1,$s6   # $t1 = address of save[i]
      lw   $t0,0($t1)    # Temp reg $t0 = save[i]
      bne  $t0,$s5,Exit  # go to Exit if save[i] != k
      addi $s3,$s3,1     # i = i + 1
      j    Loop          # go to Loop
Exit:

1.10 ADDRESSING AND ADDRESSING MODES

Addressing types:

Three-address instructions
Syntax: opcode destination, source1, source2
Eg: ADD A, B, C (operation is A = B + C)

Two-address instructions
Syntax: opcode destination, source
Eg: ADD A, B (operation is A = A + B)

One-address instructions (to fit in one word length)
Syntax: opcode source
Eg: STORE C (copies the content of the accumulator to memory location C), where the accumulator is a processor register implied by the instruction.

Zero-address instructions (stack operation)
Syntax: opcode
Eg: PUSH A (all addresses are implicit; pushes the value in A onto the stack)

Addressing Modes: The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes. An addressing mode is the method used to determine which register or part of memory is being referred to by a machine instruction.

Register mode: The operand is the content of a processor register; the register name/address is given in the instruction. Here the value of R2 is moved to R1.
Example: MOV R1, R2

Absolute (direct) mode: The operand is in a memory location whose address is given explicitly. Here the value in A is moved to location 1000H.
Example: MOV 1000, A

Immediate mode: Address and data constants can be given explicitly in the instruction. Here the constant 200 is moved to register R0.
Example: MOV #200, R0

Indirect mode: The processor reads the register content (R1 in this case), which is not the operand itself but the address of the memory location where the operand is stored. The fetched value is then added to the value in register R0.
Example: ADD (R1), R0

Indexed / relative addressing mode: The processor takes the R1 register content as the base address and adds the constant 20 (offset / displacement) to it to obtain the effective memory address of the operand. It fetches that value and adds it to register R2.
Example: ADD 20(R1), R2

Auto-increment and auto-decrement modes: The value in the register / address supplied in the instruction is incremented or decremented.
Example: Increment R1 (increments the given register / address content by one)
Example: Decrement R2 (decrements the given register / address content by one)
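As a rough analogy, not taken from these notes, C expressions mirror several of these addressing modes. The sketch below labels each access with the mode it resembles; the "register" and "memory" variables are invented purely for illustration.

    #include <stdio.h>

    int main(void) {
        int mem[32] = {0};        /* a toy "memory"                                 */
        int r0 = 0, r2 = 0;       /* toy "registers"                                */
        int r1 = 5;               /* r1 holds an index/address into mem             */

        mem[5]  = 40;
        mem[25] = 7;

        r0 = 200;                 /* immediate mode:  MOV #200, R0                  */
        r2 = r0;                  /* register mode:   value of one register copied
                                     to another (like MOV R1, R2)                   */
        r0 += mem[r1];            /* indirect mode:   ADD (R1), R0 -> r0 = 200 + 40 */
        r2 += mem[r1 + 20];       /* indexed mode:    ADD 20(R1), R2 -> r2 = 200 + 7 */
        r1++;                     /* auto-increment:  Increment R1                  */

        printf("r0=%d r2=%d r1=%d\n", r0, r2, r1);   /* prints r0=240 r2=207 r1=6   */
        return 0;
    }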
UNIT II ARITHMETIC OPERATIONS

ALU – Addition and subtraction – Multiplication – Division – Floating point operations – Subword parallelism.
2.1 ALU
2.2 Addition and subtraction
2.3 Multiplication
2.4 Division
2.5 Floating point operations
2.6 Subword parallelism

2.1 ALU (ARITHMETIC LOGIC UNIT)

An arithmetic logic unit (ALU) is a digital electronic circuit that performs arithmetic and bitwise logical operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs). A single CPU, FPU or GPU may contain multiple ALUs.

The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed; the ALU's output is the result of the performed operation. In many designs, the ALU also exchanges additional information with a status register, which relates to the result of the current or previous operation.

FIGURE: A symbolic representation of an ALU and its input and output signals, indicated by arrows pointing into or out of the ALU, respectively. Each arrow represents one or more signals.

Signals: An ALU has a variety of input and output nets, which are the shared electrical connections used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs.

Data: A basic ALU has three parallel data buses consisting of two input operands (A and B) and a result output (Y). Each data bus is a group of signals that conveys one binary integer number. Typically, the A, B and Y bus widths (the number of signals comprising each bus) are identical and match the native word size of the encapsulating CPU (or other processor).

Opcode: The opcode input is a parallel bus that conveys to the ALU an operation selection code, an enumerated value that specifies the desired arithmetic or logic operation to be performed. The opcode size (its bus width) is related to the number of different operations the ALU can perform; for example, a four-bit opcode can specify up to sixteen different ALU operations. Generally, an ALU opcode is not the same as a machine language opcode, though in some cases it may be directly encoded as a bit field within a machine language opcode.

Status: The status outputs are various individual signals that convey supplemental information about the result of an ALU operation. These outputs are usually stored in registers so they can be used in future ALU operations or for controlling conditional branching. The collection of bit registers that store the status outputs is often treated as a single, multi-bit register, referred to as the "status register" or "condition code register". General-purpose ALUs commonly have status signals such as:
- Carry-out, which conveys the carry resulting from an addition operation, the borrow resulting from a subtraction operation, or the overflow bit resulting from a binary shift operation.
- Zero, which indicates all bits of the Y bus are logic zero.
- Negative, which indicates the result of an arithmetic operation is negative.
- Overflow, which indicates the result of an arithmetic operation has exceeded the numeric range of the Y bus.
- Parity, which indicates whether an even or odd number of bits on the Y bus are logic one.

The status input allows additional information to be made available to the ALU when performing an operation. Typically, this is a "carry-in" bit that is the stored carry-out from a previous ALU operation.

Circuit operation: An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time (known as the "propagation delay") has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs. The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, and for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result.

For example, a CPU begins an ALU addition operation by routing operands from their sources (usually registers) to the ALU's operand inputs, while the control unit simultaneously applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU also routes the ALU result output to a destination register that will receive the sum. The ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock. When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation.

FIGURE: The combinational logic circuitry of the 74181 integrated circuit, a simple four-bit ALU.

Functions: A number of basic arithmetic and bitwise logic functions are commonly supported by ALUs. Basic, general-purpose ALUs typically include these operations in their repertoires:

Arithmetic operations:
- Add: A and B are summed and the sum appears at Y and carry-out.
- Add with carry: A, B and carry-in are summed and the sum appears at Y and carry-out.
- Subtract: B is subtracted from A (or vice versa) and the difference appears at Y and carry-out. For this function, carry-out is effectively a "borrow" indicator. This operation may also be used to compare the magnitudes of A and B; in such cases the Y output may be ignored by the processor, which is only interested in the status bits (particularly zero and negative) that result from the operation.
- Subtract with borrow: B is subtracted from A (or vice versa) with borrow (carry-in) and the difference appears at Y and carry-out (borrow out).
- Two's complement (negate): A (or B) is subtracted from zero and the difference appears at Y.
- Increment: A (or B) is increased by one and the resulting value appears at Y.
- Decrement: A (or B) is decreased by one and the resulting value appears at Y.
- Pass through: all bits of A (or B) appear unmodified at Y. This operation is typically used to determine the parity of the operand or whether it is zero or negative.

Bitwise logical operations:
- AND: the bitwise AND of A and B appears at Y.
- OR: the bitwise OR of A and B appears at Y.
- Exclusive-OR: the bitwise XOR of A and B appears at Y.
- One's complement: all bits of A (or B) are inverted and appear at Y.
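The behaviour of such an ALU can be sketched in software. The C fragment below is a simplified illustration only, not a description of any particular chip; the opcode values are invented. It implements a few of the arithmetic and logical functions listed above for 8-bit operands and derives carry, zero and negative status bits.

    #include <stdio.h>
    #include <stdint.h>

    enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR };   /* invented opcode values */

    typedef struct { uint8_t y; int carry, zero, negative; } alu_result;

    static alu_result alu(int op, uint8_t a, uint8_t b) {
        alu_result r = {0, 0, 0, 0};
        uint16_t wide = 0;                 /* one bit wider than the operands, to catch carry */
        switch (op) {
            case OP_ADD: wide = (uint16_t)a + b;                  break;
            case OP_SUB: wide = (uint16_t)a + (uint8_t)(~b) + 1;  break; /* add the 2's complement of B */
            case OP_AND: wide = a & b;                            break;
            case OP_OR:  wide = a | b;                            break;
            case OP_XOR: wide = a ^ b;                            break;
        }
        r.y        = (uint8_t)wide;        /* result bus Y              */
        r.carry    = (wide >> 8) & 1;      /* carry-out / borrow status */
        r.zero     = (r.y == 0);           /* zero status               */
        r.negative = (r.y >> 7) & 1;       /* sign bit as negative flag */
        return r;
    }

    int main(void) {
        alu_result r = alu(OP_SUB, 5, 5);
        printf("Y=%u carry=%d zero=%d negative=%d\n", r.y, r.carry, r.zero, r.negative);
        return 0;
    }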
Bit shift operations: ALU shift operations cause operand A (or B) to shift left or right (depending on the opcode), and the shifted operand appears at Y. Simple ALUs typically can shift the operand by only one bit position, whereas more complex ALUs employ barrel shifters that allow them to shift the operand by an arbitrary number of bits in one operation. In all single-bit shift operations, the bit shifted out of the operand appears on carry-out; the value of the bit shifted into the operand depends on the type of shift.
- Arithmetic shift: the operand is treated as a two's complement integer, meaning that the most significant bit is a "sign" bit and is preserved.
- Logical shift: a logic zero is shifted into the operand. This is used to shift unsigned integers.
- Rotate: the operand is treated as a circular buffer of bits so its least and most significant bits are effectively adjacent.
- Rotate through carry: the carry bit and operand are collectively treated as a circular buffer of bits.

Big endian: the most significant byte is stored in the smallest address.
Little endian: the least significant byte is stored in the smallest address.

Fixed-point arithmetic: A fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (the decimal point '.' in English decimal notation). Fixed-point representation can be compared to the more complicated (and more computationally demanding) floating-point representation. Fixed-point numbers are useful for representing fractional values, usually in base 2 or base 10, when the executing processor has no floating point unit (FPU) or when fixed point provides improved performance or accuracy for the application at hand.

Sign-magnitude: The sign-magnitude binary format is the simplest conceptual format. To represent a number in sign-magnitude, we simply use the leftmost bit to represent the sign (0 means positive) and the remaining bits to represent the magnitude:

B7 (sign) | B6 B5 B4 B3 B2 B1 B0 (magnitude)

What are the decimal values of the following 8-bit sign-magnitude numbers?
10000011 = -3
00000101 = +5
11111111 = ?
01111111 = ?

1's complement: The 1's complement of a number is found by changing all 1s to 0s and all 0s to 1s. This is called taking the complement, or the 1's complement. (Example: see figure.)

2's complement: The 2's complement of a binary number is obtained by adding 1 to the least significant bit (LSB) of the 1's complement of the number: 2's complement = 1's complement + 1. (Example: see figure.)

2.2 ADDITION AND SUBTRACTION

Binary addition: In the fourth case of the binary addition rules, the addition produces a sum of 1 + 1 = 10, i.e. 0 is written in the given column and a carry of 1 goes over to the next column. (Example – Addition: see figure.)

Half adder and full adder circuits

Half adder: The half adder adds two binary digits, called the augend and the addend, and produces two outputs, sum and carry; XOR is applied to both inputs to produce the sum, and an AND gate is applied to both inputs to produce the carry. Using a half adder, you can design simple addition with the help of logic gates. (Half adder logic circuit, block diagram and truth table: see figures.)

Full adder: An adder is a digital circuit that performs addition of numbers. The full adder adds three one-bit numbers, where two can be referred to as operands and one as the bit carried in,
and it produces a 2-bit output, which can be referred to as the output carry and the sum. This adder is more difficult to implement than a half adder. The difference between a half adder and a full adder is that the full adder has three inputs and two outputs, whereas the half adder has only two inputs and two outputs. The first two inputs are A and B and the third input is an input carry, C-IN. When full-adder logic is designed, you string eight of them together to create a byte-wide adder and cascade the carry bit from one adder to the next.

(Full adder truth table and implementation of a full adder with two half adders: see figures.)

N-bit parallel adder: The full adder is capable of adding only two single-digit binary numbers along with a carry input, but in practice we need to add binary numbers that are much longer than one bit. To add two n-bit binary numbers we use the n-bit parallel adder. It uses a number of full adders in cascade: the carry output of each full adder is connected to the carry input of the next full adder.

4-bit parallel adder: In the block diagram, A0 and B0 represent the LSBs of the four-bit words A and B. Hence full adder 0 is the lowest stage, and its Cin has been permanently made 0. The rest of the connections are exactly the same as those of the n-bit parallel adder shown in the figure. The four-bit parallel adder is a very common logic circuit. (Block diagram of the n-bit parallel adder: see figure.)

Binary subtraction: Subtraction and borrow are the two words used very frequently for binary subtraction. The operation A - B is performed using four rules:
1. Take the 2's complement of B.
2. Result = A + 2's complement of B.
3. If a carry is generated, the result is positive and in true form; in this case the carry is ignored.
4. If a carry is not generated, the result is negative and in 2's complement form.

N-bit parallel subtractor: The subtraction can be carried out by taking the 1's or 2's complement of the number to be subtracted. For example, we can perform the subtraction (A - B) by adding either the 1's or the 2's complement of B to A. That means we can use a binary adder to perform binary subtraction.

4-bit parallel subtractor: The number to be subtracted (B) is first passed through inverters to obtain its 1's complement. The 4-bit adder then adds A and the 2's complement of B to produce the subtraction. S3 S2 S1 S0 represents the result of the binary subtraction (A - B), and the carry output Cout represents the polarity of the result: if A >= B then Cout = 1 and the result is in true binary form; if A < B then Cout = 0 and the result is in 2's complement form. (Block diagram of the n-bit subtractor: see figure.)

Half subtractors: A half subtractor is a combinational circuit with two inputs and two outputs (difference and borrow). It produces the difference between the two binary bits at the input and also produces an output (borrow) to indicate whether a 1 has been borrowed. In the subtraction (A - B), A is called the minuend bit and B the subtrahend bit. (Truth table and circuit diagram: see figures.)

Full subtractors: The disadvantage of a half subtractor is overcome by the full subtractor. The full subtractor is a combinational circuit with three inputs A, B, C and two outputs D and C'. A is the minuend, B is the subtrahend, C is the borrow produced by the previous stage, D is the difference output and C' is the borrow output. (Truth table and circuit diagram: see figures.)
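To see how the cascaded full adders of an n-bit parallel adder behave, here is a small C simulation (an illustrative sketch only, not hardware description code). It builds a 4-bit ripple-carry adder out of a one-bit full-adder function and then reuses it with an inverted B and a carry-in of 1 to perform the subtraction described above.

    #include <stdio.h>

    /* One-bit full adder: sum = a XOR b XOR cin, carry = majority(a, b, cin). */
    static void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }

    /* 4-bit ripple-carry add: the carry out of each stage feeds the next stage. */
    static unsigned ripple_add4(unsigned a, unsigned b, int cin, int *cout) {
        unsigned result = 0;
        for (int i = 0; i < 4; i++) {
            int s;
            full_adder((a >> i) & 1, (b >> i) & 1, cin, &s, &cin);
            result |= (unsigned)s << i;
        }
        *cout = cin;
        return result;
    }

    int main(void) {
        int cout;
        unsigned sum = ripple_add4(5, 3, 0, &cout);                 /* 5 + 3 = 8 */
        printf("5 + 3 = %u, carry = %d\n", sum, cout);

        /* Subtraction A - B as A + (~B) + 1, as in the 4-bit parallel subtractor. */
        unsigned a = 7, b = 2;
        unsigned diff = ripple_add4(a, (~b) & 0xF, 1, &cout);
        printf("7 - 2 = %u, Cout = %d (1 means the result is in true form)\n", diff, cout);
        return 0;
    }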
2.3 MULTIPLICATION

Multiplication of decimal numbers in longhand can be used to show the steps of multiplication and the names of the operands. Binary multiplication is similar to decimal multiplication, and it is simpler because only 0s and 1s are involved. There are four rules of binary multiplication. (Example: see figure.)

The number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier: the multiplication of an n-bit multiplicand and an m-bit multiplier yields a product that is n + m bits long (sign bit ignored), so n + m bits are required to represent all possible products. One also has to consider the overflow condition. The multiplicand is moved left each step, and the multiplier is moved right after each bit has performed its intermediate step. The number of iterations needed to find the product equals the number of bits in the multiplier; in MIPS this means 32 iterations.

Example: Multiply 2 (decimal) x 3 (decimal), i.e. 0010 x 0011 in binary (4 bits are used to save space).

Booth's algorithm of multiplication. General steps (a software sketch of these steps follows below):
Step 1: Take the multiplicand M and the multiplier Q; initialize A and the appended bit Q(-1) to 0, and set the sequence counter SC to the number of bits n.
Step 2: Check the bit pair Q(0) (the LSB of Q) and Q(-1).
Step 3: If the bits are 0,1, add M to A, then perform an arithmetic right shift of A, Q, Q(-1).
Step 4: If the bits are 1,0, perform A = A + (M)' + 1 (i.e. subtract M), then perform the arithmetic right shift. (If the bits are 0,0 or 1,1, only the shift is performed.)
Step 5: Decrement SC and check whether it has reached 0.
Step 6: Repeat steps 2-5 until the count reaches 0; A and Q then hold the product.

Flow chart / diagram of Booth's algorithm of multiplication: see figure.
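A minimal software sketch of Booth's algorithm for 8-bit operands is given below. It is illustrative only and mirrors the steps above rather than any particular hardware; the A register, the Q register and the extra bit Q(-1) are kept in separate variables.

    #include <stdio.h>
    #include <stdint.h>

    /* Booth's algorithm: multiply two signed 8-bit numbers, returning a 16-bit product. */
    static int16_t booth_multiply(int8_t m, int8_t q) {
        uint8_t A = 0, Q = (uint8_t)q, M = (uint8_t)m;
        int q_1 = 0;                                   /* the appended bit Q(-1)         */
        for (int sc = 8; sc > 0; sc--) {               /* sequence counter               */
            int pair = ((Q & 1) << 1) | q_1;           /* bits Q(0), Q(-1)               */
            if (pair == 1) A += M;                     /* 0,1 : A = A + M                */
            if (pair == 2) A += (uint8_t)(~M) + 1;     /* 1,0 : A = A + M' + 1 (A - M)   */
            /* Arithmetic right shift of the combined A, Q, Q(-1). */
            q_1 = Q & 1;
            Q   = (Q >> 1) | ((A & 1) << 7);
            A   = (A >> 1) | (A & 0x80);               /* preserve the sign bit of A     */
        }
        return (int16_t)(((uint16_t)A << 8) | Q);
    }

    int main(void) {
        printf("2 x 3  = %d\n", booth_multiply(2, 3));    /* expected 6   */
        printf("-7 x 5 = %d\n", booth_multiply(-7, 5));   /* expected -35 */
        return 0;
    }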
2.4 DIVISION

A division algorithm is an algorithm which, given two integers N and D, computes their quotient and/or remainder, the result of the division. The longhand procedure is:
1. Align the dividend and divisor with their most significant digits.
2. Test how many times n the divisor fits into the locally aligned dividend.
3. n is the value of the quotient digit.
4. Subtract the divisor n times from the locally aligned dividend.
5. Extend the local remainder by the next less significant digit of the dividend, forming a new local dividend.
6. Repeat steps 2-5 until all digits of the dividend are considered.
7. The local remainder after the last subtraction is the remainder of the division.

Binary division hardware:
- Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step.
- The Remainder register is initialized with the dividend.

(Example – Division: see figure.)

Division algorithms fall into two main categories: slow division and fast division. Slow division algorithms produce one digit of the final quotient per iteration; examples include restoring, non-performing restoring, non-restoring, and SRT division. Fast division methods start with a close approximation to the final quotient and produce twice as many digits of the final quotient on each iteration; Newton-Raphson and Goldschmidt fall into this category.

Sequential restoring division:
- A shift register keeps both the (remaining) dividend and the quotient.
- With each cycle, the dividend decreases by one digit and the quotient increases by one digit.
- The MSBs of the remaining dividend and the divisor are aligned in each cycle.
- Major difference from multiplication: (1) we do not know whether the divisor can be subtracted or not; (2) if the subtraction failed, we have to restore the original dividend.

Procedure:
1. Load the 2n-bit dividend into both halves of the shift register, and add a sign bit to the left.
2. Add a sign bit to the left of the divisor.
3. Generate the 2's complement of the divisor.
4. Shift to the left.
5. Add the 2's complement of the divisor to the upper half of the shift register, including the sign bit (subtract).
6. If the sign of the result is cleared (positive), set the LSB of the lower half of the shift register to one; else clear the LSB of the lower half and add the divisor back to the upper half of the shift register.
7. Repeat from step 4, performing the loop n times.
8. After termination, the lower half of the shift register holds the quotient and the upper half holds the remainder.

Restoring algorithm (register view; a C sketch of this algorithm appears after the non-restoring variant below):
Assume X is a k-bit register holding the dividend, Y the k-bit divisor, and S a sign bit.
Start: Load 0 into the k-bit accumulator A and load the dividend X into the k-bit register MQ.
Step A: Shift the 2k-bit register pair A-MQ left.
Step B: Subtract the divisor Y from A.
Step C: If the sign of A (MSB) is 1, reset MQ0 (the LSB) to 0; else set it to 1.
Step D: If MQ0 = 0, add Y back to A (restoring the effect of the earlier subtraction).
Steps A to D repeat until the total number of cycles equals k. At the end, A holds the remainder and MQ holds the quotient.

(Restoring-division example: dividing 1000 (8) by 11 (3) over four cycles gives quotient 0010 (2) in MQ and remainder 10 (2) in A; the cycle-by-cycle register trace is shown in the figure.)

Equivalently, per cycle the restoring algorithm does: shift A and Q left one binary position; subtract M from A, placing the answer back in A; if the sign of A is 1, set q0 to 0 and add M back to A (restore A); otherwise, set q0 to 1. This is repeated n times.

The non-restoring division algorithm:
Step 1: Do n times: if the sign of A is 0, shift A and Q left one binary position and subtract M from A; otherwise, shift A and Q left and add M to A. Then, if the sign of A is 0, set q0 to 1; otherwise set q0 to 0.
Step 2: If the sign of A is 1, add M to A (a final restore to obtain the correct remainder).
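The restoring algorithm above can be sketched in a few lines of C. This is only an illustration of steps A-D for small unsigned operands; the register widths and variable names are chosen for convenience.

    #include <stdio.h>
    #include <stdint.h>

    /* Restoring division of two small unsigned numbers (n-bit), following steps A-D. */
    static void restoring_divide(unsigned dividend, unsigned divisor, int n,
                                 unsigned *quotient, unsigned *remainder) {
        int32_t A = 0;                 /* accumulator, signed so its sign can be tested */
        uint32_t Q = dividend;         /* MQ register, initially holds the dividend     */
        for (int i = 0; i < n; i++) {
            /* Step A: shift the A-Q pair left by one bit. */
            A = (A << 1) | ((Q >> (n - 1)) & 1);
            Q = (Q << 1) & ((1u << n) - 1);
            /* Step B: subtract the divisor from A. */
            A -= (int32_t)divisor;
            if (A < 0) {               /* Steps C/D: sign set -> q0 = 0 and restore A   */
                Q &= ~1u;
                A += (int32_t)divisor;
            } else {                   /* sign clear -> q0 = 1                          */
                Q |= 1u;
            }
        }
        *quotient  = Q;
        *remainder = (unsigned)A;
    }

    int main(void) {
        unsigned q, r;
        restoring_divide(8, 3, 4, &q, &r);       /* the worked example: 8 / 3 */
        printf("8 / 3 : quotient = %u, remainder = %u\n", q, r);   /* 2 and 2 */
        return 0;
    }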
Floating-Point Number Representation
A floating-point number (or real number) can represent a very large value (1.23×10^88) or a very small value (1.23×10^-88). It can also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero. A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form F×r^E. Decimal numbers use a radix of 10 (F×10^E), while binary numbers use a radix of 2 (F×2^E).
The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 is normalized as 1.234567×10^2, and the binary number 1010.1011B is normalized as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers even within a small range, say 0.0 to 0.1, whereas an n-bit binary pattern can represent only a finite set of 2^n distinct values. Hence, not all real numbers can be represented; the nearest approximation is used instead, resulting in a loss of accuracy. It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic; it can be sped up with a dedicated floating-point coprocessor. Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation with a fraction (F), an exponent (E), and a radix of 2, in the form F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two common representation schemes: 32-bit single precision and 64-bit double precision.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
In the 32-bit single-precision floating-point representation:
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 8 bits represent the exponent (E).
The remaining 23 bits represent the fraction (F).
Normalized Form
Let's illustrate with an example. Suppose the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is formed with an implicit leading 1, i.e., 1.F. In this example the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D. The sign bit gives the sign of the number: with S = 1 this is a negative number, i.e., -1.375D. In normalized form, the actual exponent is E-127 (the so-called excess-127 or bias-127 scheme), because we need to represent both positive and negative exponents. With an 8-bit E ranging from 0 to 255, the excess-127 scheme could provide actual exponents from -127 to 128 (the two extreme stored values are reserved for special cases, as noted below). In this example, E-127 = 129-127 = 2D. Hence, the number represented is -1.375×2^2 = -5.5D.
De-Normalized Form
The normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero. Convince yourself of this! The de-normalized form was devised to represent zero and other very small numbers. For E = 0, the numbers are in de-normalized form: an implicit leading 0 (instead of 1) is used for the fraction, and the actual exponent is always -126. Hence, the number zero can be represented with E = 0 and F = 0 (because 0.0×2^-126 = 0). We can also represent very small positive and negative numbers in de-normalized form with E = 0. For example, if S = 1, E = 0, and F = 011 0000 0000 0000 0000 0000, the actual fraction is 0.011 = 1×2^-2 + 1×2^-3 = 0.375D. Since S = 1, it is a negative number. With E = 0, the actual exponent is -126. Hence the number is -0.375×2^-126 ≈ -4.4×10^-39, an extremely small negative number close to zero.
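The decoding rules above (sign bit, excess-127 exponent, implicit leading 1 or 0) can be checked with a short sketch. This is only an illustration: the helper name decode_single and the use of a plain integer for the 32-bit pattern are choices made here, and the special values infinity and NaN (stored exponent 255) are ignored.

def decode_single(bits):
    """Decode a 32-bit IEEE-754 single-precision pattern given as an integer."""
    s = (bits >> 31) & 0x1            # sign bit
    e = (bits >> 23) & 0xFF           # 8-bit stored exponent
    f = bits & 0x7FFFFF               # 23-bit fraction field
    sign = -1.0 if s else 1.0
    if e == 0:                        # de-normalized: implicit leading 0, exponent -126
        return sign * (f / (1 << 23)) * 2.0 ** -126
    # normalized: implicit leading 1, actual exponent E - 127
    return sign * (1.0 + f / (1 << 23)) * 2.0 ** (e - 127)

# The normalized example from the text: 1 10000001 011...0
print(decode_single(0b1_10000001_01100000000000000000000))   # -5.5
# The de-normalized example: S = 1, E = 0, F = 011...0
print(decode_single(0b1_00000000_01100000000000000000000))   # about -4.4e-39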
IEEE Standard 754 Floating Point Numbers
There are several ways to represent real numbers on computers. Fixed-point representation places a radix point somewhere in the middle of the digits and is equivalent to using integers that represent portions of some unit. For example, one might represent 1/100ths of a unit; with four decimal digits, one could represent 10.82 or 00.01. Another approach is to use rationals, representing every number as the ratio of two integers. Floating-point representation – the most common solution – uses scientific notation to encode numbers, with a base number and an exponent. For example, 123.456 could be represented as 1.23456×10^2. In hexadecimal, the number 123.abc might be represented as 1.23abc×16^2. In binary, the number 10100.110 could be represented as 1.0100110×2^4. Floating point solves a number of representation problems. Fixed point has a fixed window of representation, which limits it from representing very large or very small numbers; it is also prone to loss of precision when two large numbers are divided.
Floating-point, on the other hand, employs a sort of "sliding window" of precision appropriate to the scale of the number. This allows it to represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease, while maximizing precision (the number of digits) at both ends of the scale.
Storage Layout
IEEE floating-point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored. The layout for single (32-bit) and double (64-bit) precision floating-point values is shown below; the number of bits for each field is given, with bit ranges in square brackets (bit 00 is the least significant bit):
                      Sign      Exponent      Fraction
Single precision      1 [31]    8 [30-23]     23 [22-00]
Double precision      1 [63]    11 [62-52]    52 [51-00]
The Sign Bit
The sign bit is as simple as it gets: 0 denotes a positive number, and 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.
The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field, and a stored value of 200 indicates an exponent of (200-127), or 73. For reasons discussed later, exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit (left of the radix point) and the fraction bits (to the right of the radix point). To see where the implicit leading bit comes from, note that any number can be expressed in scientific notation in many different ways. For example, the number 50 can be represented as any of these:
0.5000 × 10^2
0.050 × 10^3
5000 × 10^-2
In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form, which puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.000 × 10^0.
2.6 SUBWORD PARALLELISM
A subword is a lower-precision unit of data contained within a word. In subword parallelism, multiple subwords are packed into a word and whole words are then processed. With appropriate subword boundaries, this technique results in parallel processing of subwords. Since the same instruction is applied to all subwords within the word, this is a form of SIMD (Single Instruction, Multiple Data) processing. It is possible to apply subword parallelism to non-contiguous subwords of different sizes within a word, but in practice the implementation is simplest when the subwords are the same size and are contiguous within a word. The data-parallel programs that benefit from subword parallelism tend to process data of the same size. For example, if the word size is 64 bits and the subword sizes are 8, 16, and 32 bits, an instruction can operate on eight 8-bit subwords, four 16-bit subwords, two 32-bit subwords, or one 64-bit word in parallel. Subword parallelism is an efficient and flexible solution for media processing, because media algorithms exhibit a great deal of data parallelism on lower-precision data. It is also useful for computations unrelated to multimedia that exhibit data parallelism on lower-precision data.
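As a software illustration of the idea (a sketch only: the masking trick below mimics what a partitioned ALU does in hardware, and the function name packed_add8 and the 64-bit word size are assumptions made for the example), eight 8-bit additions can be carried out inside one 64-bit word by preventing carries from crossing subword boundaries:

def packed_add8(x, y):
    """Add eight 8-bit subwords packed in 64-bit words x and y, element-wise,
    without letting carries cross the 8-bit boundaries (wrap-around per subword)."""
    LOW7  = 0x7F7F7F7F7F7F7F7F       # low 7 bits of every byte
    HIGH1 = 0x8080808080808080       # top bit of every byte
    low_sum = (x & LOW7) + (y & LOW7)            # add low 7 bits of each byte
    return (low_sum ^ (x & HIGH1) ^ (y & HIGH1)) & 0xFFFFFFFFFFFFFFFF

# pack the bytes [1, 2, 3, 250] twice over and add 5 to every lane
x = int.from_bytes(bytes([1, 2, 3, 250] * 2), "little")
y = int.from_bytes(bytes([5] * 8), "little")
print(list(packed_add8(x, y).to_bytes(8, "little")))   # [6, 7, 8, 255, 6, 7, 8, 255]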
Graphics and audio applications can take advantage of performing simultaneous operations on short vectors Example: 128-bit adder: Sixteen 8-bit adds Eight 16-bit adds Four 32-bit adds Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD) UNIT III PROCESSOR AND CONTROL UNIT Basic MIPS implementation – Building data path – Control Implementation scheme – Pipelining – Pipelined data path and control – Handling Data hazards & Control hazards – Exceptions. 3.1 Basic MIPS implementation 3.2 Building data path 3.3 Control Implementation scheme 3.4 Pipelining 3.5 Pipelined data path and control 3.6 Handling Data hazards & Control hazards 3.7 Exceptions. 3.1 A BASIC MIPS IMPLEMENTATION We will be examining an implementation that includes a subset of the core MIPS instruction set:  The memory-reference instructions load word (lw) and store word (sw)  The arithmetic-logical instructions add, sub, AND, OR, and slt  The instructions branch equal (beq) and jump (j), which we add last This subset does not include all the integer instructions (for example, shift, multiply, and divide are missing), nor does it include any floating-point instructions. However, the key principles used in creating a data path and designing the control are illustrated. In examining the implementation, we will have the opportunity to see how the instruction set architecture determines aspects of the implementation, and how the choice of various implementation strategies affects the clock rate and CPI for the computer. In addition, most concepts used to implement the MIPS subset in this chapter are the same basic ideas that are used to construct a broad spectrum of computers, from high-performance servers to general- purpose microprocessors to embedded processors. An Overview of the Implementation The core MIPS instructions, including the integer arithmetic-logical instructions, the memory-reference instructions, and the branch instructions. Much of what needs to be done to implement these instructions is the same, independent of the exact class of instruction. For every instruction, the first two steps are identical: 1. Send the program counter (PC) to the memory that contains the code and fetch the instruction from that memory. 2. Read one or two registers, using fields of the instruction to select the registers to read. For the load word instruction, we need to read only one register, but most other instructions require that we read two registers. After these two steps, the actions required to complete the instruction depend on the instruction class. Fortunately, for each of the three instruction classes (memory-reference, arithmetic-logical, and branches), the actions are largely the same, independent of the exact instruction. The simplicity and regularity of the MIPS instruction set simplifies the implementation by making the execution of many of the instruction classes similar. For example, all instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading the registers. The memory-reference instructions use the ALU for an address calculation, the arithmetic-logical instructions for the operation execution, and branches for comparison. After using the ALU, the actions required to complete various instruction classes differ. A memory-reference instruction will need to access the memory either to read data for a load or write data for a store. An arithmetic-logical or load instruction must write the data from the ALU or memory back into a register. 
Lastly, for a branch instruction, we may need to change the next instruction address based on the comparison; otherwise, the PC should be incremented by 4 to get the address of the next instruction. Figure 3.1 shows the high-level view of a MIPS implementation, focusing on the various functional units and their interconnection. Although this figure shows most of the flow of data through the processor, it omits two important aspects of instruction execution. FIGURE 3.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them. All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or a compare (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals. The arrows are used to guide the reader in knowing how information flows. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross. Figure 3.1 shows data going to a particular unit as coming from two different sources. For example, the value written into the PC can come from one of two adders, the data written into the register file can come from either the ALU or the data memory, and the second input to the ALU can come from a register or the immediate field of the instruction. In practice, these data lines cannot simply be wired together; we must add a logic element that chooses from among the multiple sources and steers one of those sources to its destination. This selection is commonly done with a device called a multiplexor, although this device might better be called a data selector. The control lines are set based primarily on information taken from the instruction being executed. The the data memory must read on a load and write on a store. The register file must be written on a load and an arithmetic-logical instruction. And, of course, the ALU must perform one of several operations. Like the multiplexors, these operations are directed by control lines that are set on the basis of various fields in the instruction. Figure 3.2 shows the data path of Figure 3.1 with the three required multiplexors added, as well as control lines for the major functional units. A control unit, which has the instruction as an input, is used to determine how to set the control lines for the functional units and two of the multiplexors. 
The third multiplexor, which determines whether PC + 4 or the branch destination address is written into the PC, is set based on the Zero output of the ALU, which is used to perform the comparison of a beq instruction. The regularity and simplicity of the MIPS instruction set means that a simple decoding process can be used to determine how to set the control lines. Logic Design Conventions FIGURE 3.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines. The top multiplexor (―Mux‖) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that ―ANDs‖ together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction OR a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation. The data path elements in the MIPS implementation consist of two different types of logic elements: elements that operate on data values and elements that contain state. The elements that operate on data values are all combinational, which means that their outputs depend only on the current inputs. Given the same input, a combinational element always produces the same output. A state element has at least two inputs and one output. The required inputs are the data value to be written into the element and the clock, which determines when the data value is written. The output from a state element provides the value that was written in an earlier clock cycle. Logic components that contain state are also called sequential, because their outputs depend on both their inputs and the contents of the internal state. A clocking methodology defines when signals can be read and when they can be written. It is important to specify the timing of reads and writes, because if a signal is written at the same time it is read, the value of the read could correspond to the old value, the newly written value, or even some mix of the two! Computer designs cannot tolerate such unpredictability. A clocking methodology is designed to ensure predictability. An edge-triggered clocking methodology means that any values stored in a sequential logic element are updated only on a clock edge. Because only state elements can store a data value, any collection of combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements. The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle. Figure3 .3 shows the two state elements surrounding a block of combinational logic, which operates in a single clock cycle: all signals must propagate from state element 1, through the combinational logic, and to state element 2 in the time of one clock cycle. 
The time necessary for the signals to reach state element 2 defines the length of the clock cycle FIGURE 3.3 Combinational logic, state elements, and the clock are closely related. In a synchronous digital system, the clock determines when elements with state will write values into internal storage. Any inputs to a state element must reach a stable value (that is, have reached a value from which they will not change until after the clock edge) before the active clock edge causes the state to be updated. Control signal when a state element is written on every active clock edge. In contrast, if a State element is not updated on every clock, then an explicit write control signal is required. Both the clock signal and the write control signal are inputs, and the state element is changed only when the write control signal is asserted and a clock edge occurs. An edge-triggered methodology allows us to read the contents of a register, send the value through some combinational logic, and write that register in the same clock cycle. FIGURE 3.4 An edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values. The clock cycle still must be long enough so that the input values are stable when the active clock edge occurs. Feedback cannot occur within one clock cycle because of the edgetriggered update of the state element. If feedback were possible, this design could not work properly. 3.2 BUILDING A DATAPATH Figure 3.5a shows the first element we need: a memory unit to store the instructions of a program and supply instructions given an address. Figure 3.5b also shows the program counter (PC), a register that holds the address of the current instruction. Lastly, we will need an adder to increment the PC to the address of the next instruction. This adder, which is combinational, can be built from the ALU described by wiring the control lines so that the control always specifies an add Operation. FIGURE 3.5 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address. The state elements are the instruction memory and the program counter. The instruction memory need only provide read access because the data path does not write instructions. Since the instruction memory only reads, we treat it as combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. (We will need to write the instruction memory when we load the program; this is not hard to add, and we ignore it for simplicity.) The program counter is a 32‑bit register that is written at the end of every clock cycle and thus does not need a write control signal. The adder is an ALU wired to always add its two 32‑bit inputs and place the sum on its output. FIGURE 3.6 A portion of the data path used for fetching instructions and incrementing the program counter. We will draw such an ALU with the label Add, as in Figure 3.5, to indicate that it has been permanently made an adder and cannot perform the other ALU functions. To execute any instruction, we must start by fetching the instruction from memory. To prepare for executing the next instruction, we must also increment the program counter so that it points at the next instruction, 4 bytes later. 
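In software form, the fetch step just described amounts to one memory read and one add. The sketch below is only a model (the dictionary standing in for the instruction memory, the example encodings, and the function name fetch are choices made here for illustration):

def fetch(pc, instruction_memory):
    """Model of the fetch step: read the instruction at PC and compute PC + 4."""
    instruction = instruction_memory[pc]    # combinational read, no read control needed
    next_pc = pc + 4                        # dedicated adder wired to always add 4
    return instruction, next_pc

# a toy instruction memory holding two 32-bit words
imem = {0x0040_0000: 0x0129_8020,           # e.g. an R-format add
        0x0040_0004: 0x8D09_0004}           # e.g. a lw
inst, pc = fetch(0x0040_0000, imem)
print(hex(inst), hex(pc))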
Figure 3.6 shows how to combine the three elements from Figure 3.5 to form a data path that fetches instructions and increments the PC to obtain the address of the next sequential instruction. The processor‘s 32 general-purpose registers are stored in a structure called a register file. A register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the computer. In addition, we will need an ALU to operate on the values read from the registers. R-format instructions have three register operands, so we will need to read two data words from the register file and write one data word into the register file for each instruction. For each data word to be read from the registers, we need an input to the register file that specifies the register number to be read and an output from the register file that will carry the value that has been read from the registers. To write a data word, we will need two inputs: one to specify the register number to be written and one to supply the data to be written into the register. The register file always outputs the contents of whatever register numbers are on the Read register inputs. Writes, however, are controlled by the write control signal, which must be asserted for a write to occur at the clock edge. FIGURE 3.7 The two elements needed to implement R-format ALU operations are the register file and the ALU. The register file contains all the registers and has two read ports and one write port. The design of multiport register files. The register file always outputs the contents of the registers corresponding to the Read register inputs on the outputs; no other control inputs are needed. In contrast, a register write must be explicitly indicated by asserting the write control signal. Remember that writes are edge-triggered, so that all the write inputs (i.e., the value to be written, the register number, and the write control signal) must be valid at the clock edge. Since writes to the register file are edge-triggered, our design can legally read and write the same register within a clock cycle: the read will get the value written in an earlier clock cycle, while the value written will be available to a read in a subsequent clock cycle. The inputs carrying the register number to the register file are all 5 bits wide, whereas the lines carrying data values are 32 bits wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which will be 4 bits wide, using the ALU designed. We will use the Zero detection output of the ALU shortly to implement branches. The overflow output will not be needed we will need a unit to sign-extend the 16‑bit offset field in the instruction to a 32‑bit signed value, and a data memory unit to read from or write to. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory. The beq instruction has three operands, two registers that are compared for equality, and a 16‑bit offset used to compute the branch target address relative to the branch instruction address. Its form is beq $t1,$t2,offset. To implement this instruction, we must compute the branch target address by adding the signextended offset field of the instruction to the PC. 
The instruction set architecture specifies that the base for the branch address calculation is the address of the instruction following the branch. Since we compute PC + 4 (the address of the next instruction) in the instruction fetch datapath, it is easy to use this value as the base for computing the branch target address. The architecture also states that the offset field is shifted left 2 bits so that it is a word offset; this shift increases the effective range of the offset field by a factor of 4. FIGURE 3.8 The two units needed to implement loads and stores, in addition to the register file and ALU of Figure 3.7 The data memory unit and the sign extension unit. The memory unit is a state element with inputs for the address and the write data, and a single output for the read result. There are separate read and write controls, although only one of these may be asserted on any given clock. The memory unit needs a read signal, since, unlike the register file, reading the value f an invalid address can cause problems FIGURE 3.9 The data path for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2 bits. The unit labelled Shift left 2 is simply a routing of the signals between input and output that adds 00two to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of the ―shift‖ is constant. Since we know that the offset was signextended from 16 bits, the shift will throw away only ―sign bits.‖ Control logic is used to decide whether the incremented PC or branch target should replace the PC, based on the Zero output of the ALU. Creating a Single Datapath This simplest datapath will attempt to execute all instructions in one clock cycle. This means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. We therefore need a memory for instructions separate from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows. To share a datapath element between two different instruction classes, we may need to allow multiple connections to the input of an element, using a multiplexor and control signal to select among the multiple inputs. Building a Datapath The operations of arithmetic-logical (or R-type) instructions and the memory instructions datapath are quite similar. The key differences are the following: The arithmetic-logical instructions use the ALU, with the inputs coming from the two registers. The memory instructions can also use the ALU to do the address calculation, although the second input is the sign-extended 16-bit offset field from the instruction. The value stored into a destination register comes from the ALU (for an R-type instruction) or the memory (for a load). To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file. FIGURE 3.10 The datapath for the memory instructions and the R-type instructions. 
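Before leaving the datapath, the branch-target arithmetic described above (sign-extend the 16-bit offset, shift it left 2 bits, and add it to PC + 4) can be written out directly. This is a small sketch, with the helper names chosen here only for illustration:

def sign_extend16(value):
    """Sign-extend a 16-bit field to a full signed value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """Branch target = (PC + 4) + (sign-extended offset << 2), kept to 32 bits."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF

# a beq at address 40 with offset field 7 branches to 40 + 4 + 7*4 = 72
print(branch_target(40, 7))         # 72
print(branch_target(100, 0xFFFF))   # offset -1: 100 + 4 - 4 = 100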
3.3 CONTROL IMPLEMENTATION SCHEME Implementation Scheme We build this simple implementation using the datapath of the last section and adding a simple control function. This simple implementation covers load word (lw), store word (sw), branch equal (beq), and the arithmetic-logical instructions add, sub, AND, OR, and set on less than. We will later enhance the design to include a jump instruction (j). The ALU Control The MIPS ALU defines the 6 following combinations of four control inputs: Depending on the instruction class, the ALU will need to perform one of these first five functions. (NOR is needed for other parts of the MIPS instruction set not found in the subset we are implementing.) For load word and store word instructions, we use the ALU to compute the memory address by addition. For the R-type instructions, the ALU needs to perform one of the five actions (AND, OR, subtract, add, or set on less than), depending on the value of the 6‑bit funct (or function) field in the low-order bits of the instruction. For branch equal, the ALU must perform a subtraction. We can generate the 4‑bit ALU control input using a small control unit that has as inputs the function field of the instruction and a 2‑bit control field, which we call ALUOp. ALUOp indicates whether the operation to be performed should be add (00) for loads and stores, subtract (01) for beq, or determined by the operation encoded in the funct field (10). The output of the ALU control unit is a 4‑bit signal that directly controls the ALU by generating one of the 4‑bit combinations we show how to set the ALU control inputs based on the 2‑bit ALUOp control and the 6‑bit function code. Later in this chapter we will see how the ALUOp bits are generated from the main control unit. FIGURE 3.11 How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction. The opcode, listed in the first column, determines the setting of the ALUOp bits. All the encodings are shown in binary. Notice that when the ALUOp code is 00 or 01, the desired ALU action does not depend on the function code field; in this case, we say that we ―don‘t care‖ about the value of the function code, and the funct field is shown as XXXXXX. When the ALUOp value is 10, then the function code is used to set the ALU control input. This style of using multiple levels of decoding—that is, the main control unit generates the ALUOp bits, which then are used as input to the ALU control that generates the actual signals to control the ALU unit—is a common implementation technique. Using multiple levels of control can reduce the size of the main control unit. Using several smaller control units may also potentially increase the speed of the control unit. Such optimizations are important, since the speed of the control unit is often critical to clock cycle time. There are several different ways to implement the mapping from the 2‑bit ALUOp field and the 6‑bit funct field to the four ALU operation control bits. Because only a small number of the 64 possible values of the function field are of interest and the function field is used only when the ALUOp bits equal 10, we can use a small piece of logic that recognizes the subset of possible values and causes the correct setting of the ALU control bits. Create a truth table for the interesting combinations of the function code field and the ALUOp bits, as we‘ve done in Figure 3.12; this truth table shows how the 4‑bit ALU control is set depending on these two input fields. 
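In code form, this two-level decoding might look like the sketch below: ALUOp = 00 forces an add, 01 forces a subtract, and 10 hands the decision to the funct field. The 4-bit control encodings and funct values used here are the standard MIPS ones (AND = 0000, OR = 0001, add = 0010, subtract = 0110, slt = 0111); the dictionary and the function name alu_control are choices made for this illustration.

FUNCT_TO_CONTROL = {
    0b100000: 0b0010,   # add
    0b100010: 0b0110,   # subtract
    0b100100: 0b0000,   # AND
    0b100101: 0b0001,   # OR
    0b101010: 0b0111,   # set on less than
}

def alu_control(alu_op, funct):
    """Generate the 4-bit ALU control from the 2-bit ALUOp and the 6-bit funct field."""
    if alu_op == 0b00:              # lw / sw: address calculation
        return 0b0010               # add
    if alu_op == 0b01:              # beq: comparison
        return 0b0110               # subtract
    return FUNCT_TO_CONTROL[funct]  # R-type: decode the funct field

print(bin(alu_control(0b00, 0)))         # load/store -> 0b10 (add)
print(bin(alu_control(0b10, 0b101010)))  # R-type slt -> 0b111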
Since the full truth table is very large (2^8 = 256 entries) and we do not care about the value of the ALU control for many of these input combinations, we show only the truth table entries for which the ALU control must have a specific value. Throughout this chapter, we will use this practice of showing only the truth table entries for outputs that must be asserted, and not showing those that are all deasserted or are don't cares. Because in many instances we do not care about the values of some of the inputs, and because we wish to keep the tables compact, we also include don't-care terms. A don't-care term in this truth table (represented by an X in an input column) indicates that the output does not depend on the value of the input corresponding to that column. For example, when the ALUOp bits are 00, as in the first row of Figure 3.12, we always set the ALU control to 0010, independent of the function code. In this case, the function code inputs will be don't cares in this line of the truth table. Once the truth table has been constructed, it can be optimized and then turned into gates.
FIGURE 3.12 The truth table for the 4 ALU control bits (called Operation). The inputs are the ALUOp and function code field. Only the entries for which the ALU control is asserted are shown. Some don't-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. Note that when the function field is used, the first 2 bits (F5 and F4) of these instructions are always 10, so they are don't-care terms and are replaced with XX in the truth table.
Designing the Main Control Unit
FIGURE The three instruction classes (R-type, load and store, and branch) use two different instruction formats. The jump instructions use another format, which we will discuss shortly. (a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three register operands: rs, rt, and rd. Fields rs and rt are sources, and rd is the destination. The ALU function is in the funct field and is decoded by the ALU control design in the previous section. The R-type instructions that we implement are add, sub, AND, OR, and slt. The shamt field is used only for shifts; we will ignore it in this chapter. (b) Instruction format for load (opcode = 35ten) and store (opcode = 43ten) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, and added to PC + 4 to compute the branch target address.
There are several major observations about this instruction format that we will rely on:
• The op field, also called the opcode, is always contained in bits 31:26. We will refer to this field as Op[5:0].
• The two registers to be read are always specified by the rs and rt fields, at positions 25:21 and 20:16. This is true for the R-type instructions, branch equal, and store.
• The base register for load and store instructions is always in bit positions 25:21 (rs).
• The 16-bit offset for branch equal, load, and store is always in positions 15:0.
• The destination register is in one of two places.
For a load it is in bit positions 20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd). Thus, we will need to add a multiplexor to select which field of the instruction is used to indicate the register number to be written. Figure 3.14 shows these additions plus the ALU control block, the write signals for state elements, the read signal for the data memory, and the control signals for the multiplexors. Since all the multiplexors have two inputs, they each require a single control line. Figure 3.14 shows seven single bit control lines plus the 2‑bit ALUOp control signal. We have already defined how the ALUOp control signal works, and it is useful to define what the seven other control signals do informally before we determine how to set these control signals during instruction execution FIGURE 3.14 The data path of with all necessary multiplexors and all control lines identified. The control lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written once at the end of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address. FIGURE 3.15 The effect of each of the seven control signals. When the 1‑bit control to a two way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes. Gating the clock externally to a state element can create timing problems. Operation of the Datapath FIGURE 3.16 The simple datapath with the control unit. The input to the control unit is the 6‑bit opcode field from the instruction. The outputs of the control unit consist of three 1‑bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg), three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1‑bit signal used in determining whether to possibly branch (Branch), and a 2‑bit control signal for the ALU (ALUOp). An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures. FIGURE 3.17 The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt). For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set. Furthermore, an R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R‑type instructions is set to 10 to indicate that the ALU control should be generated from the funct field. The second and third rows of this table give the control signal settings for lw and sw. These ALUSrc and ALUOp fields are set to perform the address calculation. The MemRead and MemWrite are set to perform the memory access. 
Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register. The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALU control = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don‘t care. Don‘t cares can also be added to RegDst when RegWrite is 0. This type of don‘t care must be added by the designer, since it depends on knowledge of how the datapath works. FIGURE 3.18 The datapath in operation for an R-type instruction, such as add $t1,$t2,$t3. The control lines, datapath units, and connections that are active are highlighted. Figure 3.18 shows the operation of the datapath for an R-type instruction, such as add $t1,$t2,$t3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information: 1. The instruction is fetched, and the PC is incremented. 2. Two registers, $t2 and $t3, are read from the register file; also, the main control unit computes the setting of the control lines during this step. 3. The ALU operates on the data read from the register file, using the function code (bits 5:0, which is the funct field, of the instruction) to generate the ALU function. 4. The result from the ALU is written into the register file using bits 15:11 of the instruction to select the destination register ($t1). Similarly, we can illustrate the execution of a load word, such as lw $t1, offset($ t2) in a style similar to Figure 3.18. Figure 3.19 shows the active functional units and asserted control lines for a load. We can think of a load instruction as operating in five steps (similar to the R-type executed in four): 1. An instruction is fetched from the instruction memory, and the PC is incremented. 2. A register ($t2) value is read from the register file. 3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of the instruction (offset). 4. The sum from the ALU is used as the address for the data memory. 5. The data from the memory unit is written into the register file; the register destination is given by bits 20:16 of the instruction ($t1) . FIGURE 3.19 The datapath in operation for a load instruction. The control lines, datapath units, and connections that are active are highlighted. A store instruction would operate very similarly. The main difference would be that the memory control would indicate a write rather than a read, the second register value read would be used for the data to store, and the operation of writing the data memory value to the register file would not occur. Finally, we can show the operation of the branch-on-equal instruction, such as beq $t1,$t2,offset, in the same fashion. It operates much like an R‑format instruction, but the ALU output is used to determine whether the PC is written with PC + 4 or the branch target address. The four steps in execution are: 1. An instruction is fetched from the instruction memory, and the PC is incremented. 2. Two registers, $t1 and $t2, are read from the register file. 3. The ALU performs a subtract on the data values read from the register file. 
The value of PC + 4 is added to the sign-extended, lower 16 bits of the instruction (offset) shifted left by two; the result is the branch target address. 4. The Zero result from the ALU is used to decide which adder result to store into the PC.
Finalizing Control
FIGURE 3.20 The control function for the simple single-cycle implementation is completely specified by this truth table. The top half of the table gives the combinations of input signals that correspond to the four opcodes, one per column, that determine the control output settings. (Remember that Op[5:0] corresponds to bits 31:26 of the instruction, which is the op field.) The bottom portion of the table gives the outputs for each of the four opcodes. Thus, the output RegWrite is asserted for two different combinations of the inputs. If we consider only the four opcodes shown in this table, then we can simplify the truth table by using don't cares in the input portion. For example, we can detect an R-format instruction with the expression (NOT Op5) · (NOT Op2), since this is sufficient to distinguish the R-format instructions from lw, sw, and beq. We do not take advantage of this simplification, since the rest of the MIPS opcodes are used in a full implementation.
3.4 PIPELINING
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. MIPS instructions classically take five steps:
1. Fetch the instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS instructions allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
The pipelined approach takes much less time. The pipelining paradox is that the time from placing a single dirty sock in the washer until it is dried, folded, and put away is no shorter with pipelining; the reason pipelining is faster for many loads is that everything works in parallel, so more loads are finished per hour. Pipelining improves the throughput of our laundry system. Hence, pipelining does not decrease the time to complete one load of laundry, but when we have many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.
Ann, Brian, Cathy, and Don each have dirty clothes to be washed, dried, folded, and put away. The washer, dryer, "folder," and "storer" each take 30 minutes for their task. Sequential laundry takes 8 hours for 4 loads of wash, while pipelined laundry takes just 3.5 hours. We show the pipeline stage of different loads over time by showing copies of the four resources on this two-dimensional time line, but we really have just one of each resource. If all the stages take about the same amount of time and there is enough work to do, then the speed-up due to pipelining is equal to the number of stages in the pipeline, in this case four: washing, drying, folding, and putting away. Therefore, pipelined laundry is potentially four times faster than nonpipelined: 20 loads would take about 5 times as long as 1 load, while 20 loads of sequential laundry take 20 times as long as 1 load. The four-load example above, however, is only 2.3 times faster (8 hours versus 3.5 hours). Notice that at the beginning and end of the workload in the pipelined version the pipeline is not completely full; this start-up and wind-down affects performance when the number of tasks is not large compared with the number of stages in the pipeline.
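The start-up and wind-down effect is easy to quantify: with k stages of equal length t, n tasks take (k + n - 1)·t when pipelined versus k·n·t sequentially. A short check of the laundry numbers above (the function name pipeline_speedup is just for this sketch):

def pipeline_speedup(stages, tasks, stage_time):
    """Compare sequential and pipelined completion time for equal-length stages."""
    sequential = stages * tasks * stage_time
    pipelined = (stages + tasks - 1) * stage_time   # fill the pipe, then one task per stage time
    return sequential, pipelined, sequential / pipelined

# 4 laundry stages of 30 minutes each
print(pipeline_speedup(4, 4, 30))     # 480 vs 210 minutes: 8 hours vs 3.5 hours, about 2.3x
print(pipeline_speedup(4, 20, 30))    # 20 loads: speed-up about 3.5x
print(pipeline_speedup(4, 1000, 30))  # many loads: speed-up very close to 4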
If the number of loads is much larger than 4, then the stages will be full most of the time and the increase in throughput will be very close to 4. Total time for each instruction calculated from the time for each component. This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay. If the stages are perfectly balanced, then the time between instructions on the pipelined processor—assuming ideal conditions—is equal to Time between instructions pipelined = Time between instructions nonpipelined / No of pipe stages Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages; a five-stage pipeline is nearly five times faster. The formula suggests that a five-stage pipeline should offer nearly a fivefold improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle. The example shows, however, that the stages may be imperfectly balanced. Thus, the time per instruction in the pipelined processor will exceed the minimum possible, and speed-up will be less than the number of pipeline stages. FIGURE 3.21 Single-cycle, nonpipelined execution in top versus pipelined execution in bottom. In this case, we see a fourfold speed-up on average time between instructions, from 800 ps down to 200 ps. For the laundry, we assumed all stages were equal. If the dryer were slowest, then the dryer stage would set the stage time. The pipeline stage times of a computer are also limited by the slowest resource, either the ALU operation or the memory access. We assume the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction, but instruction throughput is the important metric because real programs execute billions of instructions. Designing Instruction Sets for Pipelining First, all MIPS instructions are the same length. This restriction makes it much easier to fetch instructions in the first pipeline stage and to decode them in the second stage. In an instruction set like the x86, where instructions vary from 1 byte to 17 bytes, pipelining is considerably more challenging. Recent implementations of the x86 architecture actually translate x86 instructions into simple operations that look like MIPS instructions and then pipeline the simple operations rather than the native x86 instructions! Second, MIPS has only a few instruction formats, with the source register fields being located in the same place in each instruction. This symmetry means that the second stage can begin reading the register file at the same time that the hardware is determining what type of instruction was fetched. If MIPS instruction formats were not symmetric, we would need to split stage 2, resulting in six pipeline stages. Third, memory operands only appear in loads or stores in MIPS. This restriction means we can use the execute stage to calculate the memory address and then access memory in the following stage. If we could operate on the operands in memory, as in the x86, stages 3 and 4 would expand to an address stage, memory stage, and then execute stage. Pipeline Hazards There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards, and there are three different types. 
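Before turning to the individual hazards, the effect of imperfect balance can also be checked numerically. The stage latencies below (200, 100, 200, 200, and 100 ps for fetch, register read, ALU, data access, and write-back) are assumed here only to reproduce the 800 ps and 200 ps figures quoted above:

# single-cycle: every instruction takes the sum of all stage latencies
# pipelined: the clock must accommodate the slowest stage
stage_latency = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}   # picoseconds

single_cycle_time = sum(stage_latency.values())    # 800 ps per instruction
pipelined_cycle = max(stage_latency.values())      # 200 ps between instructions (steady state)

print(single_cycle_time, pipelined_cycle)           # 800 200
print(single_cycle_time / pipelined_cycle)          # 4.0, not 5, because the stages are unbalanced
print(single_cycle_time / len(stage_latency))       # 160 ps would be the perfectly balanced ideal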
Structural Hazards The first hazard is called a structural hazard. It means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. A structural hazard in the laundry room would occur if we used a washer dryer combination instead of a separate washer and dryer, or if our roommate was busy doing something else and wouldn‘t put clothes away. Our carefully scheduled pipeline plans would then be foiled. Data Hazards Data hazards occur when the pipeline must be stalled because one step must wait for another to complete. Suppose you found a sock at the folding station for which no match existed. One possible strategy is to run down to your room and search through your clothes bureau to see if you can find the match. Obviously, while you are doing the search, loads that have completed drying and are ready to fold and those that have finished washing and are ready to dry must wait. In a computer pipeline, data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline (a relationship that does not really exist when doing laundry). For example, suppose we have an add instruction followed immediately by a subtract instruction that uses the sum ($s0): add $s0, $t0, $t1 sub $t2, $s0, $t3 Without intervention, a data hazard could severely stall the pipeline. The add instruction doesn‘t write its result until the fifth stage, meaning that we would have to waste three clock cycles in the pipeline. Control Hazards The third type of hazard is called a control hazard, arising from the need to make a decision based on the results of one instruction while others are executing. Suppose our laundry crew was given the happy task of cleaning the uniforms of a football team. Given how filthy the laundry is, we need to determine whether the detergent and water temperature setting we select is strong enough to get the uniforms clean but not so strong that the uniforms wear out sooner. In our laundry pipeline, we have to wait until the second stage to examine the dry uniform to see if we need to change the washer setup or not. A more sophisticated version of branch prediction would have some branches predicted as taken and some as untaken. In our analogy, the dark or home uniforms might take one formula while the light or road uniforms might take another. In the case of programming, at the bottom of loops are branches that jump back to the top of the loop. Since they are likely to be taken and they branch backward, we could always predict taken for branches that jump to an earlier address. 3.5 PIPELINED DATA PATH AND CONTROL The division of an instruction into five stages : Five-stage pipeline, which in turn means that up to five instructions will be in execution during any single clock cycle. Thus, we must separate the datapath into five pieces, with each piece named corresponding to a stage of instruction execution: 1. IF: Instruction fetch 2. ID: Instruction decode and register file read 3. EX: Execution or address calculation 4. MEM: Data memory access 5. WB: Write back These five components correspond roughly to the way the data path is drawn; instructions and data move generally from left to right through the five stages as they complete execution. Returning to our laundry analogy, clothes get cleaner, drier, and more organized as they move through the line, and they never move backward. 
There are, however, two exceptions to this left-to-right flow of instructions:  The write-back stage, which places the result back into the register file in the middle of the datapath  The selection of the next value of the PC, choosing between the incremented PC and the branch address from the MEM stage Data flowing from right to left does not affect the current instruction; only later instructions in the pipeline are influenced by these reverse data movements. Note that the first right-to-left flow of data can lead to data hazards and the second leads to control hazards. One way to show what happens in pipelined execution is to pretend that each instruction has its own datapath, and then to place these datapaths on a timeline to show their relationship. FIGURE 3.22 The single-cycle datapath Each step of the instruction can be mapped onto the data path from left to right. The only exceptions are the update of the PC and the write-back step, shown in colour, which sends either the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are data lines.) FIGURE 3.23 Instructions being executed using the single-cycle Each stage is labeled by the physical resource used in that stage, corresponding to the portions of the datapath. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain proper time order, this stylized datapath breaks the register file into two logical parts: registers read during register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and the unshaded right half in dashed lines in the WB stage, when it is not being read. FIGURE 3.24 The pipelined version of the datapath The pipeline registers, in colour, separate each pipeline stage. They are labelled by the stages that they separate; for example, the first is labelled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively. 1. Instruction fetch: The top portion of Figure shows the instruction being read from memory using the address in the PC and then being placed in the IF/ID pipeline register. The PC address is incremented by 4 and then written back into the PC to be ready for the next clock cycle. This incremented address is also saved in the IF/ID pipeline register in case it is needed later for an branch instruction, such as beq. 2. Instruction decode and register file read: The bottom portion of Figure shows the instruction portion of the IF/ID pipeline register supplying the 16-bit immediate field of the load instruction( here in our example is 32), which is sign-extended to 32 bits, and the register numbers to read the two registers ( $s0 and $t0). All three values are stored in the ID/EX pipeline register, along with the incremented PC address. 3. 
Execute or address calculation: The figure shows that the load instruction reads the contents of the base register ($t0) and the sign-extended immediate (the value 32 in our example) from the ID/EX pipeline register and adds them using the ALU. That sum is placed in the EX/MEM pipeline register.
4. Memory access: The top portion of the figure shows the load instruction reading the data memory using the address from the EX/MEM pipeline register and loading the data into the MEM/WB pipeline register.
5. Write-back: The bottom portion of the figure shows the final step: reading the data from the MEM/WB pipeline register and writing it into the register file ($s0) in the middle of the figure.
FIGURE 3.25 EX: The third pipe stage of a store instruction.
Pipeline Control
FIGURE 3.26 The pipelined datapath with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the immediate field, since sign extension leaves these bits unchanged.
FIGURE 3.27 The function of each of the seven control signals is defined. The ALU control lines (ALUOp) are defined in the second column. When a 1-bit control to a 2-way multiplexor is asserted, the multiplexor selects the input corresponding to 1; otherwise, if the control is deasserted, the multiplexor selects the 0 input. Note that PCSrc is controlled by an AND gate: if the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq instruction; otherwise, PCSrc is set to 0.
To specify control for the pipeline, we need only set the control values during each pipeline stage. Because each control line is associated with a component active in only a single pipeline stage, we can divide the control lines into five groups according to the pipeline stage.
FIGURE 3.28
1. Instruction fetch: The control signals to read instruction memory and to write the PC are always asserted, so there is nothing special to control in this pipeline stage.
2. Instruction decode/register file read: As in the previous stage, the same thing happens at every clock cycle, so there are no optional control lines to set.
3. Execution/address calculation: The signals to be set are RegDst, ALUOp, and ALUSrc. They select the Result register, the ALU operation, and either Read data 2 or a sign-extended immediate as the second ALU input.
4. Memory access: The control lines set in this stage are Branch, MemRead, and MemWrite. These signals are set by the branch equal, load, and store instructions, respectively. Recall that PCSrc selects the next sequential address unless control asserts Branch and the ALU result was 0.
5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.
Since pipelining the datapath leaves the meaning of the control lines unchanged, we can use the same control values as before.
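The five-step walk of a load through the pipeline registers can be mimicked with a toy model. Everything below is a sketch under simplifying assumptions (a dictionary per pipeline register, the example lw $s0, 32($t0) from the text, and made-up register and memory contents), meant only to show what travels through IF/ID, ID/EX, EX/MEM, and MEM/WB:

# toy machine state for the example lw $s0, 32($t0)
regs = {"$t0": 0x1000, "$s0": 0}
data_memory = {0x1020: 99}           # word stored at 0x1000 + 32

# IF: fetch the instruction and save PC + 4 in IF/ID
if_id = {"instr": ("lw", "$s0", 32, "$t0"), "pc_plus_4": 0x0040_0004}

# ID: read the base register and sign-extend the immediate into ID/EX
op, rt, imm, rs = if_id["instr"]
id_ex = {"read_data_1": regs[rs], "imm": imm, "dest": rt}

# EX: the ALU adds base register and immediate; the sum goes into EX/MEM
ex_mem = {"alu_result": id_ex["read_data_1"] + id_ex["imm"], "dest": id_ex["dest"]}

# MEM: use the ALU result as the data-memory address; the loaded word goes into MEM/WB
mem_wb = {"read_data": data_memory[ex_mem["alu_result"]], "dest": ex_mem["dest"]}

# WB: write the loaded value back into the register file
regs[mem_wb["dest"]] = mem_wb["read_data"]
print(regs)                          # {'$t0': 4096, '$s0': 99}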
3.6 HANDLING DATA HAZARDS & CONTROL HAZARDS
DATA HAZARDS
Consider this sequence:
sub $2, $1, $3     # Register $2 written by sub
and $12, $2, $5    # 1st operand ($2) depends on sub
or  $13, $6, $2    # 2nd operand ($2) depends on sub
add $14, $2, $2    # 1st ($2) & 2nd ($2) depend on sub
sw  $15, 100($2)   # Base ($2) depends on sub
The last four instructions are all dependent on the result that the first instruction leaves in register $2. Suppose register $2 holds the value 10 before the subtract instruction and -20 after it; the intent of the program is that the remaining four instructions use the new value -20 from register $2.
FIGURE 3.30 shows the dependence of each instruction on the first instruction (sub) and on the result it stores in register $2. Without any intervention, the and and or instructions would read the stale value 10 (the value of $2 before the sub executes). Only add and sw would read the correct value -20, since they read $2 at or after clock cycle 5 (CC5), by which time the register file has been updated.
TO DETECT AND FORWARD WITHOUT STALL:
To avoid a stall, the result can be forwarded to the and and or instructions from CC3, where it is available at the end of the sub instruction's EX stage (as shown in the diagram above). To detect the data hazard, the dependence of a source register on the destination register of a previous instruction is found by checking the following conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
When the result is available in CC3 and one of these conditions is true, the hazard is detected and the result is forwarded to the dependent instruction at the end of the EX stage (CC3) of the current instruction. In this example, condition 1a is true for the and instruction:
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2.
Similarly, when the result is available in CC4 and one of the following conditions is true, the hazard is detected and the result is forwarded at the end of the MEM stage (CC4) of the current instruction:
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
In this example, condition 2b is true for the or instruction: MEM/WB.RegisterRd = ID/EX.RegisterRt = $2, since $2 is the second operand of the or instruction.
These four conditions apply only when the earlier instruction actually writes a register, that is, when its RegWrite signal is asserted, and only when its destination register is not $0, because register $0 is hard-wired to zero and must never appear to supply a forwarded value.
To forward/bypass the result in CC3 (end of the EX stage): either condition 1a with EX/MEM.RegisterRd ≠ 0 must be true, or condition 1b with EX/MEM.RegisterRd ≠ 0 must be true.
To forward/bypass the result in CC4 (end of the MEM stage): either condition 2a with MEM/WB.RegisterRd ≠ 0 must be true, or condition 2b with MEM/WB.RegisterRd ≠ 0 must be true.
All of these conditions are checked by a special piece of hardware called the forwarding unit; refer to the figure below.
Figure 3.31
TO INTRODUCE A STALL WHEN FORWARDING FAILS:
Now consider a sequence in which the first instruction is a load and the following instruction needs the loaded value in its EX stage. Forwarding is not possible here, because the data would have to travel backward in time. The following diagram shows the need for a stall. As before, the hazard is detected by comparing the source register fields in the IF/ID register with the destination register field in the ID/EX register; if a register number matches and the instruction in ID/EX is a load, a stall is introduced.
STEPS TO INTRODUCE THE STALL:
1. Force the control values in the ID/EX register to 0, so that the EX, MEM and WB stages perform a nop (no operation); in this case the stall (bubble) is inserted in place of the second instruction.
2. Prevent the update of the PC and of the IF/ID register, so that the third instruction is not loaded into the pipeline immediately.
3. As a result, the same instruction is decoded again (the second instruction, in our example) and the following instruction is fetched again (the third instruction). One cycle later the load has reached its MEM stage, so its result can now be forwarded to the second instruction.
Figure 3.32 A pipelined sequence of instructions. Since the dependence between the load and the following instruction (and) goes backward in time, this hazard cannot be solved by forwarding. Hence, this combination must result in a stall inserted by the hazard detection unit.
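The decisions made by the forwarding unit and the hazard detection unit described above can be sketched in C as follows. The struct layout and function names are illustrative only (they are not taken from the text or any real implementation), but the comparisons follow conditions 1a/1b and 2a/2b and the load-use test just given.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative snapshot of the pipeline-register fields the two units inspect.
   Field names follow the ID/EX, EX/MEM and MEM/WB notation used above. */
typedef struct {
    uint8_t idex_rs, idex_rt;       /* source registers of the instruction in EX        */
    uint8_t exmem_rd, memwb_rd;     /* destination registers of the two older instructions */
    bool    exmem_regwrite, memwb_regwrite;
    bool    idex_memread;           /* the instruction in EX is a load                   */
    uint8_t ifid_rs, ifid_rt;       /* source registers of the instruction in ID         */
} PipelineRegs;

/* Forwarding unit: choose where each ALU operand comes from.
   0 = register file, 2 = forward from EX/MEM (CC3), 1 = forward from MEM/WB (CC4). */
void forwarding_unit(const PipelineRegs *p, int *forwardA, int *forwardB) {
    *forwardA = *forwardB = 0;
    /* EX hazard: conditions 1a / 1b, with RegWrite asserted and Rd != $0 */
    if (p->exmem_regwrite && p->exmem_rd != 0) {
        if (p->exmem_rd == p->idex_rs) *forwardA = 2;   /* 1a */
        if (p->exmem_rd == p->idex_rt) *forwardB = 2;   /* 1b */
    }
    /* MEM hazard: conditions 2a / 2b, applied only if the more recent
       EX/MEM result did not already supply the operand */
    if (p->memwb_regwrite && p->memwb_rd != 0) {
        if (*forwardA == 0 && p->memwb_rd == p->idex_rs) *forwardA = 1;  /* 2a */
        if (*forwardB == 0 && p->memwb_rd == p->idex_rt) *forwardB = 1;  /* 2b */
    }
}

/* Hazard detection unit: stall one cycle on a load-use dependence. */
bool must_stall(const PipelineRegs *p) {
    return p->idex_memread &&
           (p->idex_rt == p->ifid_rs || p->idex_rt == p->ifid_rt);
}

When must_stall() returns true, the control values in ID/EX are forced to 0 and the PC and IF/ID writes are disabled, which are exactly the stall steps listed above.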
Control Hazards
Control hazards can cause a greater performance loss for the pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken. If instruction i is a taken branch, then the PC is normally not changed until the end of the MEM stage, after the completion of the address calculation and the comparison.
Figure 3.33 The impact of the pipeline on the branch instruction. The numbers to the left of the instructions (40, 44, . . .) are the addresses of the instructions. Since the branch instruction decides whether to branch in the MEM stage—clock cycle 4 for the beq instruction above—the three sequential instructions that follow the branch will be fetched and begin execution. Without intervention, those three following instructions will begin execution before beq branches to the lw at location 72.
One way to reduce this penalty is to resolve the branch earlier, during ID. During ID, we must decode the instruction, decide whether a bypass to the equality unit is needed, and complete the equality comparison, so that if the instruction is a branch, we can set the PC to the branch target address. Forwarding for the operands of branches was formerly handled by the ALU forwarding logic, but the introduction of the equality test unit in ID requires new forwarding logic. Note that the bypassed source operands of a branch can come from either the EX/MEM or the MEM/WB pipeline latches.
Because the values in a branch comparison are needed during ID but may be produced later in time, a data hazard can occur and a stall may be needed. For example, if an ALU instruction immediately preceding a branch produces one of the operands for the comparison, a stall will be required, since the EX stage of the ALU instruction occurs after the ID cycle of the branch. By extension, if a load is immediately followed by a conditional branch that depends on the load result, two stall cycles will be needed, because the result of the load appears at the end of the MEM cycle but is needed at the beginning of ID by the branch.
3.7 EXCEPTIONS
Exceptions: The problem is that an instruction in the pipeline can raise an exception that may force other instructions in the pipeline to be aborted. We use the term exception to refer to any unexpected change in control flow, without distinguishing whether the cause is internal or external; we use the term interrupt only when the event is externally caused.
Here are five examples showing whether the situation is internally generated by the processor or externally generated: How Exceptions Are Handled in the MIPS Architecture The two types of exceptions that our current implementation can generate are execution of an undefined instruction and an arithmetic overflow. We‘ll use arithmetic overflow in the instruction add $1, $2, $1 as the example exception in the next few pages. The basic action that the processor must perform when an exception occurs is to save the address of the offending instruction in the exception program counter (EPC) and then transfer control to the operating system at some specified address. In a vectored interrupt, the address to which control is transferred is determined by the cause of the exception. For example, to accommodate the two exception types listed above, we might define the following two exception vector addresses: The operating system knows the reason for the exception by the address at which it is initiated. The addresses are separated by 32 bytes or eight instructions, and the operating system must record the reason for the exception and may perform some limited processing in this sequence. When the exception is not vectored, a single entry point for all exceptions can be used, and the operating system decodes the status register to find the cause. We can perform the processing required for exceptions by adding a few extra registers and control signals to our basic implementation and by slightly extending control. Let‘s assume that we are implementing the exception system used in the MIPS architecture, with the single entry point being the address 8000 0180hex. (Implementing vectored exceptions is no more difficult.) We will need to add two additional registers to the MIPS implementation:   EPC: A 32‑bit register used to hold the address of the affected instruction. (Such a register is needed even when exceptions are vectored.) Cause: A register used to record the cause of the exception. In the MIPS architecture, this register is 32 bits, although some bits are currently unused. Assume there is a five-bit field that encodes the two possible exceptions sources mentioned above, with 10 representing an undefined instruction and 12 representing arithmetic overflow. Exceptions in a Pipelined Implementation A pipelined implementation treats exceptions as another form of control hazard. For example, suppose there is an arithmetic overflow in an add instruction. Just as we did for the taken branch in the previous section, we must flush the instructions that follow the add instruction from the pipeline and begin fetching instructions from the new address. We will use the same mechanism we used for taken branches, but this time the exception causes the deasserting of control lines. FIGURE 3.34 The datapath with controls to handle exceptions. The key additions include a new input with the value 8000 0180hex in the multiplexor that supplies the new PC value; a Cause register to record the cause of the exception; and an Exception PC register to save the address of the instruction that caused the exception. The 8000 0180hex input to the multiplexor is the initial address to begin fetching instructions in the event of an exception. Although not shown, the ALU overflow signal is an input to the control unit. 
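The actions described above can be condensed into a minimal C sketch. The constants follow the values quoted in the text (single entry point 8000 0180hex, cause codes 10 and 12), while the struct and function names are invented purely for illustration; flushing the younger instructions from the pipeline is noted in a comment but not modelled.

#include <stdint.h>

#define EXCEPTION_ENTRY    0x80000180u   /* single MIPS exception entry point       */
#define CAUSE_UNDEF_INSTR  10u           /* undefined instruction (from the text)   */
#define CAUSE_OVERFLOW     12u           /* arithmetic overflow   (from the text)   */

/* Illustrative architectural state touched when an exception is taken. */
typedef struct {
    uint32_t pc;      /* program counter                                      */
    uint32_t epc;     /* Exception PC: address of the offending instruction   */
    uint32_t cause;   /* Cause register: encodes the reason for the exception */
} CpuState;

/* Sketch of what the pipeline control does when, say, the add in EX overflows:
   save the address of the offending instruction, record the cause, flush the
   instructions behind it (not shown), and fetch from the exception entry point. */
void raise_exception(CpuState *cpu, uint32_t offending_pc, uint32_t cause_code) {
    cpu->epc   = offending_pc;     /* so the OS can report or restart the instruction */
    cpu->cause = cause_code;       /* e.g. CAUSE_OVERFLOW for add $1, $2, $1          */
    cpu->pc    = EXCEPTION_ENTRY;  /* start fetching the operating system handler     */
}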
UNIT IV PARALLELISM
Instruction-level-parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors
4.1 Instruction-level-parallelism 4.2 Parallel processing challenges 4.3 Flynn's classification 4.4 Hardware multithreading 4.5 Multicore processors
4.1 INSTRUCTION-LEVEL-PARALLELISM
What is meant by Instruction-Level Parallelism?
Instruction-Level Parallelism: Concepts and Challenges: Instruction-level parallelism (ILP) is the potential to overlap the execution of instructions, using the pipeline concept, to improve the performance of the system. The various techniques used to increase the amount of parallelism reduce the impact of data and control hazards and increase the processor's ability to exploit parallelism.
There are two approaches to exploiting ILP:
1. Static technique – software (compiler) dependent
2. Dynamic technique – hardware dependent
The simplest and most common way to increase the amount of parallelism is loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:
for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];
CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI and thus increase the IPC (instructions per clock).
Various types of dependences in ILP.
Data Dependence and Hazards: To exploit instruction-level parallelism, we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls. If two instructions are dependent, they are not parallel and must be executed in order.
There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.
Data Dependences: An instruction j is data dependent on instruction i if either of the following holds:
• Instruction i produces a result that may be used by instruction j, or
• Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program. For example, consider the following code sequence, which increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:
Loop: L.D    F0,0(R1)    ; F0 = array element
      ADD.D  F4,F0,F2    ; add scalar in F2
      S.D    F4,0(R1)    ; store result
      DADDUI R1,R1,#-8   ; decrement pointer by 8 bytes (one element)
      BNE    R1,R2,Loop  ; branch if R1 != R2
The dependences imply that there would be a chain of one or more data hazards between the dependent instructions. Executing such instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap.
Dependences are a property of programs. Whether a given dependence results in an actual hazard being detected, and whether that hazard actually causes a stall, are properties of the pipeline organization. This difference is critical to understanding how instruction-level parallelism can be exploited.
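As a worked instance of the pipeline CPI equation given above, suppose (purely for illustration; these stall contributions are assumed values, not measurements) that structural stalls add 0.05, data hazard stalls 0.20 and control stalls 0.15 cycles per instruction:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
             = 1 + 0.05 + 0.20 + 0.15 = 1.40
IPC = 1 / CPI = 1 / 1.40 ≈ 0.71
If forwarding removed all of the data hazard stalls, the CPI would fall to 1.20 and the IPC would rise to about 0.83; this is the sense in which reducing each term on the right-hand side increases the instruction-level parallelism actually achieved.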
The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall is a property of the pipeline. The importance of the data dependences is that a dependence (1) indicates the possibility of a hazard, (2) Determines the order in which results must be calculated, and (3) Sets an upper bound on how much parallelism can possibly be exploited. Name Dependences The name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order: • An anti dependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value. • An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j. Both anti-dependences and output dependences are name dependences, as opposed to true data dependences, since there is no value being transmitted between the instructions. Since a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. This renaming can be more easily done for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware. Before describing dependences arising from branches, let‘s examine the relationship between dependences and pipeline data hazards. Control Dependences: A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the dependence of the statements in the ―then‖ part of an if statement on the branch. For example, in the co de segment: if p1 { S1; }; if p2 { S2; } S1 is control dependent on p1, and S2is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences: 1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then-portion of an if-statement and move it before the if- statement. 2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if-statement and move it into the then-portion. Control dependence is preserved by two properties in a simple pipeline, First, instructions execute in program order. This ordering ensures that an instruction that occurs before a branch is executed before the branch. 
Second, the detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known. Data Hazard and various hazards in ILP. Data Hazards A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Because of the dependence, preserve order that the instructions would execute in, if executed sequentially one at a time as determined by the original source program. The goal of both our software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that necessary program order is preserved. Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i occurring before j in program order. The possible data hazards are RAW (read after write) — j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i. In the simple common five-stage static pipeline a load instruction followed by an integer ALU instruction that directly uses the load result will lead to a RAW hazard. WAW (write after write) — j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The classic five-stage integer pipeline writes a register only in the WB stage and avoids this class of hazards. WAR (write after read) — j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static issue pipelines even deeper pipelines or floating point pipelines because all reads are early (in ID) and all writes are late (in WB). A WAR hazard occurs either when there are some instructions that write results early in the instruction pipeline, and other instructions that read a source late in the pipeline or when instructions are reordered. 4.2 PARALLEL PROCESSING CHALLENGES AMDAHL’S LAW The execution time of the program after making the improvement is given by the following simple equation known as Amdahl‘s law: Execution time after improvement = (Execution time affected by improvement / Amount of improvement)+ Execution time unaffected For this problem: Execution time after improvement =( 80 seconds/n) + (100 − 80 seconds) Since we want the performance to be five times faster, the new execution time should be 20 seconds, giving 20 seconds =(80 seconds/n) + 20 seconds 0 = 80 seconds/n That is, there is no amount by which we can enhance-multiply to achieve a fivefold increase in performance, if multiply accounts for only 80% of the workload. The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. 
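The same conclusion can be seen from the usual speedup form of Amdahl's law. Writing f for the fraction of execution time that can be improved and s for the speedup of that fraction (here f = 0.8, since 80 of the 100 seconds are affected by the enhancement):
Overall speedup = 1 / ((1 - f) + f/s)
As s grows without bound, the overall speedup approaches 1 / (1 - f) = 1 / 0.2 = 5. A fivefold speedup is therefore only the unreachable limit when just 80% of the workload is improved, which is exactly the conclusion reached above.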
This concept also yields what we call the law of diminishing returns in everyday life. We can use Amdahl's law to estimate performance improvements when we know the time consumed by some function and its potential speedup. Amdahl's law, together with the CPU performance equation, is a handy tool for evaluating potential enhancements.
Speedup: The speed of a program is the time it takes the program to execute, which could be measured in any increment of time. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors). The formula for speedup is:
S = T(1)/T(j)
where T(j) is the time it takes to execute the program when using j processors. Efficiency is the speedup divided by the number of processors used.
4.3 FLYNN'S CLASSIFICATION
Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966. The classification system has stuck, and it has been used as a tool in the design of modern processors and their functionalities. Since the rise of multiprocessing central processing units (CPUs), a multiprogramming context has evolved as an extension of the classification system.
Instruction Cycle
The instruction cycle consists of the sequence of steps needed for the execution of an instruction in a program. A typical instruction in a program is composed of two parts: opcode and operand. The operand part specifies the data on which the specified operation is to be done (see Figure 1). The operand part is further divided into two parts: the addressing mode and the operand. The addressing mode specifies the method of determining the address of the actual data on which the operation is to be performed, and the operand part is used as an argument by that method in determining the actual address.
The control unit of the CPU fetches instructions in the program, one at a time. The fetched instruction is then decoded by the decoder, which is a part of the control unit, and the processor executes the decoded instructions. The result of execution is temporarily stored in the Memory Buffer Register (MBR), also called the Memory Data Register.
Instruction Stream and Data Stream
The term 'stream' refers to a sequence or flow of either instructions or data operated on by the computer. In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is established. This flow of instructions is called the instruction stream. Similarly, there is a bidirectional flow of operands between the processor and memory. This flow of operands is called the data stream. Thus, the sequence of instructions executed by the CPU forms the instruction stream, and the sequence of data (operands) required for execution of the instructions forms the data stream.
Flynn's four categories are:
• Single instruction, single data stream – SISD
• Single instruction, multiple data stream – SIMD
• Multiple instruction, single data stream – MISD
• Multiple instruction, multiple data stream – MIMD
                         Single instruction stream | Multiple instruction streams | Single program | Multiple programs
Single data stream     | SISD                      | MISD                         |                |
Multiple data streams  | SIMD                      | MIMD                         | SPMD           | MPMD
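The difference between the SISD and SIMD styles described below can be pictured with a toy C sketch. The lane count W = 4 is an assumption made only for this illustration; on a real SIMD machine the W lane operations in the inner loop would be performed simultaneously by W processing elements under a single broadcast instruction, rather than one after another as C executes them here.

#include <stdio.h>

#define N 8
#define W 4                      /* assumed SIMD lane count, for illustration only */

int main(void) {
    int x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[N] = {8, 7, 6, 5, 4, 3, 2, 1};

    /* SISD style: one instruction stream, one data item per operation. */
    for (int i = 0; i < N; i++)
        x[i] = x[i] + y[i];

    /* SIMD style: one instruction (the add below) conceptually applied to W data
       items at once; the inner loop stands in for W lock-stepped lanes. */
    for (int i = 0; i < N; i += W)
        for (int lane = 0; lane < W; lane++)   /* every lane performs the same operation */
            x[i + lane] = x[i + lane] + y[i + lane];

    for (int i = 0; i < N; i++)
        printf("%d ", x[i]);
    printf("\n");
    return 0;
}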
Single instruction stream, single data stream (SISD)
A sequential computer which exploits no parallelism in either the instruction or the data stream. A single control unit (CU) fetches a single instruction stream (IS) from memory. The CU then generates the appropriate control signals to direct a single processing element (PE) to operate on a single data stream (DS), i.e., one operation at a time. In this organisation, sequential execution of instructions is performed by one CPU containing a single processing element (PE), i.e., an ALU under one control unit, as shown. Therefore, SISD machines are conventional serial computers that process only one stream of instructions and one stream of data. This type of computer organisation is depicted in the diagram. Its characteristics are:
• Single processor
• Single instruction stream
• Data stored in a single memory
• Uni-processor operation
Examples of SISD architecture are the traditional uniprocessor machines like older personal computers (PCs; by 2010, many PCs had multiple cores) and mainframe computers.
Single instruction stream, multiple data streams (SIMD)
A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, for example an array processor or a graphics processing unit (GPU). In this organisation, multiple processing elements work under the control of a single control unit: there is one instruction stream and there are multiple data streams. All the processing elements of this organization receive the same instruction broadcast from the CU. Main memory can also be divided into modules for generating multiple data streams, acting as a distributed memory, as shown. Therefore, all the processing elements simultaneously execute the same instruction and are said to be 'lock-stepped' together. Each processor takes the data from its own memory, and hence it operates on a distinct data stream. (Some systems also provide a shared global memory for communications.) Every processor must be allowed to complete its instruction before the next instruction is taken for execution, so the execution of instructions is synchronous. Its characteristics are:
• A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
• Each processing element has an associated data memory
• Each instruction is executed on a different set of data by the different processing elements
• Typified by vector and array processors
Examples of SIMD organisation are ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and the Connection Machine (CM-1).
Multiple instruction streams, single data stream (MISD)
Multiple instructions operate on one data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer. Its characteristics are:
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence on that data
• Very few machines of this type have ever been built
In this organization, multiple processing elements are organised under the control of multiple control units. Each control unit handles one instruction stream and processes it through its corresponding processing element. But each processing element processes only a single data stream at a time. Therefore, for handling multiple instruction streams and a single data stream, multiple control units and multiple processing elements are organised in this classification.
All processing elements are interacting with the common shared memory for the organisation of single data stream as shown. The only known example of a computer capable of MISD operation is the C.mmp built by Carnegie-Mellon University. This type of computer organisation is denoted as: Is > 1 Ds = 1 This classification is not popular in commercial machines as the concept of single data streams executing on multiple processors is rarely applied. But for the specialized applications, MISD organisation can be very helpful. For example, Real time computers need to be fault tolerant where several processors execute the same data for producing the redundant data. This is also known as N- version programming. All these redundant data are compared as results which should be same; otherwise faulty unit is replaced. Thus MISD machines can be applied to fault tolerant real time computers. Multiple instruction streams, multiple data streams (MIMD) Multiple autonomous processors simultaneously executing different instructions on different data. MIMD architectures include multi-core superscalar processors, and distributed systems, using either one shared memory space or a distributed memory space. In this organization, multiple processing elements and multiple control units are organized as in MISD. But the difference is that now in this organization multiple instruction streams operate on multiple data streams. Therefore, for handling multiple instruction streams, multiple control units and multiple processing elements are organized such that multiple processing elements are handling multiple data streams from the Main memory as shown in. The processors work on their own data with their own instructions. Tasks executed by different processors can start or finish at different times. They are not lock-stepped, as in SIMD computers, but run asynchronously. This classification actually recognizes the parallel computer. That means in the real sense MIMD organisation is said to be a Parallel computer. All multiprocessor systems fall under this classification. Examples include; C.mmp, Burroughs D825, Cray-2, S1, Cray X- MP, HEP, Pluribus, IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1), FPS T/40000, iPSC. 
This type of computer organisation is denoted as: Is > 1, Ds > 1 Taxonomy of Parallel Processor Architectures MIMD – Overview    General purpose processors Each can process all instructions necessary Further classified by method of processor communication Tightly Coupled – SMP      Processors share memory Communicate via that shared memory Symmetric Multiprocessor (SMP) Share single memory or pool Shared bus to access memory  Memory access time to given area of memory is approximately the same for each processor Tightly Coupled - NUMA   Nonuniform memory access Access times to different regions of memory may differ Loosely Coupled - Clusters    Collection of independent uniprocessors or SMPs Interconnected to form a cluster Communication via fixed path or network connections Parallel Organizations – SISD Parallel Organizations – SIMD Parallel Organizations - MIMD Shared Memory Parallel Organizations – MIMD Distributed Memory Symmetric Multiprocessors            A stand alone computer with the following characteristics Two or more similar processors of comparable capacity Processors share same memory and I/O Processors are connected by a bus or other internal connection Memory access time is approximately the same for each processor All processors share access to I/O Either through same channels or different channels giving paths to same devices All processors can perform the same functions (hence symmetric) System controlled by integrated operating system providing interaction between processors Interaction at job, task, file and data element levels Multiprogramming and Multiprocessing SMP Advantages         Performance If some work can be done in parallel Availability Since all processors can perform the same functions, failure of a single processor does not halt the system Incremental growth User can enhance performance by adding additional processors Scaling Vendors can offer range of products based on number of processors Block Diagram of Tightly Coupled Multiprocessor Organization Classification    Time shared or common bus Multiport memory Central control unit 4.4 HARDWARE MULTITHREADING Multithreading is the ability of a central processing unit (CPU) or a single core in a multi-core processor to execute multiple processes or threads concurrently, appropriately supported by the operating system. This approach differs from multiprocessing, as with multithreading the processes and threads have to share the resources of a single or multiple cores: the computing units, the CPU caches, and the translation look aside buffer (TLB). Where multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores. Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of each thread. For example, each thread would have a separate copy of the register file and the PC. The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming. The hardware must support the ability to change to a different thread relatively quickly. 
In particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles while a thread switch can be instantaneous. There are two main approaches to hardware multithreading. Fine-grained multithreading: switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round robin fashion, skipping any threads that are stalled at that time. To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle. One key advantage of fine-grained multithreading is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage of fine-grained multithreading is that it slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads. Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly stalls, such as second-level cache misses. This change relieves the need to have thread switching be essentially free and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall Coarse-grained multithreading suffers, Drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading. Because a processor with coarse-grained multithreading issues instructions from a single thread, when a stall occurs, the pipeline must be emptied or frozen. The new thread that begins executing after the stall must fill the pipeline before instructions will be able to complete. Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to the stall time. Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism. The key insight that motivates SMT is that multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. In the superscalar without hardware multithreading support, the use of issue slots is limited by a lack of instruction-level parallelism. In addition, a major stall, such as an instruction cache miss, can leave the entire processor idle. In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor. It reduces the number of completely idle clock cycles, the pipeline start-up overhead still leads to idle cycles, and limitations to ILP means all issue slots will not be used. In the fine-grained case, the interleaving of threads mostly eliminates fully empty slots. Because only a single thread issues instructions in a given clock cycle. FIGURE How four threads use the issue slots of a superscalar processor in different approaches. 
The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput for coarse multithreading. 4.5 MULTICORE PROCESSORS A multi-core processor is a single computing component with two or more independent actual processing units (called "cores"), which are the units that read and execute program instructions. The instructions are ordinary CPU instructions (such as add, move data, and branch), but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package. Multi-core processors may have:       Two cores (dual-core CPUs, for example, AMD Phenom II X2 and Intel Core Duo) Three cores (tri-core CPUs, for example, AMD Phenom II X3) Four cores (quad-core CPUs, for example, AMD Phenom II X4, Intel's i5 and i7 processors) Six cores (hexa-core CPUs, for example, AMD Phenom II X6 and Intel Core i7 Extreme Edition 980X) Eight cores (octa-core CPUs, for example, Intel Core i7 5960X Extreme Edition and AMD FX-8350) Ten cores (deca-core CPUs, for example, Intel Xeon E7-2850) A multi-core processor implements multiprocessing in a single physical package. Designers may couple cores in a multi-core device tightly or loosely. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar. Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores that are not identical (e.g. big.LITTLE) Multi-core processors are widely used across many application domains, including general-purpose, embedded, network, digital signal processing (DSP), and graphics (GPU) Benefits of Multi-Core Processors Multi-core processors offer developers the ability to apply more compute resources at a particular problem. These additional resources can be employed to offer two types of advantages, improved turnaround time or solving larger problem domains. An improved turnaround time example is the processing of a transaction in a Point-of-Sales system. The time it takes to process a transaction may be improved by taking advantage of multicore processors. While one processor core is updating the terminal display, another processor core could be tasked with processing the user input. A larger problem domain example would be the servers that handle backend processing of the transactions arriving from the various Point-of-Sales terminals. By taking advantage of multi-core processor, any one server could handle a greater number of transactions in the desired response time. A shared-memory multiprocessor is a computer system composed of multiple independent processors that execute different instruction streams. 
Using Flynns‘s classification an SMP is a multiple-instruction multiple-data (MIMD) architecture. The processors share a common memory address space and communicate with each other via memory. A typical shared-memory multiprocessor includes some number of processors with local caches, all interconnected with each other and with common memory via an interconnection (e.g., a bus). Shared-memory multiprocessors can either be symmetric or asymmetric. Symmetric systems imply that all processors that compose the system are identical. Conversely, asymmetric systems have different types of processors sharing memory. Most multicore chips are single-chip symmetric shared-memory multiprocessors Uniform memory access Uniform memory access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. Uniform memory access computer architectures are often contrasted with non-uniform memory access (NUMA) architectures. In the UMA architecture, each processor may use a private cache. Peripherals are also shared in some fashion. The UMA model is suitable for general purpose and time sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications There are three types of UMA architectures:    UMA using bus-based symmetric multiprocessing (SMP) architectures; UMA using crossbar switches; UMA using multistage interconnection networks. Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than nonlocal memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. The processors connect to the bus or crossbar by connections of varying thickness/number. This shows that different CPUs have different access priorities to memory based on their relative location. UNIT V MEMORY AND I/O SYSTEMS Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors. 5.1 Memory hierarchy 5.2 Memory technologies 5.3 Cache basics 5.4 Measuring and improving cache performance 5.5 Virtual memory 5.6 TLBs 5.7 Input/output system 5.8 programmed I/O 5.9 DMA and interrupts 5.10 I/O processors. 5.1 MEMORY HIERARCHY A typical memory hierarchy starts with a small, expensive, and relatively fast unit, called the cache, followed by a larger, less expensive, and relatively slow main memory unit. Cache and main memory are built using solid-state semiconductor material (typically CMOS transistors). It is customary to call the fast memory level the primary memory. The solid-state memory is followed by larger, less expensive, and far slower magnetic memories that consist typically of the (hard) disk and the tape. It is customary to call the disk the secondary memory, while the tape is conventionally called the tertiary memory. 
The objective behind designing a memory hierarchy is to have a memory system that performs as if it consists entirely of the fastest unit and whose cost is dominated by the cost of the slowest unit. The memory hierarchy can be characterized by a number of parameters. Among these Parameters are the access type, capacity, cycle time, latency, bandwidth, and cost. The Term access refers to the action that physically takes place during a read or writes operation. The capacity of a memory level is usually measured in bytes. The cycle time is defined as the time elapsed from the start of a read operation to the start of a subsequent read. The latency is defined as the time interval between the request for information and the access to the first bit of that information. The bandwidth provides a measure of the number of bits per second that can be accessed. The cost of a memory level is usually specified as dollars per megabytes. The term random access refers to the fact that any access to any memory location takes the same fixed amount of time regardless of the actual memory location and/or the sequence of accesses that takes place. For example, if a write operation to memory location 100 takes 15 ns and if this operation is followed by a read operation to memory location 3000, then the latter operation will also take 15 ns. This is to be compared to sequential access in which if access to location 100 takes 500 ns, and if a consecutive access to location 101 takes 505 ns, then it is expected that an access to location 300 may take 1500 ns. This is because the memory has to cycle through locations 100 to 300, with each location requiring 5 ns. The effectiveness of a memory hierarchy depends on the principle of moving information into the fast memory infrequently and accessing it many times before replacing it with new information. This principle is possible due to a phenomenon called locality of reference; that is, within a given period of time, programs tend to reference a relatively confined area of memory repeatedly. There exist two forms of locality: spatial and temporal locality. Spatial locality refers to the phenomenon that when a given address has been referenced, it is most likely that addresses near it will be referenced within a short period of time. The memory is divided into large number of small parts called cells. Each location or cell has a unique address which varies from zero to memory size minus one. For example if computer has 64k words, then this memory unit has 64 * 1024=65536 memory locations. The address of these locations varies from 0 to 65535. Figure Typical memory hierarchy TABLE Memory Hierarchy Parameters 5.2 MEMORY TECHNOLOGIES Memory is primarily of three types  Cache Memory  Primary Memory/Main Memory  Secondary Memory Cache Memory Cache memory is a very high speed semiconductor memory which can speed up CPU. It acts as a buffer between the CPU and main memory. It is used to hold those parts of data and program which are most frequently used by CPU. The parts of data and programs are transferred from disk to cache memory by operating system, from where CPU can access them. Advantages  The advantages of cache memory are as follows:  Cache memory is faster than main memory.  It consumes less access time as compared to main memory.  It stores the program that can be executed within a short period of time.  It stores data for temporary use. Disadvantages  The disadvantages of cache memory are as follows:  Cache memory has limited capacity. 
 It is very expensive. Primary Memory (Main Memory) Primary memory holds only those data and instructions on which computer is currently working. It has limited capacity and data is lost when power is switched off. It is generally made up of semiconductor device. These memories are not as fast as registers. The data and instruction required to be processed reside in main memory. It is divided into two subcategories RAM and ROM. Characteristics of Main Memory  These are semiconductor memories  It is known as main memory.      Usually volatile memory. Data is lost in case power is switched off. It is working memory of the computer. Faster than secondary memories. A computer cannot run without primary memory. Secondary Memory This type of memory is also known as external memory or non-volatile. It is slower than main memory. These are used for storing data/Information permanently. CPU directly does not access these memories instead they are accessed via input-output routines. Contents of secondary memories are first transferred to main memory, and then CPU can access it. For example : disk, CD-ROM, DVD etc. Characteristic of Secondary Memory  These are magnetic and optical memories  It is known as backup memory.  It is non-volatile memory.  Data is permanently stored even if power is switched off.  It is used for storage of data in a computer.  Computer may run without secondary memory.  Slower than primary memories. RAM(Random Access Memory) is the internal memory of the CPU for storing data, program and program result. It is read/write memory which stores data until the machine is working. As soon as the machine is switched off, data is erased. Access time in RAM is independent of the address that is, each storage location inside the memory is as easy to reach as other locations and takes the same amount of time. Data in the RAM can be accessed randomly but it is very expensive. RAM is volatile, i.e. data stored in it is lost when we switch off the computer or if there is a power failure. Hence a backup uninterruptible power system(UPS) is often used with computers. RAM is small, both in terms of its physical size and in the amount of data it can hold. RAM is of two types 1. Static RAM (SRAM) 2. Dynamic RAM (DRAM) Static RAM (SRAM) The word static indicates that the memory retains its contents as long as power is being supplied. However, data is lost when the power gets down due to volatile nature. SRAM chips use a matrix of 6-transistors and no capacitors. Transistors do not require power to prevent leakage, so SRAM need not have to be refreshed on a regular basis. Because of the extra space in the matrix, SRAM uses more chips than DRAM for the same amount of storage space, thus making the manufacturing costs higher. So SRAM is used as cache memory and has very fast access. Characteristic of the Static RAM  It has long life  There is no need to refresh  Faster  Used as cache memory  Large size  Expensive  High power consumption Dynamic RAM (DRAM) DRAM, unlike SRAM, must be continually refreshed in order to maintain the data. This is done by placing the memory on a refresh circuit that rewrites the data several hundred times per second. DRAM is used for most system memory because it is cheap and small. All DRAMs are made up of memory cells which are composed of one capacitor and one transistor. 
Characteristics of the Dynamic RAM  It has short data lifetime  Need to be refreshed continuously  Slower as compared to SRAM  Used as RAM  Lesser in size  Less expensive  Less power consumption  Secondary Memory / Non Volatile Memory: Secondary memory is external and permanent memory that is useful to store the external storage media such as floppy disk, magnetic disks, magnetic tapes and etc cache devices. Secondary memory deals with following types of components. ROM stands for Read Only Memory. The memory from which we can only read but cannot write on it. This type of memory is non-volatile. The information is stored permanently in such memories during manufacture. A ROM, stores such instructions that are required to start a computer. This operation is referred to as bootstrap. ROM chips are not only used in the computer but also in other electronic items like washing machine and microwave oven. ROM:Following are the various types of ROM MROM (Masked ROM) The very first ROMs were hard-wired devices that contained a pre-programmed set of data or instructions. These kind of ROMs are known as masked ROMs which are inexpensive. PROM (Programmable Read only Memory) PROM is read-only memory that can be modified only once by a user. The user buys a blank PROM and enters the desired contents using a PROM program. Inside the PROM chip there are small fuses which are burnt open during programming. It can be programmed only once and is not erasable. EPROM (Erasable and Programmable Read Only Memory) The EPROM can be erased by exposing it to ultra-violet light for a duration of up to 40 minutes. Usually, an EPROM eraser achieves this function. During programming, an electrical charge is trapped in an insulated gate region. The charge is retained for more than ten years because the charge has no leakage path. For erasing this charge, ultra-violet light is passed through a quartz crystal window(lid). This exposure to ultra-violet light dissipates the charge. During normal use the quartz lid is sealed with a sticker. EEPROM (Electrically Erasable and Programmable Read Only Memory) The EEPROM is programmed and erased electrically. It can be erased and reprogrammed about ten thousand times. Both erasing and programming take about 4 to 10 ms (milli second). In EEPROM, any location can be selectively erased and programmed. EEPROMs can be erased one byte at a time, rather than erasing the entire chip. Hence, the process of reprogramming is flexible but slow. Advantages of ROM The advantages of ROM are as follows:  Non-volatile in nature  These cannot be accidentally changed  Cheaper than RAMs  Easy to test  More reliable than RAMs  These are static and do not require refreshing Cache Memory: Memory less than the access time of CPU so, the performance will decrease through less access time. Speed mismatch will decrease through maintain cache memory. Main memory can store huge amount of data but the cache memory normally kept small and low expensive cost. All types of external media like Magnetic disks, Magnetic drives and etc store in cache memory to provide quick access tools to the users. 5.3 CACHE BASICS Cache Memory Cache memory is a very high speed semiconductor memory which can speed up CPU. It acts as a buffer between the CPU and main memory. It is used to hold those parts of data and program which are most frequently used by CPU. The parts of data and programs are transferred from disk to cache memory by operating system, from where CPU can access them. 
Advantages
• Cache memory is faster than main memory.
• It consumes less access time as compared to main memory.
• It stores the program that can be executed within a short period of time.
• It stores data for temporary use.
Disadvantages
• Cache memory has limited capacity.
• It is very expensive.
(Virtual memory, in contrast, is a technique that allows the execution of processes that are not completely available in memory; its main visible advantage is that programs can be larger than physical memory. Virtual memory is the separation of user logical memory from physical memory, and it is discussed further in Section 5.5.)
The information expected to be used most frequently by the CPU is kept in the cache (a small high-speed memory that is near the CPU). The end result is that at any given time some active portion of the main memory is duplicated in the cache. Therefore, when the processor makes a request for a memory reference, the request is first sought in the cache. If the request corresponds to an element that is currently residing in the cache, we call that a cache hit. The cache hit ratio, hc, is defined as the probability of finding the requested element in the cache; the cache miss ratio, (1 - hc), is the probability of not finding it there. If the requested element is not found in the cache, it has to be brought from a subsequent memory level in the memory hierarchy.
Figure: Memory interleaving using eight modules
Impact of Temporal Locality
In this case, we assume that instructions in program loops, which are executed many times (say n times), once loaded into the cache, are used more than once before they are replaced by new instructions. The average access time, tav, is then given by
tav = (tm + n * tc) / n = tm/n + tc
In deriving this expression, it was assumed that the first reference to the requested memory element created a cache miss, leading to the transfer of a main memory block in time tm; following that, n accesses were made to the same requested element, each taking the cache access time tc. The expression reveals that as the number of repeated accesses, n, increases, the average access time decreases, a desirable feature of the memory hierarchy.
Impact of Spatial Locality
In this case, it is assumed that the size of the block transferred from the main memory to the cache, upon a cache miss, is m elements. We also assume that, due to spatial locality, all m elements are requested, one at a time, by the processor. Based on these assumptions, the average access time, tav, is given by
tav = (tm + m * tc) / m = tm/m + tc
In deriving this expression, it was assumed that the requested memory element created a cache miss, leading to the transfer of a main memory block, consisting of m elements, in time tm; following that, m accesses were made, one for each of the elements constituting the block. The expression reveals that as the number of elements in a block, m, increases, the average access time decreases, a desirable feature of the memory hierarchy.
Cache-Mapping Function
We present the cache-mapping function by considering the interface between two successive levels in the memory hierarchy: a primary level and a secondary level. If the focus is on the interface between the cache and main memory, then the cache represents the primary level, while the main memory represents the secondary level. The same principles apply to the interface between any two memory levels in the hierarchy. It should be noted that a request for accessing a memory element is made by the processor through issuing the address of the requested element.
The address issued by the processor may correspond to that of an element that exists currently in the cache (cache hit); otherwise, it may correspond to an element that is currently residing in the main memory. Therefore, address translation has to be made in order to determine the whereabouts of the requested element. This is one of the functions performed by the memory management unit (MMU). The system address represents the address issued by the processor for the requested element. This address is used by an address translation function inside the MMU. If address translation reveals that the issued address corresponds to an element currently residing in the cache, then the element will be made available to the processor. If, on the other hand, the element is not currently in the cache, then it will be brought (as part of a block) from the main memory, placed in the cache, and the requested element made available to the processor.
Figure: Address mapping operation
Cache Memory Organization
There are three main organization techniques used for cache memory, discussed below. These techniques differ in two main aspects:
1. The criterion used to place, in the cache, an incoming block from the main memory.
2. The criterion used to replace a cache block by an incoming block (when the cache is full).
Direct Mapping
This is the simplest among the three techniques. Its simplicity stems from the fact that it places an incoming main memory block into a specific fixed cache block location. The placement is done based on a fixed relation between the incoming block number i, the cache block number j, and the number of cache blocks N:
j = i mod N
The steps of the protocol are:
1. Use the Block field to determine the cache block that should contain the element requested by the processor. The Block field is used directly to determine the cache block sought, hence the name of the technique: direct mapping.
2. Check the corresponding Tag memory to see whether there is a match between its content and that of the Tag field. A match indicates that the targeted cache block determined in step 1 is currently holding the main memory element requested by the processor, that is, a cache hit.
3. Among the elements contained in the cache block, the targeted element can be selected using the Word field.
4. If no match is found in step 2, this indicates a cache miss. The required block has to be brought from the main memory, deposited in the cache, and the targeted element made available to the processor. The cache Tag memory and the cache block memory have to be updated accordingly.
Figure: Direct-mapped address translation
Fully Associative Mapping
In this technique, an incoming main memory block can be placed in any available cache block. Therefore, the address issued by the processor need only have two fields: the Tag and Word fields. The first uniquely identifies the block while it resides in the cache; the second identifies the element within the block that is requested by the processor. The MMU interprets the address issued by the processor by dividing it into these two fields, as shown in the figure. The length, in bits, of each field is given by:
1. Word field = log2 B, where B is the size of the block in words
2. Tag field = log2 M, where M is the size of the main memory in blocks
3. The number of bits in the main memory address = log2 (B x M)
It should be noted that the total number of bits as computed by the first two equations should add up to the length of the main memory address. This can be used as a check on the correctness of your computation.
Figure: Associative-mapped address fields
1. Use the Tag field to search the Tag memory for a match with any of the tags stored.
2. A match in the Tag memory indicates that the corresponding cache block is currently holding the main memory element requested by the processor, that is, a cache hit.
3. Among the elements contained in the cache block, the targeted element can be selected using the Word field.
4. If no match is found in step 1, this indicates a cache miss. The required block has to be brought from the main memory, deposited in the first available cache block, and the targeted element (word) made available to the processor. The cache Tag memory and the cache block memory have to be updated accordingly.
It should be noted that the search made in step 1 requires matching the Tag field of the address against each and every entry in the Tag memory. It should also be noted that, regardless of the cache organization used, a mechanism is needed to ensure that any accessed cache block contains valid information. The validity of the information in a cache block can be checked via the use of a single bit for each cache block, called the valid bit: if the valid bit is 1, the corresponding cache block carries valid information; otherwise, the information in the cache block is invalid.
Figure: Associative-mapped address translation
Set-Associative Mapping
In the set-associative mapping technique, the cache is divided into a number of sets, each consisting of a number of blocks. A given main memory block maps to a specific cache set based on the equation
s = i mod S
where S is the number of sets in the cache, i is the main memory block number, and s is the cache set to which block i maps. Within the assigned set, however, an incoming block may be placed in any block. Therefore, the address issued by the processor is divided into three distinct fields: Tag, Set, and Word. The Set field uniquely identifies the specific cache set that should hold the targeted block. The Tag field uniquely identifies the targeted block within the determined set. The Word field identifies the element (word) within the block that is requested by the processor. According to the set-associative mapping technique, the MMU interprets the address issued by the processor by dividing it into these three fields. The length, in bits, of each field is given by:
1. Word field = log2 B, where B is the size of the block in words
2. Set field = log2 S, where S is the number of sets in the cache, and S = N/Bs, where N is the number of cache blocks and Bs is the number of blocks per set
3. Tag field = log2 (M/S), where M is the size of the main memory in blocks
4. The number of bits in the main memory address = log2 (B x M)
It should be noted that the total number of bits as computed by the first three equations should add up to the length of the main memory address; this can be used as a check on the correctness of your computation.
Figure: Set-associative-mapped address fields
1. Use the Set field to determine (directly) the specified set (for example, a 5-bit Set field selects one of 32 sets).
2. Use the Tag field to find a match with any of the blocks in the determined set. A match in the Tag memory indicates that the set determined in step 1 is currently holding the targeted block, that is, a cache hit.
3. Among the words (elements) contained in the hit cache block, the requested word is selected using a selector with the help of the Word field.
4. If no match is found in step 2, this indicates a cache miss. The required block has to be brought from the main memory, deposited in the specified set, and the targeted element (word) made available to the processor. The cache Tag memory and the cache block memory have to be updated accordingly.
Figure: Set-associative-mapped address translation
5.4 MEASURING AND IMPROVING CACHE PERFORMANCE
Measuring Time
CPU time can be divided into the clock cycles that the CPU spends executing the program and the clock cycles that the CPU spends waiting for the memory system. The memory-stall clock cycles come primarily from cache misses, and we make that assumption here. We also restrict the discussion to a simplified model of the memory system. In real processors, the stalls generated by reads and writes can be quite complex, and accurate performance prediction usually requires very detailed simulations of the processor and memory system. Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus those coming from writes:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
The read-stall cycles can be defined in terms of the number of read accesses per program, the miss penalty in clock cycles for a read, and the read miss rate:
Read-stall cycles = (Reads/Program) x Read miss rate x Read miss penalty
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time
Cache Performance
Average memory access time is a useful measure for evaluating the performance of a memory-hierarchy configuration. It tells us how much penalty the memory system imposes on each access (on average), and it can easily be converted into clock cycles for a particular CPU. Leaving the penalty in nanoseconds, however, allows two systems with different clock cycle times to be compared against a single memory system. There may be different penalties for instruction and data accesses; in this case, you may have to compute them separately, which requires knowing the fraction of references that are instructions and the fraction that are data. We can also compute the write penalty separately from the read penalty. This may be necessary for two reasons: miss rates are different for each situation, and miss penalties are different for each situation. Treating them as a single quantity yields a useful CPU time formula. Assuming cache hit costs are included as part of the normal CPU execution cycle, then
CPU time = IC x CPI x CC = IC x (CPIideal + Memory-stall cycles per instruction) x CC
Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
Read-stall cycles = (Reads/Program) x Read miss rate x Read miss penalty
Write-stall cycles = (Writes/Program x Write miss rate x Write miss penalty) + Write buffer stalls
For write-through caches, assuming the write buffer stalls are negligible, we can simplify this to
Memory-stall cycles = (Memory accesses/Program) x Miss rate x Miss penalty
5.5 VIRTUAL MEMORY
Virtual memory is a technique that allows the execution of processes which are not completely available in memory. The main visible advantage of this scheme is that programs can be larger than physical memory.
Virtual memory is the separation of user logical memory from physical memory. This separation allows an extremely large virtual memory to be provided for programmers when only a smaller physical memory is available. The following are situations in which the entire program is not required to be loaded fully in main memory:
User-written error handling routines are used only when an error occurs in the data or computation.
Certain options and features of a program may be used rarely.
Many tables are assigned a fixed amount of address space even though only a small amount of the table is actually used.
The ability to execute a program that is only partially in memory would confer many benefits:
Fewer I/O operations would be needed to load or swap each user program into memory.
A program would no longer be constrained by the amount of physical memory that is available.
Each user program could take less physical memory, so more programs could be run at the same time, with a corresponding increase in CPU utilization and throughput.
Virtual memory is commonly implemented by demand paging. It can also be implemented in a segmentation system, and demand segmentation can likewise be used to provide virtual memory.
Demand Paging
A demand paging system is quite similar to a paging system with swapping. When we want to execute a process, we swap it into memory. Rather than swapping the entire process into memory, however, we use a lazy swapper called a pager. When a process is to be swapped in, the pager guesses which pages will be used before the process is swapped out again, and brings only those necessary pages into memory. Thus, it avoids reading into memory pages that will not be used anyway, decreasing the swap time and the amount of physical memory needed.
Hardware support is required to distinguish between pages that are in memory and pages that are on disk, using the valid-invalid bit scheme: the bit in each page-table entry indicates whether the page is valid (resident) or invalid. Marking a page invalid has no effect as long as the process never attempts to access that page. While the process executes and accesses pages that are memory resident, execution proceeds normally. Access to a page marked invalid causes a page-fault trap, the result of the operating system not yet having brought the desired page into memory. A page fault is handled as follows (a minimal code sketch appears after the list of advantages below):
Step 1: Check an internal table for this process to determine whether the reference was a valid or an invalid memory access.
Step 2: If the reference was invalid, terminate the process. If it was valid but the page has not yet been brought in, page it in.
Step 3: Find a free frame.
Step 4: Schedule a disk operation to read the desired page into the newly allocated frame.
Step 5: When the disk read is complete, modify the internal table kept with the process and the page table to indicate that the page is now in memory.
Step 6: Restart the instruction that was interrupted by the illegal address trap. The process can now access the page as though it had always been in memory.
Therefore, the operating system reads the desired page into memory and restarts the process as though the page had always been in memory.
Advantages of Demand Paging:
 Large virtual memory.
 More efficient use of memory.
 Unconstrained multiprogramming: there is no limit on the degree of multiprogramming.
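As promised above, here is a minimal C sketch of the six page-fault handling steps. Everything in it (the PageTableEntry layout, the trivial frame allocator, and the printf stand-ins for the disk read and the instruction restart) is hypothetical and greatly simplified; a real handler runs in the kernel and drives real hardware.

/* Simplified sketch of the six page-fault handling steps listed above. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_PAGES 8

typedef struct {
    bool     valid;  /* valid-invalid bit: is the page in memory?   */
    bool     legal;  /* is the reference within the address space?  */
    unsigned frame;  /* physical frame number when valid            */
} PageTableEntry;

static PageTableEntry page_table[NUM_PAGES];
static unsigned next_free_frame = 0;

static void handle_page_fault(unsigned page) {
    if (!page_table[page].legal) {          /* steps 1-2: invalid reference */
        printf("illegal reference to page %u: terminate process\n", page);
        return;
    }
    unsigned frame = next_free_frame++;     /* step 3: find a free frame    */
    printf("disk read: page %u -> frame %u\n", page, frame);  /* step 4    */
    page_table[page].frame = frame;         /* step 5: update page table    */
    page_table[page].valid = true;
    printf("restart instruction touching page %u\n", page);   /* step 6    */
}

int main(void) {
    for (unsigned p = 0; p < NUM_PAGES; p++)
        page_table[p].legal = (p < 6);      /* pretend pages 0-5 belong to the process */
    handle_page_fault(2);   /* valid page, not yet in memory */
    handle_page_fault(7);   /* outside the address space     */
    return 0;
}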
Disadvantages of Demand Paging:
 The number of tables and the amount of processor overhead for handling page interrupts are greater than in the case of simple paged management techniques.
 There is no explicit constraint on the size of a job's address space.
Page Replacement Algorithm
Page replacement algorithms are the techniques by which the operating system decides which memory pages to swap out (write to disk) when a page of memory needs to be allocated. Page replacement happens whenever a page fault occurs and no free page can be used for the allocation, either because no pages are available or because the number of free pages is lower than required. When the page that was selected for replacement and paged out is referenced again, it has to be read in from disk, which requires waiting for I/O completion. This determines the quality of a page replacement algorithm: the less time spent waiting for page-ins, the better the algorithm. A page replacement algorithm looks at the limited information about page accesses provided by the hardware and tries to select which pages should be replaced so as to minimize the total number of page misses, while balancing this against the cost in primary storage and processor time of the algorithm itself. There are many different page replacement algorithms. We evaluate an algorithm by running it on a particular string of memory references and computing the number of page faults.
Reference String
The string of memory references is called a reference string. Reference strings are generated artificially or by tracing a given system and recording the address of each memory reference. The latter choice produces a large amount of data, about which we note two things. First, for a given page size we need to consider only the page number, not the entire address. Second, if we have a reference to a page p, then any immediately following references to page p will never cause a page fault: page p will be in memory after the first reference, so the immediately following references will not fault. For example, consider the following sequence of addresses: 123, 215, 600, 1234, 76, 96. If the page size is 100, then the reference string is 1, 2, 6, 12, 0, 0.
First In First Out (FIFO) algorithm: The oldest page in main memory is the one selected for replacement. It is easy to implement: keep a list, replace pages from the tail and add new pages at the head. (A short fault-counting sketch using FIFO appears below.)
Optimal page algorithm: An optimal page-replacement algorithm has the lowest page-fault rate of all algorithms. Such an algorithm exists and has been called OPT or MIN: replace the page that will not be used for the longest period of time, which requires knowing the time at which each page will next be used.
Least Recently Used (LRU) algorithm: The page which has not been used for the longest time in main memory is the one selected for replacement. It is easy to implement: keep a list and replace pages by looking back in time.
Page buffering algorithm: To get a process started quickly, keep a pool of free frames. On a page fault, select a page to be replaced, write the new page into a frame from the free pool, mark the page table, and restart the process. Then write the dirty page out to disk and place the frame holding the replaced page into the free pool.
Least Frequently Used (LFU) algorithm: The page with the smallest reference count is the one selected for replacement. This algorithm suffers when a page is used heavily during the initial phase of a process but then is never used again.
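The following C sketch counts page faults for the FIFO policy on the reference string derived above (1, 2, 6, 12, 0, 0). The choice of three frames is arbitrary and made only for illustration; LRU or OPT could be substituted by changing the victim-selection rule.

/* FIFO page replacement: count faults on the reference string above. */
#include <stdio.h>
#include <stdbool.h>

#define NFRAMES 3

int main(void) {
    int ref[] = {1, 2, 6, 12, 0, 0};
    int nrefs = sizeof ref / sizeof ref[0];
    int frames[NFRAMES];
    int oldest = 0, loaded = 0, faults = 0;

    for (int i = 0; i < nrefs; i++) {
        bool hit = false;
        for (int j = 0; j < loaded; j++)
            if (frames[j] == ref[i]) { hit = true; break; }

        if (!hit) {                               /* page fault */
            faults++;
            if (loaded < NFRAMES) {
                frames[loaded++] = ref[i];        /* a free frame is available */
            } else {
                frames[oldest] = ref[i];          /* evict the oldest page     */
                oldest = (oldest + 1) % NFRAMES;
            }
        }
        printf("ref %2d -> %s\n", ref[i], hit ? "hit" : "fault");
    }
    printf("total faults: %d\n", faults);
    return 0;
}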
Most Frequently Used (MFU) algorithm: This algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used.
5.6 TLBS
Translation Look aside Buffer (TLB)
A translation look aside buffer (TLB) is a cache that memory management hardware uses to improve virtual address translation speed. The majority of desktop, laptop, and server processors include one or more TLBs in the memory management hardware, and a TLB is nearly always present in any hardware that utilizes paged or segmented virtual memory. The TLB is sometimes implemented as content-addressable memory (CAM). The CAM search key is the virtual address and the search result is a physical address. If the requested address is present in the TLB, the CAM search yields a match quickly and the retrieved physical address can be used to access memory; this is called a TLB hit. If the requested address is not in the TLB, it is a miss, and the translation proceeds by looking up the page table in a process called a page walk. The page walk takes a long time compared to the processor speed, as it involves reading the contents of multiple memory locations and using them to compute the physical address. After the physical address is determined by the page walk, the virtual-to-physical mapping is entered into the TLB. The PowerPC 604, for example, has a two-way set-associative TLB for data loads and stores.
A translation look aside buffer has a fixed number of slots containing page table entries and segment table entries; page table entries map virtual addresses to physical addresses and intermediate table addresses, while segment table entries map virtual addresses to segment addresses, intermediate table addresses and page table addresses. The virtual memory is the memory space as seen from a process; this space is often split into pages of a fixed size (in paged memory) or, less commonly, into segments of variable sizes (in segmented memory). The page table, generally stored in main memory, keeps track of where the virtual pages are stored in physical memory. The TLB is a cache of the page table, representing only a subset of the page table contents.
Referencing physical memory addresses, a TLB may reside between the CPU and the CPU cache, between the CPU cache and primary storage memory, or between levels of a multi-level cache. The placement determines whether the cache uses physical or virtual addressing. If the cache is virtually addressed, requests are sent directly from the CPU to the cache, and the TLB is accessed only on a cache miss. If the cache is physically addressed, the CPU does a TLB lookup on every memory operation and the resulting physical address is sent to the cache. In a Harvard architecture or a hybrid thereof, separate virtual address spaces or memory access hardware may exist for instructions and data. This can lead to distinct TLBs for each access type: an Instruction Translation Look aside Buffer (ITLB) and a Data Translation Look aside Buffer (DTLB). Various benefits have been demonstrated with separate data and instruction TLBs.
A common optimization for physically addressed caches is to perform the TLB lookup in parallel with the cache access. The low-order bits of any virtual address (e.g., in a virtual memory system having 4 KB pages, the lower 12 bits of the virtual address) represent the offset of the desired address within the page, and thus they do not change in the virtual-to-physical translation.
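A small C sketch of the address split just described: the low offset bits pass through unchanged while the virtual page number is looked up in a tiny fully associative TLB, with a stand-in page walk on a miss. The TLB size, the round-robin replacement rule, and the page_walk mapping are all invented for illustration.

/* Split a virtual address into page number and offset, then probe a tiny TLB. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12u              /* 4 KB pages: low 12 bits are the offset */
#define TLB_ENTRIES 4

typedef struct { bool valid; uint32_t vpn; uint32_t pfn; } TlbEntry;
static TlbEntry tlb[TLB_ENTRIES];
static unsigned victim = 0;        /* trivial round-robin replacement */

/* Stand-in for the page walk: here just a fixed mapping vpn -> vpn + 0x100. */
static uint32_t page_walk(uint32_t vpn) { return vpn + 0x100; }

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {          /* TLB hit  */
            printf("TLB hit  vpn=0x%x\n", (unsigned)vpn);
            return (tlb[i].pfn << PAGE_BITS) | offset;
        }

    uint32_t pfn = page_walk(vpn);                        /* TLB miss */
    tlb[victim] = (TlbEntry){ true, vpn, pfn };           /* refill an entry */
    victim = (victim + 1) % TLB_ENTRIES;
    printf("TLB miss vpn=0x%x (page walk)\n", (unsigned)vpn);
    return (pfn << PAGE_BITS) | offset;
}

int main(void) {
    printf("paddr = 0x%x\n", (unsigned)translate(0x00403ABCu));
    printf("paddr = 0x%x\n", (unsigned)translate(0x00403F00u)); /* same page: hit */
    return 0;
}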
During a cache access, two steps are performed: an index is used to find an entry in the cache's data store, and then the tags for the cache line found are compared. If the cache is structured in such a way that it can be indexed using only the bits that do not change in translation, the cache can perform its "index" operation while the TLB translates the upper bits of the address. Then the translated address from the TLB is passed to the cache, which performs a tag comparison to determine whether the access was a hit or a miss. It is possible to perform the TLB lookup in parallel with the cache access even if the cache must be indexed using some bits that may change upon address translation; see the address translation section in the cache article for more details about virtual addressing as it pertains to caches and TLBs.
Performance implications
The CPU has to access main memory for an instruction cache miss, data cache miss, or TLB miss. The third case (the simplest one) is where the desired information itself actually is in a cache, but the information for virtual-to-physical translation is not in the TLB. These are all slow, because they require accessing a slower level of the memory hierarchy, so a well-functioning TLB is important. Indeed, a TLB miss can be more expensive than an instruction or data cache miss, because it requires not just a load from main memory but a page walk, requiring several loads.
If the page working set does not fit into the TLB, then TLB thrashing occurs: frequent TLB misses occur, with each newly cached page displacing one that will soon be used again, degrading performance in exactly the same way as thrashing of the instruction or data cache does. TLB thrashing can occur even if instruction cache or data cache thrashing is not occurring, because these are cached in units of different sizes. Instructions and data are cached in small blocks (cache lines), not entire pages, but address lookup is done at the page level. Thus, even if the code and data working sets fit into cache, if the working sets are fragmented across many pages, the virtual address working set may not fit into the TLB, causing TLB thrashing. Appropriate sizing of the TLB thus requires considering not only the size of the corresponding instruction and data caches, but also how these are fragmented across multiple pages.
Multiple TLBs
TLBs may have multiple levels. CPUs can be (and nowadays usually are) built with multiple TLBs, for example a small "L1" TLB (potentially fully associative) that is extremely fast, and a larger "L2" TLB that is somewhat slower. When ITLB and DTLB are used, a CPU can have three (ITLB1, DTLB1, TLB2) or four TLBs. For instance, Intel's Nehalem microarchitecture has a four-way set-associative L1 DTLB with 64 entries for 4 KiB pages and 32 entries for 2/4 MiB pages, an L1 ITLB with 128 entries for 4 KiB pages using four-way associativity and 14 fully associative entries for 2/4 MiB pages (both parts of the ITLB divided statically between two threads), and a unified 512-entry L2 TLB for 4 KiB pages, 4-way associative. Some TLBs may have separate sections for small pages and huge pages.
TLB miss handling
Two schemes for handling TLB misses are commonly found in modern architectures. With hardware TLB management, the CPU automatically walks the page tables (using the CR3 register on x86, for instance) to see if there is a valid page table entry for the specified virtual address.
If an entry exists, it is brought into the TLB and the TLB access is retried: this time the access will hit, and the program can proceed normally. If the CPU finds no valid entry for the virtual address in the page tables, it raises a page fault exception, which the operating system must handle. Handling page faults usually involves bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and resuming the program. With a hardware-managed TLB, the format of the TLB entries is not visible to software and can change from CPU to CPU without causing loss of compatibility for the programs.
With software-managed TLBs, a TLB miss generates a "TLB miss" exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs include instructions that allow loading entries into any slot in the TLB; the format of the TLB entry is defined as part of the instruction set architecture (ISA). The MIPS architecture specifies a software-managed TLB; the SPARC V9 architecture allows an implementation of SPARC V9 to have no MMU, an MMU with a software-managed TLB, or an MMU with a hardware-managed TLB; and the UltraSPARC architecture specifies a software-managed TLB. The Itanium architecture provides the option of using either software- or hardware-managed TLBs. The Alpha architecture's TLB is managed in PALcode rather than in the operating system. Because the PALcode for a processor can be processor-specific and operating-system-specific, this allows different versions of PALcode to implement different page table formats for different operating systems, without requiring that the TLB format, and the instructions to control the TLB, be specified by the architecture.
5.7 INPUT/OUTPUT SYSTEM
A bus is a shared communication link, which uses one set of wires to connect multiple subsystems. The two major advantages of the bus organization are versatility and low cost.
Accessing I/O Devices
Most modern computers use a single-bus arrangement for connecting I/O devices to the CPU and memory. The bus enables all the devices connected to it to exchange information. It consists of three sets of lines: address, data, and control. The processor places a particular address (unique to an I/O device) on the address lines; the device that recognizes this address responds to the commands issued on the control lines. The processor requests either a read or a write, and the data are placed on the data lines.
Hardware to connect I/O devices to the bus: the interface circuit, which contains
• Address decoder
• Control circuits
• Data registers
• Status registers
The registers in the I/O interface provide buffering and control: flags in status registers, such as SIN and SOUT, and data registers, such as DATA-IN and DATA-OUT.
The execution of an input instruction at an input device address will cause the character stored in the input register of that device to be transferred to a specific register in the CPU.
Similarly, the execution of an output instruction at an output device address will cause the character stored in a specific register in the CPU to be transferred to the output register of that output device. In this case, the address and data lines from the CPU can be shared between the memory and the I/O devices, but a separate control line has to be used because of the need for executing dedicated input and output instructions. In a typical computer system there exists more than one input and more than one output device, so address decoder circuitry is needed for device identification. There is also a need for a status register for each input and output device: the status of an input device (whether it is ready to send data to the processor) is kept in the status register of that device, and similarly the status of an output device (whether it is ready to receive data from the processor) is kept in its status register. Input (output) registers, status registers, and address decoder circuitry represent the main components of an I/O interface (module).
Figure: Shared I/O arrangement
The main advantage of the shared I/O arrangement is the separation between the memory address space and that of the I/O devices. Its main disadvantage is the need for special input and output instructions in the processor instruction set. The shared I/O arrangement has mostly been adopted by Intel.
The second possible I/O arrangement is to treat input and output registers as if they were regular memory locations. In this case, a read operation from the address corresponding to the input register of an input device is equivalent to performing an input operation, and a write operation to the address corresponding to the output register of an output device is equivalent to performing an output operation. The main advantage of memory-mapped I/O is that the processor's ordinary read and write instructions perform the input and output operations, respectively; it eliminates the need for special I/O instructions. The main disadvantage of memory-mapped I/O is the need to reserve a certain part of the memory address space for addressing I/O devices, that is, a reduction in the available memory address space. Memory-mapped I/O has mostly been adopted by Motorola.
5.8 PROGRAMMED I/O
Figure: Memory-mapped I/O arrangement
These are the main hardware components required for communication between the processor and I/O devices. The way in which such communications take place (the protocol) has to be programmed in the form of routines that run under the control of the CPU.
1. The processor executes an input instruction from device 6, for example INPUT 6. The effect of executing this instruction is to send the device number to the address decoder circuitry in each input device in order to identify the specific input device to be involved. In this case, the output of the decoder in Device #6 will be enabled, while the outputs of all other decoders will be disabled.
2. The buffers (in the figure we assume there are eight such buffers) holding the data in the specified input device (Device #6) will be enabled by the output of the address decoder circuitry.
3. The data output of the enabled buffers will be available on the data bus.
4. The instruction decoding will gate the data available on the data bus into a particular register in the CPU, normally the accumulator.
Figure: Example eight-I/O-device connection to a processor
For output, the only difference is the direction of data transfer, which will be from a specific CPU register to the output register in the specified output device.
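The INPUT/OUTPUT instructions of the shared-I/O protocol above cannot be written in portable C, but the memory-mapped arrangement can: a device register is just a reserved address accessed through an ordinary load or store. The sketch below assumes hypothetical addresses and a ready bit for "Device #6"; real hardware defines its own register map.

/* Memory-mapped I/O: device registers occupy reserved memory addresses,
 * so ordinary loads and stores perform the I/O. Addresses are invented. */
#include <stdint.h>

#define DEV6_STATUS (*(volatile uint8_t *)0xFFFF0010u) /* assumed status register */
#define DEV6_DATA   (*(volatile uint8_t *)0xFFFF0014u) /* assumed data register   */
#define READY_BIT   0x01u

/* Read one character from input device #6 using plain memory accesses. */
uint8_t read_device6(void) {
    while ((DEV6_STATUS & READY_BIT) == 0)
        ;                       /* wait until the device flags data ready */
    return DEV6_DATA;           /* an ordinary load performs the input    */
}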
I/O operations performed in this manner are called programmed I/O; they are performed under CPU control. A complete instruction fetch, decode, and execute cycle has to be executed for every input and every output operation. Programmed I/O is useful in cases where one character at a time is to be transferred, for example a keyboard or a character-mode printer.
5.9 DMA AND INTERRUPTS
DIRECT MEMORY ACCESS (DMA)
The main idea of direct memory access (DMA) is to enable peripheral devices to cut out the "middle man" role of the CPU in data transfer: it allows peripheral devices to transfer data directly to and from memory without the intervention of the CPU. Having peripheral devices access memory directly frees the CPU to do other work, which leads to improved performance, especially for large transfers.
The DMA controller is a piece of hardware that controls one or more peripheral devices. It allows devices to transfer data to or from the system's memory without the help of the processor. In a typical DMA transfer, some event notifies the DMA controller that data needs to be transferred to or from memory. Both the DMA controller and the CPU use the memory bus, and only one of them can use the memory at a time. The DMA controller therefore sends a request to the CPU asking permission to use the bus, and the CPU returns an acknowledgment granting it bus access. The DMA controller can then take control of the bus and conduct the memory transfer independently; when the transfer is complete, it relinquishes control of the bus to the CPU. Processors that support DMA provide one or more input signals that the bus requester can assert to gain control of the bus and one or more output signals that the CPU asserts to indicate it has relinquished the bus. The figure shows how the DMA controller shares the CPU's memory bus.
Figure: DMA controller shares the CPU's memory bus
Direct memory access controllers require initialization by the CPU. Typical setup parameters include the address of the source area, the address of the destination area, the length of the block, and whether the DMA controller should generate a processor interrupt once the block transfer is complete. A DMA controller has an address register, a word count register, and a control register. The address register contains an address that specifies the memory location of the data to be transferred; it is typically possible to have the DMA controller automatically increment the address register after each word transfer, so that the next transfer is from the next memory location. The word count register holds the number of words to be transferred and is decremented by one after each word transfer. The control register specifies the transfer mode. (A hypothetical register-level setup sketch follows below.)
Direct memory access data transfer can be performed in burst mode or single-cycle mode. In burst mode, the DMA controller keeps control of the bus until all the data has been transferred to (from) memory from (to) the peripheral device. This mode of transfer is needed for fast devices, where the data transfer cannot be stopped until the entire transfer is done. In single-cycle mode (cycle stealing), the DMA controller relinquishes the bus after each transfer of one data word. This minimizes the amount of time the DMA controller keeps the CPU from controlling the bus, but it requires the bus request/acknowledge sequence to be performed for every single transfer; this overhead can degrade performance.
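Here is the hypothetical register-level setup sketch referred to above. It programs one DMA channel with a start address, a word count, and a control word selecting the direction, burst versus cycle-stealing mode, and a completion interrupt. The register addresses and bit assignments are invented; an actual controller's data sheet would dictate them.

/* Hypothetical DMA channel setup: address, word count, and control registers. */
#include <stdint.h>

#define DMA_ADDR   (*(volatile uint32_t *)0xFFFF1000u) /* transfer start address */
#define DMA_COUNT  (*(volatile uint32_t *)0xFFFF1004u) /* words to transfer      */
#define DMA_CTRL   (*(volatile uint32_t *)0xFFFF1008u) /* control register       */

#define CTRL_START      0x1u   /* begin the transfer                  */
#define CTRL_MEM_TO_DEV 0x2u   /* direction: memory to peripheral     */
#define CTRL_BURST      0x4u   /* burst mode; clear = cycle stealing  */
#define CTRL_IRQ_DONE   0x8u   /* interrupt the CPU on completion     */

/* Program the channel; the CPU is then free until the completion interrupt. */
void dma_start(uint32_t mem_addr, uint32_t word_count, int burst) {
    DMA_ADDR  = mem_addr;
    DMA_COUNT = word_count;
    DMA_CTRL  = CTRL_START | CTRL_MEM_TO_DEV | CTRL_IRQ_DONE
              | (burst ? CTRL_BURST : 0u);
}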
The single-cycle mode is preferred if the system cannot tolerate more than a few cycles of added interrupt latency, or if the peripheral devices can buffer very large amounts of data, causing the DMA controller to tie up the bus for an excessive amount of time. The following steps summarize DMA operation:
1. The DMA controller initiates the data transfer.
2. Data is moved (increasing the address in memory and reducing the count of words to be moved).
3. When the word count reaches zero, the DMA controller informs the CPU of the termination by means of an interrupt.
4. The CPU regains access to the memory bus.
A DMA controller may have multiple channels. Each channel has associated with it an address register and a count register. To initiate a data transfer, the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer (read or write). While the transfer is taking place, the CPU is free to do other things; when the transfer is complete, the CPU is interrupted. DMA channels cannot be shared between device drivers, so a device driver must be able to determine which DMA channel to use. Some devices have a fixed DMA channel, while others are more flexible, and the device driver can simply pick a free DMA channel to use. Linux tracks the usage of the DMA channels using a vector of dma_chan data structures (one per DMA channel). The dma_chan data structure contains just two fields: a pointer to a string describing the owner of the DMA channel and a flag indicating whether the DMA channel is allocated.
INTERRUPT-DRIVEN I/O
When the CPU is interrupted, it is required to discontinue its current activity, attend to the interrupting condition (serve the interrupt), and then resume its activity from wherever it stopped. Discontinuing the processor's current activity requires finishing execution of the current instruction, saving the processor status (mostly in the form of pushing register values onto a stack), and transferring control (jumping) to what is called the interrupt service routine (ISR). The service offered to an interrupt depends on its source. For example, if the interrupt is due to power failure, the action taken is to save the values of all processor registers and pointers so that resumption of correct operation can be guaranteed upon power return. In the case of an I/O interrupt, serving the interrupt means performing the required data transfer. Upon finishing serving an interrupt, the processor restores the original status by popping the relevant values from the stack; once the processor returns to the normal state, it can enable sources of interrupt again.
A further issue is serving multiple interrupts, for example the occurrence of another interrupt while the processor is currently serving one. The response to the new interrupt depends on its priority relative to that of the interrupt being served. If the newly arrived interrupt has priority less than or equal to that of the currently served one, it can wait until the processor finishes serving the current interrupt. If, on the other hand, the newly arrived interrupt has priority higher than that of the currently served interrupt (for example, a power-failure interrupt occurring while an I/O interrupt is being served), then the processor will have to push its status onto the stack and serve the higher-priority interrupt.
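The priority rule for nested interrupts described above can be sketched in C as follows. The interrupt sources, their priority values, and the "push/pop of processor status" are schematic stand-ins invented for illustration; real hardware performs these steps in its interrupt entry and return sequences.

/* Schematic sketch of the nested-interrupt priority rule described above. */
#include <stdio.h>

#define NUM_SOURCES 3
enum { IRQ_IO = 0, IRQ_TIMER = 1, IRQ_POWER_FAIL = 2 };

static const int priority[NUM_SOURCES] = { 2, 1, 3 }; /* higher = more urgent */
static int current_priority = 0;                      /* 0 = normal execution  */

static void serve(int irq);   /* forward declaration */

void interrupt_request(int irq) {
    if (priority[irq] <= current_priority) {
        printf("interrupt %d deferred (priority %d <= %d)\n",
               irq, priority[irq], current_priority);
        return;                          /* wait until the current ISR finishes */
    }
    int saved = current_priority;        /* push processor status (schematic)   */
    current_priority = priority[irq];
    serve(irq);                          /* run the interrupt service routine    */
    current_priority = saved;            /* pop status and resume what was interrupted */
}

static void serve(int irq) {
    printf("start ISR %d\n", irq);
    if (irq == IRQ_IO) {
        interrupt_request(IRQ_TIMER);      /* lower priority: deferred        */
        interrupt_request(IRQ_POWER_FAIL); /* higher priority: nested service */
    }
    printf("end ISR %d\n", irq);
}

int main(void) {
    interrupt_request(IRQ_IO);
    return 0;
}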
Interrupt Hardware
Computers are provided with interrupt hardware capability in the form of specialized interrupt lines to the processor. These lines are used to send interrupt signals to the processor. In the case of I/O, there exists more than one I/O device, so the processor should be provided with a mechanism that enables it to handle simultaneous interrupt requests and to recognize the interrupting device. Two basic schemes can be implemented to achieve this task: the first is called daisy chain bus arbitration (DCBA) and the second is called independent source bus arbitration (ISBA).
Interrupts in Operating Systems
The operating system saves the state of the interrupted process, analyzes the interrupt, and passes control to the appropriate routine to handle it. There are several types of interrupts, including I/O interrupts. An I/O interrupt notifies the operating system that an I/O device has completed or suspended its operation and needs some service from the CPU. To process an interrupt, the context of the current process must be saved and the interrupt handling routine must be invoked; this process is called context switching. A process context has two parts: processor context and memory context. The processor context is the state of the CPU's registers, including the program counter (PC), program status words (PSWs), and other registers. The memory context is the state of the program's memory, including the program and data. The interrupt handler is a routine that processes each different type of interrupt. The operating system must provide programs with a save area for their contexts, and it must provide an organized way of allocating and deallocating memory for the interrupted process. When the interrupt handling routine finishes processing the interrupt, the CPU is dispatched either to the interrupted process or to the highest-priority ready process, depending on whether the interrupted process is preemptive or nonpreemptive. If the process is nonpreemptive, it gets the CPU again: first its context is restored, then control is returned to the interrupted process.
Figure: Interrupt hardware schemes. (a) Daisy chain interrupt arrangement (b) Independent interrupt arrangement
5.10 I/O PROCESSORS
I/O Channels and Processors
The evolution of the I/O function:
1. The processor directly controls the peripheral device.
2. A controller or I/O module is added; the processor uses programmed I/O.
3. Same as 2, but interrupts are added.
4. The I/O module is given direct access to memory using DMA.
5. The I/O module is enhanced to become processor-like: an I/O channel.
6. The I/O module has local memory of its own, becoming computer-like: an I/O processor.
More and more, the I/O function is performed without processor involvement; the processor is increasingly relieved of I/O-related tasks, improving performance.
Characteristics of I/O Channels
 Extension of the DMA concept.
 Ability to execute I/O instructions: a special-purpose processor on the I/O channel has complete control over I/O operations.
 The processor does not execute I/O instructions itself; it initiates an I/O transfer by instructing the I/O channel to execute a program in memory. The program specifies the device or devices, the area or areas of memory, the priority, and the actions to take on error conditions.
Two types of I/O channels:
 Selector channel: controls multiple high-speed devices and is dedicated to the transfer of data with one of these devices at a time. Each device is handled by a controller, or I/O module, and the I/O channel controls these I/O controllers.
 Multiplexor channel: can handle multiple devices at the same time. A byte multiplexor is used for low-speed devices; a block multiplexor interleaves blocks of data from several devices.
I/O systems generally place greater emphasis on dependability and cost. They must also plan for expandability and for diversity of devices; performance plays a smaller role. Three characteristics are useful in organizing the wide variety of I/O systems:
 Behavior: input (read once), output (write only), or storage.
 Partner: either a human or a machine at the other end of the I/O device.
 Data rate: the peak rate at which data can be transferred between the I/O device and main memory or the processor. For example, a keyboard is an input device used by a human with a data rate of about 10 bytes per second.
Figure: A typical collection of I/O devices
Figure: Typical I/O devices
Interfacing I/O Devices to the Processor, Memory, and Operating System
Giving commands to I/O devices: basically, two techniques are used to address the devices.
1. Memory-mapped I/O: an I/O scheme in which portions of the address space are assigned to I/O devices. For example, a simple printer has two I/O device registers: a status register, which contains a done bit and an error bit, and a data register, into which the data to be printed is put. (A short polled-output sketch for such a printer appears below.)
2. The alternative method is to use dedicated I/O instructions in the processor. These specify both the device number and the command word. The processor communicates with the device via a set of wires normally included as part of the I/O bus; commands can be transmitted over the data lines in the bus. Examples: Intel IA-32, IBM 370.
Communicating with the processor:
Polling: the process of periodically checking the status of the I/O devices to determine the need to service them. Its disadvantage is that it wastes processor time.
Interrupt-driven I/O: employs I/O interrupts to indicate to the processor that an I/O device needs attention. A system can use either vectored interrupts or an exception Cause register. The Status register determines which devices can interrupt the computer; a more refined blocking of interrupts is available in the interrupt mask field, which has a bit corresponding to each bit in the pending-interrupt field of the Cause register.
Transferring data between a device and memory: polling and I/O interrupts are the basic methods for implementing data transfer.
Direct memory access (DMA): a mechanism that gives a device controller the ability to transfer data directly to or from memory without involving the processor.
I/O systems are evaluated on several different characteristics: dependability, the variety of I/O devices supported, and cost. These goals lead to widely varying schemes for interfacing I/O devices.
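As referenced above, here is a short polled-output sketch for the simple printer with a done bit, an error bit, and a data register. The register addresses and bit positions are hypothetical; the point is only that programmed I/O here is a busy-wait on the status register followed by an ordinary store to the data register.

/* Hypothetical memory-mapped printer: status register (done, error bits)
 * and data register, driven by polling (programmed I/O). */
#include <stdint.h>

#define PRN_STATUS (*(volatile uint8_t *)0xFFFF2000u)  /* assumed address */
#define PRN_DATA   (*(volatile uint8_t *)0xFFFF2004u)  /* assumed address */
#define DONE_BIT   0x01u
#define ERROR_BIT  0x02u

/* Polled output: returns 0 on success, -1 if the device reports an error. */
int print_char(char c) {
    while ((PRN_STATUS & DONE_BIT) == 0)
        ;                              /* poll until the printer is idle  */
    if (PRN_STATUS & ERROR_BIT)
        return -1;                     /* device error                    */
    PRN_DATA = (uint8_t)c;             /* writing starts the next print   */
    return 0;
}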