A Novel High-Speed Carry Skip Adder with AOI and OAI Logic using Verilog HDL

ABSTRACT: In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented.

Index Terms: Carry skip adder (CSKA), high performance, hybrid variable latency adders.

Chapter-1 INTRODUCTION TO VLSI

1.1 Very-large-scale integration
Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining thousands of transistors into a single chip. VLSI began in the 1970s when complex semiconductor and communication technologies were being developed. The microprocessor is a VLSI device.
Fig 1.1: A VLSI integrated-circuit die

1.2 History
During the 1920s, several inventors attempted to build devices that were intended to control the current in solid state diodes and convert them into triodes. Success, however, had to wait until after World War II, during which the attempt to improve silicon and germanium crystals for use as radar detectors led to improvements both in fabrication and in the theoretical understanding of the quantum mechanical states of carriers in semiconductors, and after which the scientists who had been diverted to radar development returned to solid state device development. With the invention of the transistor at Bell Labs in 1947, the field of electronics got a new direction, shifting from power-consuming vacuum tubes to solid state devices.
With the small and effective transistor at their hands, electrical engineers of the 1950s saw the possibility of constructing far more advanced circuits than before. However, as the complexity of the circuits grew, problems started arising. One problem was the size of the circuits. A complex circuit, like a computer, was dependent on speed. If the components of the computer were too large or the wires interconnecting them too long, the electric signals could not travel fast enough through the circuit, thus making the computer too slow to be effective.
Jack Kilby at Texas Instruments found a solution to this problem in 1958. Kilby's idea was to make all the components and the chip out of the same block (monolith) of semiconductor material. When the rest of the workers returned from vacation, Kilby presented his new idea to his superiors and was allowed to build a test version of his circuit. In September 1958, he had his first integrated circuit ready. Although the first integrated circuit was fairly crude and had some problems, the idea was groundbreaking. By making all the parts out of the same block of material and adding the metal needed to connect them as a layer on top of it, there was no more need for individual discrete components. No more wires and components had to be assembled manually. The circuits could be made smaller, and the manufacturing process could be automated.
From here the idea of integrating all components on a single silicon wafer came into existence, which led to the development of Small Scale Integration (SSI) in the early 1960s, Medium Scale Integration (MSI) in the late 1960s, Large Scale Integration (LSI), and, in the early 1980s, VLSI, with tens of thousands of transistors on a chip (later hundreds of thousands and now millions).

1.3 Developments
The first semiconductor chips held two transistors each. Subsequent advances added more and more transistors and, as a consequence, more individual functions or systems were integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a single device. Now known retrospectively as small-scale integration (SSI), improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past this mark, and today's microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of gates and transistors available on common devices has rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This is expected to become more commonplace as semiconductor fabrication moves from the current generation of 65 nm processes to the next 45 nm generations (while experiencing new challenges such as increased variation across process corners). A notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache.
Current designs, unlike the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high-performance logic blocks, like the SRAM (Static Random Access Memory) cell, however, are still designed by hand to ensure the highest efficiency (sometimes by bending or breaking established design rules to obtain the last bit of performance by trading stability). VLSI technology is moving towards radical miniaturization with the introduction of NEMS technology; a lot of problems need to be sorted out before the transition is actually made.
1.4 Structured design
Structured VLSI design is a modular methodology originated by Carver Mead and Lynn Conway for saving microchip area by minimizing the interconnect fabric area. This is obtained by the repetitive arrangement of rectangular macro blocks which can be interconnected using wiring by abutment. An example is partitioning the layout of an adder into a row of equal bit-slice cells. In complex designs this structuring may be achieved by hierarchical nesting. Reiner Hartenstein coined the term "structured VLSI design" (originally as "structured LSI design") when introducing the hardware description language KARL in the mid-1970s, echoing Edsger Dijkstra's structured programming approach of procedure nesting to avoid chaotic spaghetti-structured programs. Structured VLSI design had been popular in the early 1980s, but lost its popularity later because of the advent of placement and routing tools that waste a lot of area on routing, which is tolerated because of the progress of Moore's Law.

1.4.1 Challenges
As microprocessors become more complex due to technology scaling, microprocessor designers have encountered several challenges which force them to think beyond the design plane and look ahead to post-silicon:

Power usage/Heat dissipation – As threshold voltages have ceased to scale with advancing process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic complexity when scaling the design down only means that the power dissipation per area will go up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to minimize overall power. It has also led to a rising interest in multicore and multiprocessor architectures, since an overall speedup can be obtained by lowering the clock frequency and distributing processing.

Process variation – As photolithography techniques tend closer to the fundamental laws of optics, achieving high accuracy in doping concentrations and etched wires is becoming more difficult and prone to errors due to variation. Designers now must simulate across multiple fabrication process corners before a chip is certified ready for production.

Stricter design rules – Due to lithography and etch issues with scaling, design rules for layout have become increasingly stringent. Designers must keep ever more of these rules in mind while laying out custom circuits. The overhead for custom design is now reaching a tipping point, with many design houses opting to switch to electronic design automation (EDA) tools to automate their design process.

Timing/design closure – As clock frequencies tend to scale up, designers are finding it more difficult to distribute and maintain low clock skew between these high frequency clocks across the entire chip.

First-pass success – As die sizes shrink (due to scaling) and wafer sizes go up (to lower manufacturing costs), the number of dies per wafer increases, and the complexity of making suitable photomasks goes up rapidly. A mask set for a modern technology can cost several million dollars. This non-recurring expense deters the old iterative philosophy involving several "spin-cycles" to find errors in silicon. Several design philosophies have been developed to aid this new design flow and encourage first-pass silicon success, including design for manufacturing (DFM), design for test (DFT), and Design for X.
1.5 VLSI Technology
Gone are the days when huge computers made of vacuum tubes sat humming in entire dedicated rooms and could do about 360 multiplications of 10-digit numbers in a second. Though they were heralded as the fastest computing machines of their time, they surely don't stand a chance when compared to modern day machines. Modern day computers are getting smaller, faster, cheaper and more power efficient every progressing second. But what drove this change? The whole domain of computing ushered into a new dawn of electronic miniaturization with the advent of the semiconductor transistor by Bardeen (1947-48) and then the Bipolar Transistor by Shockley (1949) in the Bell Laboratory. Since the invention of the first IC (Integrated Circuit), in the form of a flip-flop, by Jack Kilby in 1958, our ability to pack more and more transistors onto a single chip has doubled roughly every 18 months, in accordance with Moore's Law. Such exponential development had never been seen in any other field, and it still continues to be a major area of research work.
Fig 1.2: A comparison: First Planar IC (1961) and Intel Nehalem Quad Core Die

1.6 History & Evolution of VLSI Technology
The development of microelectronics spans a time which is even less than the average life expectancy of a human, and yet it has seen as many as four generations. The early 60s saw the low density fabrication processes classified under Small Scale Integration (SSI), in which the transistor count was limited to about 10. This rapidly gave way to Medium Scale Integration (MSI) in the late 60s, when around 100 transistors could be placed on a single chip. It was the time when the cost of research began to decline and private firms started entering the competition, in contrast to the earlier years where the main burden was borne by the military. Transistor-Transistor Logic (TTL), offering higher integration densities, outlasted other IC families like ECL and became the basis of the first integrated circuit revolution. It was the production of this family that gave impetus to semiconductor giants like Texas Instruments, Fairchild and National Semiconductors. The early seventies marked the growth of the transistor count to about 1000 per chip, called Large Scale Integration.
The second age of the integrated circuit revolution started with the introduction of the first microprocessors, the 4004 by Intel in 1972 and the 8080 in 1974. It was during this time that TTL lost the battle to the MOS family, owing to the same problems that had pushed vacuum tubes into negligence: power dissipation and the limit it imposed on the number of gates that could be placed on a single die. By the mid eighties, the transistor count on a single chip had already exceeded 1000, and hence came the age of Very Large Scale Integration, or VLSI. Though many improvements have been made and the transistor count is still rising, further names of generations like ULSI are generally avoided. While simple logic gates might be considered SSI devices, and multiplexers and parity encoders MSI, the world of VLSI is much more diverse. Today many companies like Texas Instruments, Infineon, Alliance Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel, Philips, Motorola and many other firms have been established and are dedicated to the various fields in VLSI, like Programmable Logic Devices, Hardware Descriptive Languages, Design Tools, Embedded Systems etc.

VLSI Design
VLSI chiefly comprises Front End Design and Back End Design these days. While front end design includes digital design using HDLs, design verification through simulation and other verification techniques, design from gates, and design for testability, back end design comprises CMOS library design and its characterization. It also covers the physical design and fault simulation. Generally, the entire design procedure follows a step by step approach in which each design step is followed by simulation before actually being put onto the hardware or moving on to the next step.
The major design steps are different levels of abstraction of the device as a whole:

1. Problem Specification: This is a high level representation of the system. The major parameters considered at this level are performance, functionality, physical dimensions, fabrication technology and design techniques. It has to be a tradeoff between market requirements, the available technology and the economical viability of the design. The end specifications include the size, speed, power and functionality of the VLSI system.

2. Architecture Definition: Basic specifications like floating point units, which system to use (like RISC (Reduced Instruction Set Computer) or CISC (Complex Instruction Set Computer)), number of ALUs, cache size etc.

3. Functional Design: Defines the major functional units of the system and hence facilitates the identification of the interconnect requirements between units, and the physical and electrical specifications of each unit. A sort of block diagram is decided upon, with the number of inputs, outputs and timing decided without any details of the internal structure.

4. Logic Design: The actual logic is developed at this level. Boolean expressions, control flow, word width, register allocation etc. are developed, and the outcome is called a Register Transfer Level (RTL) description. This part is implemented with Hardware Descriptive Languages like VHDL and/or Verilog. Gate minimization techniques are employed to find the simplest, or rather the smallest, most effective implementation of the logic.

5. Circuit Design: While the logic design gives the simplified implementation of the logic, the realization of the circuit in the form of a netlist is done in this step. Gates, transistors and interconnects are put in place to make a netlist. This again is a software step, and the outcome is checked via simulation.

6. Physical Design: The conversion of the netlist into its geometrical representation is done in this step, and the result is called a layout. This step follows some predefined fixed rules, like the lambda rules, which provide the exact details of the size, ratio and spacing between components. Because of the huge number of transistors involved, it is not possible to handle the entire circuit all at once due to limitations on computational capabilities and memory requirements, so this step is further divided into sub-steps:

6.1 Circuit Partitioning: The whole circuit is broken down into blocks which are interconnected.

6.2 Floor Planning and Placement: Choosing the best layout for each block from the partitioning step and for the overall chip, considering the interconnect area between the blocks; the exact positioning on the chip, in order to minimize the area arrangement while meeting the performance constraints through an iterative approach, is the major design task taken care of in this step.
6.3 Routing: The quality of placement becomes evident only after this step is completed. Routing involves the completion of the interconnections between modules. This is completed in two steps. First, connections are completed between blocks without taking into consideration the exact geometric details of each wire and pin. Then, a detailed routing step completes point-to-point connections between pins on the blocks.

6.4 Layout Compaction: The smaller the chip size can get, the better it is. The compression of the layout from all directions to minimize the chip area, thereby reducing wire lengths, signal delays and overall cost, takes place in this design step.

6.5 Extraction and Verification: The circuit is extracted from the layout for comparison with the original netlist. Performance verification and reliability verification, to check the correctness of the layout, are done before the final step of packaging.

7. Packaging: The chips are put together on a Printed Circuit Board or a Multi Chip Module to obtain the final finished product.

Initially, the design can be done with three different methodologies which provide different levels of freedom of customization to the programmers. The design methods, in increasing order of customization support (which also means an increased amount of overhead on the part of the programmer), are FPGAs and PLDs, Standard Cell (Semi Custom), and Full Custom Design. While FPGAs have inbuilt libraries and a board already built with interconnections and blocks already in place, Semi Custom design allows the placement of blocks in a user defined custom fashion with some independence, while most libraries are still available for program development. Full Custom Design adopts a start-from-scratch approach where the programmer is required to write the whole set of libraries and also has full control over the block development. This is also the same sequence from entry level designing to professional designing.

Future of VLSI
Where do we actually see VLSI technology in action? Everywhere: in personal computers, cell phones, digital cameras and any electronic gadget. VLSI has come a far distance from the time when the chips were truly hand crafted. But as we near the limit of miniaturization of silicon wafers, design issues have cropped up. VLSI is dominated by the CMOS technology and, much like other logic families, this too has its limitations, which have been battled and improved upon for years. There are certain key issues that serve as active areas of research and are constantly improving as the field continues to mature:
As the number of transistors increases, the power dissipation is increasing and so is the noise. If heat generated per unit area is to be considered, the chips have already neared the level of the nozzle of a jet engine. At the same time, the voltage scaling of threshold voltages beyond a certain point poses serious limitations in providing low dynamic power dissipation with increased complexity. This has opened up a new frontier on parallel processing. High speed clocks used now make it hard to reduce clock skew and hence impose timing constraints. We are also soon approaching the optical limit of photolithographic processes, beyond which the feature size cannot be reduced due to decreased accuracy; this opened up Extreme Ultraviolet Lithography techniques. Likewise, we seem to be fast approaching the atom-thin gate oxide layer thickness, where there might be only a single layer of atoms serving as the oxide layer in the CMOS transistors. New alternatives like Gallium Arsenide technology are becoming an active area of research owing to this.
The number of metal layers and the interconnects, be they global or local, also tend to get messy at such nano levels. Even on the fabrication front, the process technology has rapidly shrunk from 180 nm in 1999 to 60 nm in 2008, and now it stands at 45 nm, with attempts being made to reduce it further (32 nm), while the die area, which had shrunk initially, is now increasing owing to the added benefits of greater packing density and a larger feature size, which would mean more transistors on a chip. Taking the example of a processor, the figures would easily show how Gordon Moore proved to be a visionary; the trend predicted by his law still continues to hold with little deviation and does not show any signs of stopping in the near future.
Fig 1.3: Future of VLSI

Chapter-2 INTRODUCTION TO ADDERS

2.1 Motivation
To humans, decimal numbers are easy to comprehend and implement for performing arithmetic. However, in digital systems, such as a microprocessor, DSP (Digital Signal Processor) or ASIC (Application-Specific Integrated Circuit), binary numbers are more pragmatic for a given computation. This occurs because binary values are optimally efficient at representing many values. Binary adders are one of the most essential logic elements within a digital system. In addition, binary adders are also helpful in units other than Arithmetic Logic Units (ALU), such as multipliers, dividers and memory addressing. Therefore, binary addition is essential: any improvement in binary addition can result in a performance boost for any computing system and, hence, help improve the performance of the entire system. Because adders tend to set the critical path for most computations, most digital designers often resort to building faster adders when optimizing a computer architecture.
The major problem for binary addition is the carry chain. As the width of the input operand increases, the length of the carry chain increases. Figure 2.1 demonstrates an example of an 8-bit binary add operation and how the carry chain is affected. This example shows that the worst case occurs when the carry travels the longest possible path, from the least significant bit (LSB) to the most significant bit (MSB). In order to improve the performance of carry-propagate adders, it is possible to accelerate the carry chain, but not eliminate it. Consequently, the binary adder is the critical element in most digital circuit designs, including digital signal processors (DSP) and microprocessor data path units. As such, extensive research continues to be focused on improving the power-delay performance of the adder. In VLSI implementations, parallel-prefix adders are known to have the best performance.
Figure 2.1: Binary Adder Example.
Reconfigurable logic such as Field Programmable Gate Arrays (FPGAs) has been gaining in popularity in recent years because it offers improved performance in terms of speed and power over DSP-based and microprocessor-based solutions for many practical designs involving mobile DSP and telecommunications applications, and a significant reduction in development time and cost over Application Specific Integrated Circuit (ASIC) designs. The power advantage is especially important with the growing popularity of mobile and portable electronics, which make extensive use of DSP functions. In this paper, the practical issues involved in designing and implementing carry select adders on FPGAs are described. Several carry select adder structures are implemented and characterized on an FPGA and compared with the CSLA with Ripple Carry Adder (RCA) and the CSLA with Binary to Excess-1 Converter. Finally, some conclusions and suggestions for improving FPGA designs to enable better carry select adder performance are given.
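Returning to the carry chain of Figure 2.1, the worst-case pattern is easy to reproduce in simulation. The following is a minimal Verilog sketch (module and signal names are ours, not from the text) that applies an operand pair for which the LSB generates a carry and every higher bit merely propagates it, so the carry must ripple across the full width:

module carry_chain_demo;
  reg  [7:0] a, b;
  wire [8:0] sum;                  // bit 8 captures the carry-out
  assign sum = a + b;
  initial begin
    a = 8'b1111_1111;              // bits 1..7: Ai XOR Bi = 1 (propagate)
    b = 8'b0000_0001;              // bit 0 generates the carry
    #1 $display("sum = %b, carry-out = %b", sum[7:0], sum[8]);
  end
endmodule

Any adder implementation must wait for this longest carry path before the MSB of the sum and the carry-out settle, which is why the worst-case delay grows with the operand width.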
However, because of the structure of the configurable logic and routing resources in FPGAs, parallel-prefix adders will have a different performance than VLSI implementations. In particular, most modern FPGAs employ a fast-carry chain which optimizes the carry path for the simple Ripple Carry Adder (RCA). In this paper, the practical issues involved in designing and implementing tree-based adders on FPGAs are described. Several tree-based adder structures are implemented and characterized on an FPGA and compared with the Ripple Carry Adder (RCA) and the Carry Skip Adder (CSA). Finally, some conclusions and suggestions for improving FPGA designs to enable better tree-based adder performance are given.

2.2 Carry-Propagate Adders
Binary carry-propagate adders have been extensively published, heavily attacking problems related to the carry chain. Binary adders evolve from linear adders, which have a delay approximately proportional to the width of the adder, to logarithmic-delay adders, e.g., the carry-lookahead adder (CLA). The most efficient way of offering binary addition involves utilizing parallel-prefix trees; this occurs because they have regular structures that exhibit logarithmic delay. There are some additional performance enhancing schemes, such as the carry-increment adder and the Ling adder, that can further enhance the carry chain.

2.3 Research Contributions
The implementations that have been developed in this dissertation help to improve the design of carry select adders and their associated computing architectures. This has the potential of impacting many application specific and general purpose computer architectures. Consequently, this work can impact the designs of many computing systems, as well as many areas of engineering and science.

Chapter-3 BINARY ADDER SCHEMES

Adders are one of the most essential components in digital building blocks; however, the performance of adders becomes more critical as the technology advances. The problem of addition involves algorithms in Boolean algebra and their respective circuit implementations. Algorithmically, there are linear-delay adders like ripple-carry adders (RCA), which are the most straightforward but slowest. Adders like carry-skip adders (CSKA), carry-select adders (CSLA) and carry-increment adders (CINA) are linear-based adders with optimized carry chains that improve upon the linear chain within a ripple-carry adder. Carry-lookahead adders (CLA) have logarithmic delay and currently have evolved to parallel-prefix structures. Other schemes, like Ling adders, NAND/NOR adders and carry-save adders, can help improve performance as well.
This chapter gives background information on architectures of adder algorithms. In the following sections, the adders are characterized with a linear gate model, which is a rough estimation of the complexity of a real implementation. Although this evaluation method can be misleading for VLSI implementers, such estimation can provide sufficient insight to understand the design trade-offs of adder algorithms.

3.1 Binary Adder Notations and Operations
As mentioned previously, adders in VLSI digital systems use binary notation. In that case, the add is done bit by bit using Boolean equations. For add-related operations, AND, OR and Exclusive-OR (XOR) are required. In the following documentation, a dot between two variables (each a single bit), e.g. a · b, denotes 'a AND b'; a + b denotes 'a OR b'; and a ⊕ b denotes 'a XOR b'.
Consider a simple binary add with two n-bit inputs A, B and a one-bit carry-in cin along with an n-bit output S:
S = A + B + cin
where A = an−1 an−2 ... a0 and B = bn−1 bn−2 ... b0. The + in the above equation is the regular add operation. However, in the binary world, only Boolean algebra works. Considering the situation of adding two bits, the sum s and carry c can be expressed using the Boolean operations mentioned above:
si = ai ⊕ bi
ci+1 = ai · bi
The block computing these, which takes only 2 input bits, is a half adder.
Figure 3.1: 1-bit Half Adder.
The equation of ci+1 can be extended to perform a full add operation, where there is a carry input:
si = ai ⊕ bi ⊕ ci
ci+1 = ai · bi + ai · ci + bi · ci
A full adder can be built based on the equations above. The block diagram of a 1-bit full adder is shown in Figure 3.2. The full adder is composed of 2 half adders and an OR gate for computing the carry-out. In the figure, the solid line highlights the critical path, which indicates the longest path from the input to the output.
Figure 3.2: 1-bit Full Adder.
To help the computation of the carry for each bit, two binary literals are introduced, denoted by gi and pi. They are called carry generate and carry propagate. Another literal, called the temporary sum ti, is employed as well. The following relations hold between the inputs and these literals:
gi = ai · bi
pi = ai + bi
ti = ai ⊕ bi
where i is an integer and 0 ≤ i < n. With the help of the literals above, the output carry and sum at each bit can be written as
ci+1 = gi + pi · ci
si = ti ⊕ ci
In some literature, the carry-propagate pi can be replaced with the temporary sum ti in order to save the number of logic gates; for example, for Ling adders, only pi is used as carry-propagate. Here these two terms are separated in order to clarify the concepts. Using Boolean algebra, the equivalence can be easily proven.
The single-bit carry generate/propagate can be extended to the group version G and P. The following equations show the inherent relations:
Gi:k = Gi:j + Pi:j · Gj−1:k
Pi:k = Pi:j · Pj−1:k
where i : k denotes the group term from i through k. Using the group carry generate/propagate, the carry can be expressed as
ci+1 = Gi:j + Pi:j · cj
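The notation above maps directly to HDL. The following is a minimal Verilog sketch of the 1-bit full adder of Figure 3.2 that also exposes the gi, pi and ti literals (the port names are illustrative, not taken from the figure):

module full_adder (
  input  wire a, b, cin,
  output wire sum, cout,
  output wire g, p, t          // carry-generate, carry-propagate, temporary sum
);
  assign g    = a & b;             // gi = ai . bi
  assign p    = a | b;             // pi = ai + bi
  assign t    = a ^ b;             // ti = ai XOR bi
  assign sum  = t ^ cin;           // si = ti XOR ci
  assign cout = g | (p & cin);     // ci+1 = gi + pi . ci
endmodule

Cascading n of these, with each cout feeding the next cin, yields exactly the ripple-carry adder described next.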
3.2 Ripple-Carry Adders (RCA)
The simplest way of doing binary addition is to connect the carry-out from the previous bit to the next bit's carry-in. This type of adder is built by cascading 1-bit full adders. A 4-bit ripple-carry adder is shown in Figure 3.3. Each trapezoidal symbol represents a single-bit full adder. At the top of the figure, the carry is rippled through the adder from cin to cout. Each bit takes the carry-in as one of its inputs and outputs a sum and a carry-out bit, hence the name ripple-carry adder.
Figure 3.3: Ripple-Carry Adder.
It can be observed in Figure 3.3 that the critical path, highlighted with a solid line, is from the least significant bit (LSB) of the input (a0 or b0) to the most significant bit (MSB) of the sum (sn−1). Assume each simple gate, including AND, OR and XOR, has a delay of 2Δ, a NOT gate has a delay of 1Δ, and all gates have an area of 1 unit. Using this analysis and assuming that each add block is built with a 9-gate full adder, the critical path is calculated as follows:
ai, bi → ci+1 : 9Δ
ci → si : 5Δ
ci → ci+1 : 4Δ
The critical path, or the worst delay, is
trca = {9 + (n − 2) × 4 + 5}Δ = {4n + 6}Δ
As each bit takes 9 gates, the area is simply 9n for an n-bit RCA.

3.3 Carry-Select Adders (CSLA)
Simple adders, like ripple-carry adders, are slow, since the carry has to travel through every full adder block. There is a way to improve the speed by duplicating the hardware, due to the fact that the carry can only be either 0 or 1. The method is based on the conditional sum adder and extended to a carry-select adder. With two RCAs, each computing the case of one polarity of the carry-in, the sum can be obtained with a 2x1 multiplexer with the carry-in as the select signal. An example of a 16-bit carry-select adder is shown in Figure 3.4: the adder is grouped into four 4-bit blocks, and the two carry terms are computed such that the carry input is given as a constant 1 or 0.
Figure 3.4: Carry-Select Adder.
In the figure, each pair of adjacent 4-bit blocks utilizes the carry relationship
ci+4 = c0i+4 + c1i+4 · ci
The relationship can be verified with properties of the group carry generate/propagate. c0i+4 can be written as
c0i+4 = Gi+4:i + Pi+4:i · 0 = Gi+4:i
Similarly, c1i+4 can be written as
c1i+4 = Gi+4:i + Pi+4:i · 1 = Gi+4:i + Pi+4:i
Then
c0i+4 + c1i+4 · ci = Gi+4:i + (Gi+4:i + Pi+4:i) · ci = Gi+4:i + Gi+4:i · ci + Pi+4:i · ci = Gi+4:i + Pi+4:i · ci = ci+4
The 1-bit multiplexers for sum selection can be implemented as in Figure 3.5. The temporary sums can be defined as follows:
s0i+1 = ti+1 ⊕ c0i
s1i+1 = ti+1 ⊕ c1i
The final sum is selected by the carry-in from the temporary sums already calculated:
si+1 = cj′ · s0i+1 + cj · s1i+1
where cj′ denotes the complement of cj.
Figure 3.5: 2-1 Multiplexor.
Assuming the block size is fixed at r bits, the n-bit adder is composed of k groups of r-bit blocks, i.e., n = r × k. The critical path through the first RCA has a delay of (4r + 5)Δ from the input to the carry-out, and there are k − 2 blocks that follow, each with a delay of 4Δ for the carry to go through. The final delay comes from the multiplexer, which has a delay of 5Δ. The total delay for this CSEA is calculated as
tcsea = {4r + 5 + 4(k − 2) + 5}Δ = {4r + 4k + 2}Δ
The area can be estimated with (2n − r) FAs, (n − r) multiplexers and (k − 1) AND/OR logic gates. As mentioned above, each FA has an area of 9 units, and each sum-selection multiplexer (a NOT, two ANDs and an OR) takes 4 units of area. The total area can be estimated as 9(2n − r) + 4(n − r) + 2(k − 1) = 22n − 13r + 2k − 2.
The delay of the critical path in the CSEA is reduced at the cost of increased area. For example, in Figure 3.4, r = 4 and n = 16 (k = 4). The delay for the CSEA is 34Δ, compared to 70Δ for the 16-bit RCA, i.e., about half of the RCA. The area for the CSEA is 310 units, while the RCA has an area of 144 units; the CSEA has an area more than twice that of the RCA. Varying the number of bits in each group can work as well for carry-select adders: each adder can also be modified to have variable block sizes, which gives better delay and slightly less area.
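A single block of the carry-select scheme can be sketched in Verilog as below, following the conditional-sum idea of Figure 3.4: two r-bit ripple adders evaluate both carry-in polarities, and a multiplexer picks the correct result when the true carry arrives. The module and parameter names are illustrative:

module csla_block #(parameter R = 4) (
  input  wire [R-1:0] a, b,
  input  wire         cin,        // late-arriving carry selects the result
  output wire [R-1:0] sum,
  output wire         cout
);
  wire [R:0] s0 = {1'b0, a} + {1'b0, b};           // assumes carry-in = 0
  wire [R:0] s1 = {1'b0, a} + {1'b0, b} + 1'b1;    // assumes carry-in = 1
  assign {cout, sum} = cin ? s1 : s0;              // 2:1 selection by cin
endmodule

Chaining k such blocks reproduces the 16-bit CSEA of Figure 3.4 for R = 4 and k = 4; the duplicated adders are the source of the area increase computed above.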
3.4 Carry-Skip Adders (CSKA)
There is an alternative way of reducing the delay in the carry chain of an RCA: checking if a carry will propagate through to the next block. This is called a carry-skip adder:
ci+1 = Pi:j′ · Gi:j + Pi:j · cj
When Pi:j = 1, the carry-in cj is allowed to get through the block immediately. Otherwise, the carry-out is determined by Gi:j. The carry-out of each block is thus determined by selecting between the carry-in and Gi:j using Pi:j. The CSKA has less delay in the carry chain with only a little additional logic. Figure 3.6 shows an example of a 16-bit carry-skip adder, in which each block is fixed at 4 bits.
Figure 3.6: Carry-Skip Adder.
Assuming the n-bit adder is divided evenly into k r-bit blocks, part of the critical path is from the LSB input through the MSB output of the final RCA. The first delay, from the LSB input to the carry-out of the first block, is 4r + 5. There are k − 2 skip logic blocks, each with a delay of 3Δ. Each skip logic block includes one 4-input AND gate for getting Pi+3:i and one AND/OR logic. The final RCA has a delay from its carry input to the sum at the MSB, which is 4r + 6. The total delay is calculated as
tcska = {4r + 5 + 3(k − 2) + 4r + 6}Δ = {8r + 3k + 5}Δ
The CSKA has n FAs and k − 2 skip logic blocks. Each skip logic block has an area of 3 units. Therefore, the total area is estimated as 9n + 3(k − 2) = 9n + 3k − 6. Further improvement can generally be achieved by making the central block sizes larger and the two end block sizes smaller.

3.5 Carry-Look-ahead Adders (CLA)
The carry chain can also be accelerated with carry generate/propagate logic. Carry-lookahead adders employ the carry generate/propagate in groups to generate the carry for the next block. In other words, digital logic is used to calculate all the carries at once. When building a CLA, a reduced version of the full adder, called a reduced full adder (RFA), is utilized. Figure 3.7 shows the block diagram of an RFA. The carry generate/propagate signals gi/pi feed the carry-lookahead generator (CLG), which produces the carry inputs to the RFAs.
Figure 3.7: Reduced Full Adder.
The theory of the CLA is based on the following equations, where BCLG stands for Block Carry Lookahead Generator, which generates the generate/propagate signals in group form. For the 4-bit BCLG:
Gi+3:i = gi+3 + pi+3 · gi+2 + pi+3 · pi+2 · gi+1 + pi+3 · pi+2 · pi+1 · gi
Pi+3:i = pi+3 · pi+2 · pi+1 · pi
Based on the group size, the carry-out of the group can then be computed:
ci+4 = Gi+3:i + Pi+3:i · ci
Figure 3.8 shows an example of a 16-bit carry-lookahead adder. The critical path traverses logarithmically: it requires ⌈logr n⌉ logic levels. The critical path of the 16-bit CLA can be observed from the input operand through 1 RFA, then 3 BCLGs, and through the final RFA; that is, the critical path shown in Figure 3.8 is from a0/b0 to s7. The delay will be the same from a0/b0 to s11 or s15. The generation of (p, g) takes a delay of 2Δ. The group version of (P, G) generated by the BCLG has a delay of 4Δ, which is an OR after an AND. The carry computation also has a delay of 4Δ, which is likewise an OR after an AND. Finally, from the block carry input to the sum at the MSB of the block, there is a delay of 5Δ. The delays are listed below:
a0, b0 → p0, g0 : 2Δ
p0, g0 → G3:0 : 4Δ
G3:0 → c4 : 4Δ
c4 → c7 : 4Δ
c7 → s7 : 5Δ
a0, b0 → s7 : 19Δ
Figure 3.8: Carry-Lookahead Adder.
The 16-bit CLA is composed of 16 RFAs and 5 BCLGs. Each RFA has an area of 8 units and the 4-bit BCLG has an area of 14 units, which amounts to an area of 16 × 8 + 5 × 14 = 198 units.
Extending the calculation above, the general estimation for delay and area can be derived. Assume the CLA has n bits, divided into k groups of r-bit blocks. The critical path starts from the input to the p0/g0 generation (2Δ); from each BCLG level there is a 4Δ delay for the group generation and 4Δ from the BCLG to the next level, which totals 8Δ per level over ⌈logr n⌉ levels. Thus, the total delay is calculated as
tcla = {2 + 8(⌈logr n⌉ − 1) + 4 + 5}Δ = {3 + 8⌈logr n⌉}Δ
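The BCLG equations above translate directly into gate-level Verilog. The sketch below implements the 4-bit block carry-lookahead generator (names are illustrative):

module bclg4 (
  input  wire [3:0] g, p,      // per-bit generate/propagate from the RFAs
  input  wire       cin,
  output wire [3:1] c,         // carries into bit positions 1..3
  output wire       G, P       // group generate/propagate for the next level
);
  assign c[1] = g[0] | (p[0] & cin);
  assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin);
  assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
              | (p[2] & p[1] & p[0] & cin);
  assign G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
           | (p[3] & p[2] & p[1] & g[0]);            // Gi+3:i
  assign P = p[3] & p[2] & p[1] & p[0];              // Pi+3:i
endmodule

Five of these, arranged in two levels together with 16 RFAs, form the 16-bit CLA of Figure 3.8.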
Chapter-4 Carry Skip Adder

A carry-skip adder (also known as a carry-bypass adder) is an adder implementation that improves on the delay of a ripple-carry adder with little effort compared to other adders. The improvement of the worst-case delay is achieved by using several carry-skip adders to form a block-carry-skip adder.
The worst case for a simple one-level carry-ripple adder occurs when the propagate condition [1] is true for each digit pair. Then the carry-in ripples through the n-bit adder and appears as the carry-out only after the full n-bit ripple delay.
Full adder with additional generate and propagate signals.
The n-bit carry-skip adder consists of an n-bit carry-ripple chain, an n-input AND gate and one multiplexer. For each operand input bit pair, the propagate condition is determined using an XOR gate. Each propagate bit provided by the carry-ripple chain is connected to the n-input AND gate. The resulting bit is used as the select bit of a multiplexer that switches either the last carry bit or the carry-in to the carry-out signal: when all propagate conditions are true, the carry-in bit determines the carry-out bit. This greatly reduces the latency of the adder through its critical path, since the carry bit for each block can now "skip" over blocks with a group propagate signal set to logic 1 (as opposed to a long ripple-carry chain, which would require the carry to ripple through each bit in the adder). The number of inputs of the AND gate is equal to the width of the adder. For a large width, this becomes impractical and leads to additional delays, because the AND gate has to be built as a tree. A good width is achieved when the sum logic has the same depth as the n-input AND gate and the multiplexer.
4-bit carry-skip adder.
The critical path of a carry-skip adder begins at the first full adder, passes through all adders and ends at the sum bit of the most significant position. Carry-skip adders are chained (see block-carry-skip adders) to reduce the overall critical path, since a single n-bit carry-skip adder has no real speed benefit compared to an n-bit carry-ripple adder.

Block-carry-skip adders: 16-bit fixed-block-carry-skip adder with a block size of 4 bits.
Block-carry-skip adders are composed of a number of carry-skip adders. The two operands are split into blocks of several bits each, and the skip logic of each block consists of an AND gate over the block's propagate bits and one multiplexer. As the propagate signals are computed in parallel and are available early, the critical path for the skip logic in a carry-skip adder consists only of the delay imposed by the multiplexer (conditional skip). Why are block-carry-skip adders used, and should the block size be constant or variable? There are two types of block-carry-skip adders: fixed block width and variable block width.

Fixed size block-carry-skip adders: Fixed size block-carry-skip adders split the n input bits into blocks of m bits each, resulting in k = n/m blocks. The critical path consists of the ripple path and the skip element of the first block, the skip paths that are enclosed between the first and the last block, and finally the ripple path of the last block. The optimal block size for a given adder width n is derived by equating the derivative of this critical-path delay with respect to the block size to 0; only positive block sizes are realizable. A sketch of this optimization is given after the next paragraph.

Variable size block-carry-skip adders: The performance can be improved, i.e., all carries propagated more quickly, by varying the block sizes. Accordingly, the initial blocks of the adder are made smaller, so as to quickly detect carry generates that must be propagated the furthest; the middle blocks are made larger, because they are not the problem case; and then the most significant blocks are again made smaller, so that the late-arriving carry inputs can be processed quickly.
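The fixed-block optimization mentioned above can be made concrete with a small delay model. The following LaTeX sketch uses assumed notation (per-bit ripple delay t_r, per-block skip delay t_s, block size m; none of these symbols come from the text):

\[
T(m) \approx \underbrace{m\,t_r}_{\text{first block}}
  + \underbrace{\Big(\tfrac{n}{m}-2\Big)\,t_s}_{\text{skip path}}
  + \underbrace{m\,t_r}_{\text{last block}},
\qquad
\frac{dT}{dm} = 2\,t_r - \frac{n\,t_s}{m^2} = 0
\;\Rightarrow\;
m_{\mathrm{opt}} = \sqrt{\frac{n\,t_s}{2\,t_r}}.
\]

For t_s comparable to t_r this gives the familiar m_opt on the order of sqrt(n/2), and only the positive root is physically realizable.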
Multilevel carry-skip adders: By using additional skip blocks in an additional layer, the block-propagate signals are further summarized and used to perform larger skips, thus making the adder even faster.

Carry-skip optimization: The problem of determining the block sizes and number of levels required to make the physically fastest carry-skip adder is known as the 'carry-skip adder optimization problem'. This problem is made complex by the fact that carry-skip adders are implemented with physical devices whose size and other parameters also affect addition time. The carry-skip optimization problem for variable block sizes and multiple levels for an arbitrary device process node was solved by Thomas W. Lynch in [2]. This reference also shows that carry-skip addition is the same as parallel prefix addition and is thus related to, and for some configurations identical to, the Han-Carlson, Brent-Kung, Kogge-Stone and a number of other adder types.

Implementation overview: Breaking this down into more specific terms, in order to build a 4-bit carry-bypass adder, 6 full adders would be needed. The input buses would be a 4-bit A and a 4-bit B, with a carry-in (CIN) signal. The output would be a 4-bit bus X and a carry-out signal (COUT). The first two full adders would add the first two bits together. The carry-out signal from the second full adder would drive the select signal of three 2-to-1 multiplexers. The second set of 2 full adders would add the last two bits, assuming the incoming carry is a logical 0, and the final set of full adders would assume that it is a logical 1. The multiplexers then control which output signals are used for COUT and the upper sum bits.
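The 4-bit adder walked through above can be written as the following Verilog sketch. The internal names (c1, c2, hi0, hi1, and so on) are ours; the structure is the one described: six full adders plus three 2-to-1 multiplexers driven by the carry-out of the second full adder:

module add4_bypass (
  input  wire [3:0] a, b,
  input  wire       cin,
  output wire [3:0] x,
  output wire       cout
);
  wire       c1, c2;           // ripple carries of the lower pair
  wire [1:0] hi0, hi1;         // upper sums for assumed carry 0 / 1
  wire       ch0, ch1;         // upper carry-outs for assumed carry 0 / 1

  // first two full adders compute the low bits with the real carry
  assign {c1, x[0]} = a[0] + b[0] + cin;
  assign {c2, x[1]} = a[1] + b[1] + c1;
  // second pair of full adders: assume the carry into bit 2 is 0
  assign {ch0, hi0} = a[3:2] + b[3:2];
  // final pair of full adders: assume the carry into bit 2 is 1
  assign {ch1, hi1} = a[3:2] + b[3:2] + 2'b01;
  // three 2:1 multiplexers select the upper sums and the carry-out
  assign x[3:2] = c2 ? hi1 : hi0;
  assign cout   = c2 ? ch1 : ch0;
endmodule

Strictly speaking, this duplicated-upper-half arrangement is the carry-select flavor of bypassing; a pure skip implementation would instead gate the carry with the group propagate signal, as described earlier in this chapter.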
Chapter-5 Proposed Carry Skip Adder

I. INTRODUCTION
ADDERS are a key building block in arithmetic and logic units (ALUs) [1], and hence increasing their speed and reducing their power/energy consumption strongly affect the speed and power consumption of processors. There are many works on the subject of optimizing the speed and power of these units, which have been reported in [2]–[9]. Obviously, it is highly desirable to achieve higher speeds at low power/energy consumption, which is a challenge for the designers of general purpose processors.
One of the effective techniques to lower the power consumption of digital circuits is to reduce the supply voltage, due to the quadratic dependence of the switching energy on the voltage. Moreover, the subthreshold current, which is the main leakage component in OFF devices, has an exponential dependence on the supply voltage level through the drain-induced barrier lowering effect [10]. Depending on the amount of the supply voltage reduction, the operation of ON devices may reside in the superthreshold, near-threshold, or subthreshold regions. Working in the superthreshold region provides us with lower delay and higher switching and leakage powers compared with the near/subthreshold regions. In the subthreshold region, the logic gate delay and leakage power exhibit exponential dependences on the supply and threshold voltages. Moreover, these voltages are (potentially) subject to process and environmental variations in the nanoscale technologies. The variations increase uncertainties in the aforesaid performance parameters. In addition, the small subthreshold current causes a large delay for the circuits operating in the subthreshold region [10].
Recently, the near-threshold region has been considered as a region that provides a more desirable tradeoff point between delay and power dissipation compared with that of the subthreshold one, because it results in lower delay compared with the subthreshold region and significantly lowers switching and leakage powers compared with the superthreshold region. In addition, near-threshold operation, which uses supply voltage levels near the threshold voltage of transistors [11], suffers considerably less from process and environmental variations compared with the subthreshold region.
The dependence of the power (and performance) on the supply voltage has been the motivation for the design of circuits with the feature of dynamic voltage and frequency scaling. In these circuits, to reduce the energy consumption, the system may change the voltage (and frequency) of the circuit based on the workload requirement [12]. For these systems, the circuit should be able to operate under a wide range of supply voltage levels. Of course, achieving higher speeds at lower supply voltages for the computational blocks, with the adder as one of the main components, could be crucial in the design of high-speed, yet energy efficient, processors.
In addition to the knob of the supply voltage, one may choose between different adder structures/families for optimizing power and speed. There are many adder families with different delays, power consumptions, and area usages. Examples include the ripple carry adder (RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and parallel prefix adders (PPAs). The descriptions of each of these adder architectures along with their characteristics may be found in [1] and [13]. The RCA has the simplest structure, with the smallest area and power consumption, but with the worst critical path delay. In the CSLA, the speed, power consumption, and area usage are considerably larger than those of the RCA. The PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to generate the carry as fast as possible [14]. There are different types of parallel prefix algorithms that lead to different PPA structures with different performances. As an example, the Kogge–Stone adder (KSA) [15] is one of the fastest structures, but results in large power consumption and area usage. It should be noted that the structure complexities of PPAs are higher than those of other adder schemes [13], [16].
The CSKA, which is an efficient adder in terms of power consumption and area usage, was introduced in [17]. The critical path delay of the CSKA is much smaller than the one in the RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures [19]. Furthermore, due to the small number of transistors, the CSKA benefits from relatively short wiring lengths as well as a regular and simple layout [18]. The comparatively lower speed of this adder structure, however, limits its use for high-speed applications.
In this paper, given the attractive features of the CSKA structure, we have focused on reducing its delay by modifying its implementation based on the static CMOS logic. The concentration on the static CMOS originates from the desire to have a reliably operating circuit under a wide range of supply voltages in highly scaled technologies [10]. The modification provides us with the ability to use simpler carry skip logics based on the AOI/OAI compound gates instead of the multiplexer. The proposed modification increases the speed considerably while maintaining the low area and power consumption features of the CSKA. In addition, an adjustment of the structure, based on the variable latency technique, which in turn lowers the power consumption without considerably impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, and also on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.
1) Proposing a modified CSKA structure by combining the concatenation and the incrementation schemes with the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder.
2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.
3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA structure (from the nominal supply voltage to the near-threshold voltage).
4) Proposing a hybrid variable latency CSKA structure, based on the extension of the suggested CSKA, by replacing some of the middle stages in its structure with a PPA, which is modified in this paper.

II. PRIOR WORK
Since the focus of this paper is on the CSKA structure, first the related work on this adder is reviewed, and then the variable latency adder structures are discussed.

A. Modifying CSKAs for Improving Speed
The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (RCA block) and a 2:1 multiplexer (carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed in one or more level structures [19]. The CSKA configuration (i.e., the number of FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of FAs [18]–[26]. The techniques presented in [19]–[24] make use of VSSs to minimize the delay of adders based on a single-level carry skip logic. Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique, where the near-optimal numbers of FAs are determined based on the skip time (delay of the multiplexer) and the ripple time (the time required by a carry to ripple through an FA). The goal of this method is to decrease the critical path delay by considering a noninteger ratio of the skip time to the ripple time, contrary to most of the previous works, which considered an integer ratio [17], [20].
The design of a static CMOS CSKA where the stages of the CSKA have variable sizes was suggested in [18]. In addition, to lower the propagation delay of the adder, carry look-ahead logics were utilized in each stage. Again, it had a complex layout as well as large power consumption and area usage. In addition, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths. In [25], some methods to increase the speed of multilevel CSKAs are proposed. The techniques, however, increase the area and power considerably and lead to a less regular layout.
In all of the works reviewed so far, the focus was on the speed, while the power consumption and area usage of the CSKAs were not considered. Even for the speed, the delay of the skip logics, which are based on multiplexers and form a large part of the adder critical path delay [19], has not been reduced.
B. Improving Efficiency of Adders at Low Supply Voltages
To improve the performance of the adder structures at low supply voltage levels, some methods have been proposed in [27]–[36]. In [27]–[29], an adaptive clock stretching operation has been suggested. The method is based on the observation that the critical paths in adder units are rarely activated. Therefore, the slack time between the critical paths and the off-critical paths may be used to reduce the supply voltage. Notice that the voltage reduction must not increase the delays of the noncritical timing paths to become larger than the period of the clock, allowing us to keep the original clock frequency at a reduced supply voltage level. When the critical timing paths in the adder are activated, the structure uses two clock cycles to complete the operation. This way the power consumption reduces considerably at the cost of a rather small throughput degradation. In [27], the efficiency of this method for reducing the power consumption of the RCA structure has been demonstrated. The CSLA structure in [28] was enhanced to use the adaptive clock stretching operation, where the enhanced structure was called cascade CSLA (C2SLA). Compared with the common CSLA structure, C2SLA uses more, and different sizes of, RCA blocks. Finally, using a hybrid structure to improve the effectiveness of the adaptive clock stretching operation has been investigated in [31] and [33]. In the proposed hybrid structure, the KSA has been used in the middle part of the C2SLA, where this combination leads to an increase in the positive slack time. Since the slack time between the critical timing paths and the longest off-critical path was small, the power reduction was limited. Moreover, due to the logic duplication in this type of adders, the power consumption and also the PDP are still high even at low supply voltages [33]. Therefore, the C2SLA and its hybrid version are not good candidates for low-power ALUs.

III. CONVENTIONAL CARRY SKIP ADDER
The structure of an N-bit Conv-CSKA, which is based on blocks of the RCA (RCA blocks), is shown in Fig. 1.
Fig. 1. Conventional structure of the CSKA [19].
For an RCA that contains N cascaded FAs, the worst propagation delay of the summation of two N-bit numbers, A and B, belongs to the case where all the FAs are in the propagation mode. It means that the worst case delay belongs to the case where Pi = Ai ⊕ Bi = 1 for i = 1, ..., N, where Pi is the propagation signal related to Ai and Bi. This shows that the delay of the RCA is linearly related to N [1]. In the case where a group of cascaded FAs are in the propagate mode, the carry output of the chain is equal to the carry input. In the CSKA, the carry skip logic detects this situation and makes the carry ready for the next stage without waiting for the operation of the FA chain to be completed. To this end, the N FAs of the CSKA are grouped in Q stages. Each stage contains an RCA block with Mj FAs (j = 1, ..., Q) and a skip logic. In each stage, the inputs of the multiplexer (skip logic) are the carry input of the stage and the carry output of its RCA block (FA chain), and the product of the propagation signals (P) of the stage is used as the selector signal of the multiplexer. The CSKA may be implemented using FSS and VSS, where the highest speed may be obtained for the VSS structure [19], [22]. In Sections III-A and III-B, these two different implementations of the CSKA adder are described in more detail.
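Before detailing the two styles, the conventional structure of Fig. 1 can be summarized in a behavioral Verilog sketch. This is our illustrative model (names and parameter values are not from the paper), with an M-bit RCA block per stage and a multiplexer-based skip logic selected by the AND of the stage's propagate signals:

module conv_cska #(parameter N = 16, parameter M = 4) (
  input  wire [N-1:0] a, b,
  input  wire         cin,
  output wire [N-1:0] sum,
  output wire         cout
);
  localparam Q = N / M;                 // number of stages (assumed integer)
  wire [Q:0] c;                         // stage-boundary carries
  assign c[0] = cin;
  genvar j;
  generate
    for (j = 0; j < Q; j = j + 1) begin : stage
      // RCA block of the stage (behavioral)
      wire [M:0] s = {1'b0, a[j*M +: M]} + {1'b0, b[j*M +: M]} + c[j];
      // product of the stage's propagate signals
      wire       p = &(a[j*M +: M] ^ b[j*M +: M]);
      assign sum[j*M +: M] = s[M-1:0];
      // 2:1 skip multiplexer: bypass the block when it only propagates
      assign c[j+1] = p ? c[j] : s[M];
    end
  endgenerate
  assign cout = c[Q];
endmodule

Functionally, the multiplexer choice is redundant (when p = 1 the block's carry-out equals its carry-in anyway); its purpose is purely one of timing, letting the carry avoid the FA chain on the critical path.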
A. Fixed Stage Size CSKA
By assuming that each stage of the CSKA contains M FAs, there are Q = N/M stages where, for the sake of simplicity, we assume Q is an integer. Here, the stage size is the same as the RCA block size. In addition to the chain of FAs in each stage, there is a carry skip logic; the product of the propagation signals (P) of the stage is used as the selector signal of the multiplexer. The input signals of the jth multiplexer are the carry output of the FA chain in the jth stage, denoted by C0j, and the carry output of the previous stage (the carry input of the jth stage), denoted by C1j (Fig. 1). The skip operation is performed using the gates and the multiplexer shown in the figure. Based on this explanation, the critical path of the CSKA contains three parts: 1) the path of the FA chain of the first stage, whose delay is equal to M × TCARRY; 2) the path of the intermediate carry skip multiplexers, whose delay is equal to (Q − 1) × TMUX; and 3) the path of the FA chain in the last stage, whose delay is equal to (M − 1) × TCARRY + TSUM. Note that TCARRY, TSUM, and TMUX are the propagation delays of the carry output of an FA, the sum output of an FA, and the output of a 2:1 multiplexer, respectively. Hence, the critical path delay of a FSS CSKA is formulated by

TD = M × TCARRY + (Q − 1) × TMUX + [(M − 1) × TCARRY + TSUM]    (1)

Based on (1), and using Q = N/M, the optimum value of M (Mopt) that leads to the optimum propagation delay may be calculated as

Mopt = (0.5Nα)^(1/2)    (2)

where α is equal to TMUX/TCARRY. Thus, the optimum propagation delay (TD,opt), obtained by substituting Mopt into (1), is almost proportional to the square root of the product of N and α [19].
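As a worked instance of (1) and (2), assume N = 32 and, purely for illustration, α = TMUX/TCARRY = 1 (no specific value is given in the paper):

\[
M_{opt} = \sqrt{0.5\,N\alpha} = \sqrt{0.5\times 32} = 4, \qquad Q = N/M_{opt} = 8,
\]
\[
T_D = 4\,T_{CARRY} + 7\,T_{MUX} + (3\,T_{CARRY} + T_{SUM}) = 14\,T_{CARRY} + T_{SUM},
\]

compared with 31 TCARRY + TSUM for the corresponding 32-bit RCA.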
B. Variable Stage Size CSKA

As mentioned before, by assigning variable sizes to the stages, the speed of the CSKA may be improved. The speed improvement in this type is achieved by lowering the delays of the first and third terms in (1). These delays are minimized by lowering the sizes of the first and last RCA blocks; for instance, the first RCA block size may be set to one, whereas the sizes of the following blocks may increase. To determine the rate of increase, let us express the propagation delay of C1j (t1j) in terms of t0j−1 (t1j−1), the calculating delay of the C0j−1 (C1j−1) signal in the (j − 1)th stage, as given in (3). In an FSS CSKA, except in the first stage, t0j is smaller than t1j. Hence, based on (3), the delay of t0j−1 may be increased up to t1j−1 without increasing the delay of the C1j signal. This means that one could increase the size of the (j − 1)th stage (i.e., Mj−1) without increasing the propagation delay of the CSKA. Therefore, the increase in the size of Mj for the jth stage should be bounded by (4).

Since the last RCA block size also should be minimized, the increase in the stage size may not be continued to the last RCA block; in this optimal CSKA, the size of the last RCA block may only be one (i.e., one FA). Now, we justify the decrease in the RCA block sizes toward the last stage. First, note that based on Fig. 1, in the worst case, the output of the (i + 1)th stage is accessible after t1i + TMUX + TSUM,i+1, where TSUM,i+1 is the delay of the (i + 1)th RCA block for calculating all of its sum outputs when its carry input is ready. Assuming that the pth stage, which is called the nucleus of the adder, has the maximum size [24], we wish to keep the delay of the outputs of the following stages (i ≥ p) equal to the delay of the output of the pth stage. Hence, the size of the (i + 1)th stage should be reduced to decrease TSUM,i+1, preventing an increase in the worst-case delay (TD) of the adder. In other words, we eliminate the increase in the delay of the next stage due to the additional multiplexer by reducing the sum delay of the RCA block. This may be analytically expressed as (5). The trend of decreasing the stage size should be continued until the required number of adder bits is produced.

Therefore, to reach the highest number of input bits under a constant propagation delay, both (4) and (5) should be satisfied. Having these constraints, we can minimize the delay of the CSKA for a given number of input bits to find the stage sizes of an optimal structure, in which the size of the first p stages is increased while the size of the last (Q − p) stages is decreased. In particular, we increase the size of the first p stages up to the nucleus using (6) [19], while the size of the last (Q − p) stages, from the nucleus to the last stage, should decrease based on (7) [19]. Finally, by setting M1 to 1 and using (6) and (7), the exact sizes of the stages, the optimal values of M1, ..., MQ and Q, as well as the delay of the optimal CSKA may be calculated [19].

It should be noted that, in real implementations, one may realize only a near-optimal structure, as detailed in [19] and [21], where the estimation of the near-optimal propagation delay of the CSKA is given by (8) [19]. This equation may be written in a more general form by replacing TMUX with TSKIP, to allow for logic types other than the multiplexer; for that form, given in (9), α becomes equal to TSKIP/TCARRY. The case where α is an integer value is the one that has been studied in [19]. In the case where α is a non-integer value, the term [α/2] is used; in particular, when TSKIP < TCARRY, α is a non-integer smaller than one, and [α/2] becomes equal to one. Subsequently, the near-optimal structure is determined. Note that, as (9) reveals, a large portion of the critical path delay is due to the carry skip logics.

Fig. 2. Proposed CI-CSKA structure.

IV. PROPOSED CSKA STRUCTURE

Based on the discussion presented in Section III, it is concluded that by reducing the delay of the skip logic, one may lower the propagation delay of the CSKA significantly. Hence, in this paper, we present a modified CSKA structure that reduces this delay.

A. General Description of the Proposed Structure

The structure is based on combining the concatenation and incrementation schemes [13] with the Conv-CSKA structure, and hence is denoted CI-CSKA. It provides the ability to use simpler carry skip logics: the logic replaces the 2:1 multiplexers by AOI/OAI compound gates (Fig. 2). The gates, which consist of fewer transistors, have lower delay, smaller area, and smaller power consumption compared with those of the 2:1 multiplexer [37].
Note that, in this structure, as the carry propagates through the skip logics, it becomes complemented. Therefore, at the output of the skip logic of even stages, the complement of the carry is generated. The structure has a considerably lower propagation delay with a slightly smaller area compared with those of the conventional one. Note that while the power consumption of the AOI (or OAI) gate is smaller than that of the multiplexer, the power consumption of the proposed CI-CSKA is a little higher than that of the conventional one. This is due to the increase in the number of gates, which imposes a higher wiring capacitance (in the noncritical paths).

Now, we describe the internal structure of the proposed CI-CSKA, shown in Fig. 2, in more detail. The adder contains two N-bit inputs, A and B, and Q stages. Each stage consists of an RCA block with the size of Mj (j = 1, ..., Q). In this structure, the carry input of all the RCA blocks, except for the first block (whose carry input is Ci), is zero (concatenation of the RCA blocks). Therefore, all the blocks execute their jobs simultaneously. When the first block computes the summation of its corresponding input bits (i.e., SM1, ..., S1) and C1, the other blocks simultaneously compute the intermediate results [i.e., {Z(Kj+Mj), ..., Z(Kj+2), Z(Kj+1)} for Kj = Σ(r=1 to j−1) Mr (j = 2, ..., Q)], and also the Cj signals. In the proposed structure, the first stage has only one block, which is an RCA. Stages 2 to Q consist of two blocks, an RCA block and an incrementation block. The incrementation block uses the intermediate results generated by the RCA block and the carry output of the previous stage to calculate the final summation of the stage. The internal structure of the incrementation block, which contains a chain of half-adders (HAs), is shown in Fig. 3 (Fig. 3. Internal structure of the jth incrementation block, Kj = Σ(r=1 to j−1) Mr, j = 2, ..., Q). In addition, note that, to reduce the delay considerably, the carry output of the incrementation block is not used for computing the carry output of the stage.

As shown in Fig. 2, the skip logic determines the carry output of the jth stage (CO,j) based on the intermediate results of the jth stage and the carry output of the previous stage (CO,j−1), as well as the carry output of the corresponding RCA block (Cj). When determining CO,j, these cases may be encountered: when Cj is equal to one, CO,j will be one; on the other hand, when Cj is equal to zero, if the product of the intermediate results is one (zero), the value of CO,j will be the same as CO,j−1 (zero). The reason for using both AOI and OAI compound gates as the skip logics is the inverting function of these gates in standard cell libraries. This way, the need for an inverter gate, which would increase the power consumption and delay, is eliminated. As shown in Fig. 2, if an AOI is used as the skip logic of one stage, the next skip logic should use an OAI gate. In addition, another point to mention is that the use of the proposed skipping structure in the Conv-CSKA structure would increase the delay of the critical path considerably. This originates from the fact that, in the Conv-CSKA, the skip logic (AOI or OAI compound gate) is not able to bypass the zero carry input until the zero carry input propagates from the corresponding RCA block. To solve this problem, in the proposed structure, we have used RCA blocks with a carry input of zero (the concatenation approach). Gate-level sketches of the incrementation block and of the alternating skip logic follow.
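The following Verilog sketches model the two stage components just described. The case analysis above reduces to CO,j = Cj OR (P AND CO,j−1), where P is the product of the stage's intermediate (propagate) results. The module names, and in particular the signal polarities threading the complemented carry between the AOI and OAI stages (including the availability of complemented P and Cj for the OAI stage), are our assumptions, not a quotation of the original netlist.

  module incrementation #(parameter M = 4) (
    input  [M-1:0] z,    // intermediate results from the zero-carry-in RCA block
    input          cin,  // carry output of the previous stage's skip logic
    output [M-1:0] s     // final sum outputs of the stage
  );
    wire [M:0] c;
    assign c[0] = cin;
    genvar i;
    generate
      for (i = 0; i < M; i = i + 1) begin : ha_chain
        assign s[i]   = z[i] ^ c[i]; // HA sum (XOR gate)
        assign c[i+1] = z[i] & c[i]; // HA carry (AND gate)
      end
    endgenerate
    // c[M] is deliberately unused: as stated above, the incrementation
    // block's carry output does not drive the stage's carry output.
  endmodule

  // Odd stages: an AOI21 produces the COMPLEMENT of the stage carry.
  module skip_aoi (
    input  P,       // product of the stage's propagate signals
    input  Cj,      // carry output of the stage's RCA block
    input  co_prev, // true carry output of the previous stage
    output co_n     // complemented carry output of this stage
  );
    assign co_n = ~(Cj | (P & co_prev)); // AOI21 gives ~(CO,j) directly
  endmodule

  // Even stages: an OAI21 consumes the complemented signals and restores
  // the true carry polarity, so no extra inverter is needed.
  module skip_oai (
    input  P_n,       // complemented propagate product
    input  Cj_n,      // complemented RCA block carry
    input  co_prev_n, // complemented carry from the previous (AOI) stage
    output co         // true carry output of this stage
  );
    // ~((P_n | co_prev_n) & Cj_n) = Cj | (P & CO,j-1)
    assign co = ~((P_n | co_prev_n) & Cj_n);
  endmodule

Both compound gates compute the same skip function; alternating them is what absorbs the carry complementation described in the text.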
This way, since the RCA block of a stage does not need to wait for the carry output of the previous stage, the output carries of the blocks are calculated in parallel.

B. Area and Delay of the Proposed Structure

As mentioned before, the use of static AOI and OAI gates (six transistors) instead of the static 2:1 multiplexer (12 transistors) leads to decreases in the area usage and delay of the skip logic [37], [38]. In addition, except for the first RCA block, the carry input of all other blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is a HA. This means that (Q − 1) FAs in the conventional structure are replaced with the same number of HAs in the suggested structure, decreasing the area usage (Fig. 2). In addition, note that the proposed structure utilizes incrementation blocks that do not exist in the conventional one. These blocks, however, may be implemented with about the same logic gates (XOR and AND gates) as those used for generating the select signal of the multiplexer in the conventional structure. Therefore, the area usage of the proposed CI-CSKA structure is decreased compared with that of the conventional one.

The critical path of the proposed CI-CSKA structure, which contains three parts, is shown in Fig. 2. These parts include the chain of the FAs of the first stage, the path of the skip logics, and the incrementation block of the last stage. The delay of this path (TD) may be expressed as (10), where the three brackets correspond to the three parts mentioned above, respectively, and [(Mj − 1)TAND + TXOR] is the critical path delay of the jth incrementation block (TINC,j). Here, TAND and TXOR are the delays of two-input static AND and XOR gates, respectively. To calculate the delay of the skip logic, the average of the delays of the AOI and OAI gates, which are typically close to each other [35], is used. Thus, (10) may be modified to (11), where TAOI and TOAI are the delays of the static AOI and OAI gates, respectively. In this case, α becomes (TAOI + TOAI) / (2 × TCARRY), and this new value of TSKIP should be used in (9).

The comparison of (1) and (11) indicates that the delay of the proposed structure is smaller than that of the conventional one. The first reason is that the delay of the skip logic is considerably smaller than that of the conventional structure, while the number of stages is about the same in both structures. Second, since TAND and TXOR are smaller than TCARRY and TSUM, the third additive term in (11) is smaller than the third term in (1) [37]. It should be noted that the delay reduction of the skip logic has the largest impact on the delay decrease of the whole structure.

C. Stage Sizes Consideration

Similar to the Conv-CSKA structure, the proposed CI-CSKA structure may be implemented with either FSS or VSS. Here, the stage size is the same as the size of its RCA and incrementation blocks. In the case of the FSS (FSS-CI-CSKA), there are Q = N/M stages with the size of M; the optimum value of M may be obtained from (11) in the same way that (1) was minimized in Section III-A. In the case of the VSS (VSS-CI-CSKA), the sizes of the stages, which are M1 to MQ, are obtained using a method similar to the one discussed in Section III-B. In particular, the following steps should be taken.

1) The size of the RCA block of the first stage is one.
2) From the second stage to the nucleus stage, the sizes of the stages are either not changed or increased. The size of the RCA block of the jth stage should be as large as possible, while the delay of the product of its output sums should be smaller than the delay of the carry output of the (j − 1)th stage. Hence, the size of the jth stage is determined based on the delay of the product of the sums of its RCA block and the delay of the carry output of the (j − 1)th stage.

3) The increase in the size is continued until the summation of all the sizes up to this stage becomes larger than N/2. The stage at which this happens, which has the largest size, is considered as the nucleus (pth) stage.

4) Starting from the stage (p + 1) to the last stage, the size of the stage i is determined based on the delays of the incrementation blocks of the ith and (i − 1)th stages (TINC,i and TINC,i−1, respectively), in particular according to (13). In this case, the size of the last stage is one, and its RCA block contains a HA.

5) Finally, it is possible that the sum of all the stage sizes does not become equal to N. In the case where the sum is smaller than N by d bits, we should add another stage with the size of d; the stage is placed close to the stage with the same size. In the case where the sum is larger than N by d bits, the sizes of the stages should be revised (Step 3), and there are cases where we should consider the stage right before the current nucleus as the nucleus stage (Step 5). For more details on how to revise the stage sizes, one may refer to [19].

Now, the procedure for determining the stage sizes is demonstrated for the 32-bit adder. The number of stages and the corresponding size of each stage, which are given in Fig. 4 for both the conventional and the proposed CI-CSKA structures, have been determined based on a 45-nm static CMOS technology [38]. The dashed and dotted lines in the plot indicate the rates of size increase and decrease, respectively. While the increase and decrease rates in the conventional structure are balanced, the decrease rate is larger than the increase rate in the case of the proposed structure. This originates from the fact that, in the Conv-CSKA structure, both the stage size increase and decrease are determined based on the RCA block delay [according to (4) and (5)], while in the proposed CI-CSKA structure, the increase is determined based on the RCA block delay and the decrease is determined based on the incrementation block delay [according to (13)]. The imbalanced rates may yield a larger nucleus stage and a smaller number of stages, leading to a smaller propagation delay.

Fig. 4. Sizes of the stages in the case of VSS for the proposed and conventional 32-bit CSKA structures in 45-nm static CMOS technology.

V. PROPOSED HYBRID VARIABLE LATENCY CSKA

In this section, first, the structure of a generic variable latency adder, which may be used with the voltage scaling relying on adaptive clock stretching, is described. Then, a hybrid variable latency CSKA structure based on the CI-CSKA structure described in Section IV is proposed.

A. Variable Latency Adders Relying on Adaptive Clock Stretching

The basic idea behind variable latency adders is that the critical paths of the adders are activated rarely [33]. Therefore, the slack between the longest off-critical paths and the longest critical paths determines the maximum amount of the supply voltage scaling: if the critical paths are not activated, the supply voltage may be scaled down without decreasing the clock frequency, and one clock period is enough for completing the operation. In the cases where the critical paths are activated, the structure allows two clock periods for finishing the operation.
Hence, in the variable latency adders, a predictor block, which works based on the input pattern, is required for determining the activation of the critical paths [28]. The predictor block consists of some XOR and AND gates that determine the product of the propagate signals of the considered bit positions. It should be noted that the paths that the predictor shows are (are not) active for a given set of inputs are considered as critical (off-critical) paths.

The concepts of variable latency adders, adaptive clock stretching, and also supply voltage scaling in an N-bit RCA adder may be explained using Fig. 5 (Fig. 5. Generic structure of variable latency adders based on RCA). In Fig. 5, the input bits (j + 1)th to (j + m)th have been exploited to predict the propagation of the carry output of the jth stage (FA) to the carry output of the (j + m)th stage. For this configuration, the carry propagation path from the first stage to the Nth stage is the longest critical path, which is denoted by the Long Latency Path (LLP), while the carry propagation path from the first stage to the (j + m)th stage and the carry propagation path from the (j + 1)th stage to the Nth stage (denoted by Short Latency Path 1 (SLP1) and SLP2, respectively) are the longest off-critical paths. The range of voltage scaling is determined by the slack time, which is defined by the delay difference between the LLP and max(SLP1, SLP2).

Since the predictor block has some area and power overheads, only a few middle bits are used to predict the activation of the critical paths, at the price of a decrease in prediction accuracy [31]; having the bits in the middle decreases the maximum length of the off-critical paths [33]. There are cases where the predictor mispredicts the critical path activation. By increasing m, the number of mispredictions decreases, at the price of increasing the longest off-critical path and hence limiting the range of the voltage scaling. Therefore, the predictor block size should be selected based on these tradeoffs (e.g., for a 32-bit adder, m = 6–10 may be considered [33]). Since the activation probability of the critical paths is low (<1/2^m), the clock stretching has a negligible impact on the throughput. A sketch of such a predictor follows.
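The following is a minimal Verilog sketch of the predictor just described: it ANDs the propagate signals of m middle bit positions, and a result of one predicts that a long carry propagation (a critical path) may be activated, so the clock should be stretched to two cycles. The module name, parameter names, and the particular window position J are our illustrative assumptions.

  module predictor #(parameter N = 32, parameter M = 8, parameter J = 12) (
    input  [N-1:0] a, b,
    output         stretch  // 1 predicts a critical-path activation
  );
    wire [N-1:0] p = a ^ b;        // XOR gates: bitwise propagate signals
    assign stretch = &p[J+M-1:J];  // AND reduction over the m middle bits
  endmodule

With M = 8, the predictor asserts stretch for at most 1 in 2^8 random input pairs, which is why the two-cycle operation costs little throughput.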
B. Proposed Hybrid Variable Latency CSKA Structure

The basic idea behind using VSS CSKA structures was to almost balance the delays of the paths such that the delay of the critical path is minimized compared with that of the FSS structure [21]. This, however, deprives us of the opportunity of using the slack time for the supply voltage scaling. Thus, to provide the variable latency feature for the VSS CSKA structure, in this section we replace some of the middle stages in our proposed structure with a PPA modified in this paper. It should be noted that since the Conv-CSKA structure has a lower speed than that of the proposed one, we do not consider the conventional structure here.

The proposed hybrid variable latency CSKA structure is shown in Fig. 6 (Fig. 6. Structure of the proposed hybrid variable latency CSKA), where an Mp-bit modified PPA is used for the pth stage (nucleus stage). Since the nucleus stage, which has the largest size (and delay) among the stages, is present in both SLP1 and SLP2, replacing it by the PPA reduces the delay of the longest off-critical paths. Thus, the use of the fast PPA helps increase the available slack time in the variable latency structure. It should be mentioned that since the input bits of the PPA block are used in the predictor block, this block becomes part of both SLP1 and SLP2. In addition, note that the first point of SLP1 is the first input bit of the first stage, and the last point of SLP2 is the last bit of the sum output of the incrementation block of the stage Q.

In the proposed hybrid structure, the prefix network of the Brent–Kung adder [39] is used for constructing the nucleus stage (Fig. 7). One of the advantages of this adder compared with other prefix adders is that, in this structure, the longest carry is calculated sooner than the intermediate carries, while the length of its wiring is smaller [14]. In addition, it has a simple and regular layout, and the fan-out of the adder is less than that of other parallel adders. The internal structure of the stage p, including the modified PPA and the skip logic, is shown in Fig. 7 (Fig. 7. Internal structure of the pth stage of the proposed hybrid variable latency CSKA; Mp is equal to 8 and Kp = Σ(r=1 to p−1) Mr). Note that, for this figure, the size of the PPA is assumed to be 8 (i.e., Mp = 8). Since the PPA structure is more efficient when its size is equal to an integer power of two, we can select a larger size for the nucleus stage accordingly [14]; this implies that the third step discussed in Section IV-C is modified. The larger size (number of bits) leads to a decrease in the number of stages as well as smaller delays for SLP1 and SLP2, and hence, the slack time increases further. The steps for determining the sizes of the stages in the hybrid variable latency CSKA structure are otherwise similar to the ones discussed in Section IV.

As shown in the figure, in the preprocessing level, the propagate signals (Pi) and generate signals (Gi) for the inputs are calculated. In the next level, using the Brent–Kung parallel prefix network, the longest carry of the prefix network (i.e., G8:1), along with P8:1, which is the product of all the propagate signals of the inputs, are calculated sooner than the other intermediate signals in this network, using the forward paths. The signal P8:1 is used in the skip logic to determine whether the carry output of the previous stage (i.e., CO,p−1) should be skipped or not; in addition, this signal is exploited as the predictor signal in the variable latency adder. In the case where P8:1 is one, CO,p−1 should skip this stage, and the predictor indicates that some critical paths are activated. On the other hand, when P8:1 is zero, CO,p is equal to G8:1, and no critical path will be activated in this case. After the parallel prefix network, the intermediate carries, which are functions of CO,p−1 and the intermediate signals, are computed by the backward paths (Fig. 7). Finally, in the postprocessing level, the output sums of this stage are calculated. It should be mentioned that all of these operations are performed in parallel with the other stages, and that this implementation is based on ideas similar to the concatenation and incrementation concepts used in the CI-CSKA discussed in Section IV. It should also be noted that the end part of the SLP1 path, from CO,p−1 to the final summation results of the PPA block, and the beginning part of the SLP2 paths, from the inputs of this block to CO,p, belong to the PPA block (Fig. 7). A sketch of the forward path of such a prefix network follows.
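As an illustration, the following Verilog sketch shows only the forward (root) path of an 8-bit Brent–Kung prefix network, producing the group generate and propagate over all eight bits (G8:1 and P8:1 in the text's numbering; bits are indexed 0 to 7 here). It is an assumption-level model with our own names, not the full BrentKung8 adder that is instantiated in the source code of Chapter 7, which also needs the backward paths and the postprocessing level.

  module bk8_root (
    input  [7:0] a, b,
    output       G81, P81  // group generate/propagate over all 8 bits
  );
    wire [7:0] g = a & b;  // preprocessing: bit generate signals
    wire [7:0] p = a ^ b;  // preprocessing: bit propagate signals

    // prefix operator: (G,P) = (Gl | (Pl & Gr), Pl & Pr), l = more significant
    // level 1: pairs
    wire g21 = g[1] | (p[1] & g[0]);  wire p21 = p[1] & p[0];
    wire g43 = g[3] | (p[3] & g[2]);  wire p43 = p[3] & p[2];
    wire g65 = g[5] | (p[5] & g[4]);  wire p65 = p[5] & p[4];
    wire g87 = g[7] | (p[7] & g[6]);  wire p87 = p[7] & p[6];
    // level 2: quads
    wire g41 = g43 | (p43 & g21);     wire p41 = p43 & p21;
    wire g85 = g87 | (p87 & g65);     wire p85 = p87 & p65;
    // level 3: root outputs, available after only log2(8) = 3 levels
    assign G81 = g85 | (p85 & g41);
    assign P81 = p85 & p41;
  endmodule

The three-level forward tree is why G8:1 and P8:1 are ready before the intermediate carries, which the backward paths compute afterwards.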
Chapter-6 Verilog HDL

In the semiconductor and electronic design industry, Verilog is a hardware description language (HDL) used to model electronic systems. Verilog HDL, not to be confused with VHDL (a competing language), is most commonly used in the design, verification, and implementation of digital logic chips at the register-transfer level of abstraction. It is also used in the verification of analog and mixed-signal circuits.

6.1 Overview

Hardware description languages such as Verilog differ from software programming languages because they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators, a blocking assignment (=) and a non-blocking (<=) assignment. The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables. Since these concepts are part of Verilog's language semantics, designers could quickly write descriptions of large circuits in a relatively compact and concise form. At the time of Verilog's introduction (1984), Verilog represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture software and specially written software programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming language, which was already widely used in engineering software development. Like C, Verilog is case-sensitive and has a basic preprocessor (though less sophisticated than that of ANSI C/C++). Its control flow keywords (if/else, for, while, case, etc.) are equivalent, and its operator precedence is compatible. Syntactic differences include variable declaration (Verilog requires bit-widths on net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy, and communicate with other modules through a set of declared input, output, and bidirectional ports. Internally, a module can contain any combination of the following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block. However, the blocks themselves are executed concurrently, making Verilog a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating, undefined") and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved by a function of the source drivers and their strengths.
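As a minimal illustration of the module and port concepts above, the following sketch shows one module instantiated inside another; the names and the logic are ours, chosen purely for illustration.

  module and_or (input a, b, output y_and, y_or);
    assign y_and = a & b;  // concurrent (continuous) assignments
    assign y_or  = a | b;
  endmodule

  module top (input a, b, output y);
    wire t1, t2;                                     // internal nets
    and_or u0 (.a(a), .b(b), .y_and(t1), .y_or(t2)); // sub-module instance
    assign y = t1 ^ t2;                              // uses the instance outputs
  endmodule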
A subset of statements in the Verilog language are synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register-transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a netlist, a logically equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology. Further manipulations to the netlist ultimately lead to a circuit fabrication blueprint (such as a photo mask set for an ASIC or a bitstream file for an FPGA).

6.2 History

6.2.1 Beginning

Verilog was the first modern hardware description language to be invented. It was created by Phil Moorby and Prabhu Goel during the winter of 1983/1984, at "Automated Integrated Design Systems" (later renamed Gateway Design Automation in 1985), as a hardware modeling language. Originally, Verilog was intended to describe and allow simulation; only afterwards was support for synthesis added. Gateway Design Automation was purchased by Cadence Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the de facto standard (of Verilog logic simulators) for the next decade.

6.2.2 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language available for open standardization. Cadence transferred Verilog into the public domain under the Open Verilog International (OVI) (now known as Accellera) organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95. In the same time frame, Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS, which encompassed Verilog-95.

6.2.3 Verilog 2001

Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that users had found in the original Verilog standard. These extensions became IEEE Standard 1364-2001, known as Verilog-2001. Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for (2's complement) signed nets and variables. Previously, code authors had to perform signed operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value). The same function under Verilog-2001 can be more succinctly described by one of the built-in operators: +, -, /, *, >>> (a short sketch of these conveniences follows the hello-world example below). A generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows Verilog-2001 to control instance and statement instantiation through normal decision operators (case/if/else); using generate/endgenerate, Verilog-2001 can instantiate an array of instances, with control over the connectivity of the individual instances. File I/O has been improved by several new system tasks. And finally, a few syntax additions were introduced to improve code readability (e.g., always @*, named parameter override, C-style function/task/module header declaration). Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

6.2.4 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005) consists of minor corrections, spec clarifications, and a few new language features (such as the uwire keyword). A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and mixed-signal modeling with traditional Verilog.

Example

A hello world program looks like this:

  module main;
    initial
      begin
        $display("Hello world!");
        $finish;
      end
  endmodule
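The following is the short sketch promised above of the Verilog-2001 conveniences: a carry-out obtained without explicit Boolean algebra, and signed arithmetic without bit-level tricks. Module and signal names are ours, for illustration only.

  module v2001_demo (
    input  [7:0]        a, b,
    input  signed [7:0] sa,       // Verilog-2001 signed net
    output [7:0]        sum,
    output              carry_out,
    output signed [7:0] half_sa
  );
    assign {carry_out, sum} = a + b; // carry-out falls out of the 9-bit result
    assign half_sa = sa >>> 1;       // arithmetic (sign-preserving) right shift
  endmodule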
The other assignment operator. Its action doesn't register until the next clock cycle. Instead. This means that the order of the assignments is irrelevant and will produce the same result: flop1 and flop2 will swap values every clock. the compiler would understand to simply set flop1 equal to flop2 (and subsequently ignore the redundant logic to set flop2 equal to flop1. flop2 <= flop1. flop2 <= 1. had the statements used the "=" blocking operator instead of "<=". count. // TITLE 'Divide-by-20 Counter with enables' // enable CEP is a clock enable only // enable CET is a clock enable and // enables the TC output // a counter using the Verilog language parameter size = 5. In the above example. is referred to as a blocking assignment. // These inputs/outputs represent inputclk. inputrst. the target variable is updated immediately. cet. flop1 and flop2 would not have been swapped. This is known as a "non-blocking" assignment. end endmodule The "<=" operator in Verilog is another aspect of its being a hardware description language as opposed to a normal procedural language. end else begin flop1 <= flop2. b. reg a. else if (cet&&cep) // Enables both true begin if (count == length-1) count<= {size{1'b0}}. inputcep. // Other signals are of type wire // The always statement below is a parallel // execution statement that // executes any time the signals // rst or clk transition from low to high always @ (posedgeclk or posedgerst) if (rst) // This causes reset of the cntr count<= {size{1'b0}}.inputcet. reg [size-1:0] count. endmodule An example of delays: . wire e. d. outputtc. else count<= count + 1'b1. // Signals assigned // within an always // (or initial)block // must be of type reg wiretc.. . end // the value of tc is continuously assigned // the value of the expression assigntc = (cet&& (count == length-1)). c.. output [size-1:0] count. Signals that are driven from outside a process must be of type wire. and due to the blocking assignment. a is immediately assigned a new value.0 extension is automatic) 4'b1010 . $display. Signals that are driven from within a process (an initial or always block) must be of type reg. The basic syntax is: <Width in bits>'<base letter><number> Examples: 12'h123 . The examples presented here are the classic subset of the language that has a direct mapping to real gates. b is assigned a new value afterward (taking into account the new value of a). Definition of constants The definition of constants in Verilog supports the addition of a width parameter. #5 c = b.Hexadecimal 123 (using 12 bits) 20'd44 .Octal 77 (using 6 bits) Synthesizeable constructs There are several statements in Verilog that have no analog in real hardware.. c is assigned the value of b and the value of c ^ e is tucked away in an invisible store.. When one of these changes. // Mux examples .g. The keyword reg does not necessarily imply a hardware register. end The always clause above illustrates the other type of method of use. e. i.Three ways to do the same thing. d = #6 c ^ e. . b = a | b.Binary 1010 (using 4 bits) 6'o77 . always @(b or e) begin a = b & e.e. d is assigned the value that was tucked away.. much of the language can not be used to describe hardware. After a delay of 5 time units.Decimal 44 (using 20 bits . it executes whenever any of the entities in the list (the b or e) changes. Consequently. Then after 6 more time units. The output will remain stable regardless of the input signal while the gate is set to "hold". 
An example of delays:

  reg a, b, c, d;
  wire e;

  always @(b or e)
    begin
      a = b & e;
      b = a | b;
      #5 c = b;
      d = #6 c ^ e;
    end

The always clause above illustrates the other type of method of use, i.e., it executes whenever any of the entities in the list (the b or e) changes. When one of these changes, a is immediately assigned a new value, and due to the blocking assignment, b is assigned a new value afterward (taking into account the new value of a). After a delay of 5 time units, c is assigned the value of b, and the value of c ^ e is tucked away in an invisible store. Then after 6 more time units, d is assigned the value that was tucked away.

Signals that are driven from within a process (an initial or always block) must be of type reg. Signals that are driven from outside a process must be of type wire. The keyword reg does not necessarily imply a hardware register.

Definition of constants

The definition of constants in Verilog supports the addition of a width parameter. The basic syntax is:

  <Width in bits>'<base letter><number>

Examples:

  12'h123 - Hexadecimal 123 (using 12 bits)
  20'd44  - Decimal 44 (using 20 bits - 0 extension is automatic)
  4'b1010 - Binary 1010 (using 4 bits)
  6'o77   - Octal 77 (using 6 bits)

Synthesizeable constructs

There are several statements in Verilog that have no analog in real hardware, e.g., $display. Consequently, much of the language can not be used to describe hardware. The examples presented here are the classic subset of the language that has a direct mapping to real gates.

  // Mux examples - Three ways to do the same thing.

  // The first example uses continuous assignment
  wire out;
  assign out = sel ? a : b;

  // the second example uses a procedure
  // to accomplish the same thing.
  reg out;
  always @(a or b or sel)
    begin
      case(sel)
        1'b0: out = b;
        1'b1: out = a;
      endcase
    end

  // Finally - you can use if/else in a
  // procedural structure.
  reg out;
  always @(a or b or sel)
    if (sel)
      out = a;
    else
      out = b;

The next interesting structure is a transparent latch; it will pass the input to the output when the gate signal is set for "pass-through", and captures the input and stores it upon transition of the gate signal to "hold". The output will remain stable regardless of the input signal while the gate is set to "hold". In the example below, the "pass-through" level of the gate would be when the value of the if clause is true, i.e. gate = 1. This is read "if gate is true, the din is fed to latch_out continuously." Once the if clause is false, the last value at latch_out will remain and is independent of the value of din.

  // Transparent latch example
  reg out;
  always @(gate or din)
    if (gate)
      out = din; // Pass through state
      // Note that the else isn't required here. The variable
      // out will follow the value of din while gate is high.

The flip-flop is the next significant template; in Verilog, the D-flop is the simplest, and it can be modeled as:

  reg q;
  always @(posedge clk)
    q <= d;

The significant thing to notice in the example is the use of the non-blocking assignment. A basic rule of thumb is to use <= when there is a posedge or negedge statement within the always clause.

A variant of the D-flop is one with an asynchronous reset; there is a convention that the reset state will be the first if clause within the statement.

  reg q;
  always @(posedge clk or posedge reset)
    if (reset)
      q <= 0;
    else
      q <= d;

The next variant is including both an asynchronous reset and asynchronous set condition; again the convention comes into play, i.e., the reset term is followed by the set term.

  reg q;
  always @(posedge clk or posedge reset or posedge set)
    if (reset)
      q <= 0;
    else
    if (set)
      q <= 1;
    else
      q <= d;

Note: If this model is used to model a Set/Reset flip flop then simulation errors can result. Consider the following test sequence of events: 1) reset goes high; 2) clk goes high; 3) set goes high; 4) clk goes high again; 5) reset goes low; followed by 6) set going low. Assume no setup and hold violations. In this example the always @ statement would first execute when the rising edge of reset occurs, which would place q to a value of 0. The next time the always block executes would be the rising edge of clk, which again would keep q at a value of 0.
The always block then executes when set goes high, which, because reset is high, forces q to remain at 0. This condition may or may not be correct depending on the actual flip flop. However, this is not the main problem with this model. Notice that when reset goes low, set is still high. In a real flip flop this will cause the output to go to a 1; however, in this model it will not occur, because the always block is triggered by rising edges of set and reset - not levels. A different approach may be necessary for set/reset flip flops.

The final basic variant is one that implements a D-flop with a mux feeding its input. The mux has a d-input and feedback from the flop itself. This allows a gated load function.

  // Basic structure with an EXPLICIT feedback path
  always @(posedge clk)
    if (gate)
      q <= d;
    else
      q <= q; // explicit feedback path

  // The more common structure ASSUMES the feedback is present
  // This is a safe assumption since this is how the
  // hardware compiler will interpret it. This structure
  // looks much like a latch. The differences are the
  // '''@(posedge clk)''' and the non-blocking '''<='''
  always @(posedge clk)
    if (gate)
      q <= d; // the "else" mux is "implied"

Note that there are no "initial" blocks mentioned in this description. There is a split between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial blocks where reg values are established instead of using a "reset" signal. ASIC synthesis tools don't support such a statement. The reason is that an FPGA's initial state is something that is downloaded into the memory tables of the FPGA, while an ASIC is an actual hardware implementation.

Initial and always

There are two separate ways of declaring a Verilog process: the always and the initial keywords. The always keyword indicates a free-running process. The initial keyword indicates a process executes exactly once. Both constructs begin execution at simulator time 0, and both execute until the end of the block. Once an always block has reached its end, it is rescheduled (again). It is a common misconception to believe that an initial block will execute before an always block. In fact, it is better to think of the initial-block as a special-case of the always-block, one which terminates after it completes for the first time.

  //Examples:
  initial
    begin
      a = 1;  // Assign a value to reg a at time 0
      #1;     // Wait 1 time unit
      b = a;  // Assign the value of reg a to reg b
    end

  always @(a or b) // Any time a or b CHANGE, run the process
    begin
      if (a)
        c = b;
      else
        d = ~b;
    end // Done with this block, now return to the top (i.e. the @ event-control)

  always @(posedge a) // Run whenever reg a has a low to high change
    a <= b;

These are the classic uses for these two keywords, but there are two significant additional uses. The most common of these is an always keyword without the @(...) sensitivity list. It is possible to use always as shown below:

  always
    begin   // Always begins executing at time 0 and NEVER stops
      clk = 0; // Set clk to 0
      #1;      // Wait for 1 time unit
      clk = 1; // Set clk to 1
      #1;      // Wait 1 time unit
    end // Keeps executing - so continue back at the top of the begin

The always keyword acts similar to the "C" construct while(1) {..} in the sense that it will execute forever. The other interesting exception is the use of the initial keyword with the addition of the forever keyword. The example below is functionally identical to the always example above.

  initial forever // Start at time 0 and repeat the begin/end forever
    begin
      clk = 0; // Set clk to 0
      #1;      // Wait for 1 time unit
      clk = 1; // Set clk to 1
      #1;      // Wait 1 time unit
    end

Fork/join

The fork/join pair are used by Verilog to create parallel processes. All statements (or blocks) between a fork/join pair begin execution simultaneously upon execution flow hitting the fork. Execution continues after the join upon completion of the longest running statement or block between the fork and join.

  initial
    fork
      $write("A"); // Print Char A
      $write("B"); // Print Char B
      begin
        #1;          // Wait 1 time unit
        $write("C"); // Print Char C
      end
    join

The way the above is written, it is possible to have either the sequence "ABC" or "BAC" print out. The order of simulation between the first $write and the second $write depends on the simulator implementation, and may purposefully be randomized by the simulator. This allows the simulation to contain both accidental race conditions as well as intentional non-deterministic behavior. Notice that VHDL cannot dynamically spawn multiple processes like Verilog.

Race conditions

The order of execution isn't always guaranteed within Verilog. This can best be illustrated by a classic example.
Consider the code snippet below:

  initial
    a = 0;

  initial
    b = a;

  initial
    begin
      #1;
      $display("Value a=%b Value of b=%b", a, b);
    end

What will be printed out for the values of a and b? Depending on the order of execution of the initial blocks, it could be zero and zero, or alternately zero and some other arbitrary uninitialized value. The $display statement will always execute after both assignment blocks have completed, due to the #1 delay.

Operators

Note: These operators are not shown in order of precedence.

  Operator type   Operator symbols   Operation performed
  Bitwise         ~                  Bitwise NOT (1's complement)
                  &                  Bitwise AND
                  |                  Bitwise OR
                  ^                  Bitwise XOR
                  ~^ or ^~           Bitwise XNOR
  Logical         !                  NOT
                  &&                 AND
                  ||                 OR
  Reduction       &                  Reduction AND
                  ~&                 Reduction NAND
                  |                  Reduction OR
                  ~|                 Reduction NOR
                  ^                  Reduction XOR
                  ~^ or ^~           Reduction XNOR
  Arithmetic      +                  Addition
                  -                  Subtraction, 2's complement
                  *                  Multiplication
                  /                  Division
                  **                 Exponentiation (*Verilog-2001)
  Relational      >                  Greater than
                  <                  Less than
                  >=                 Greater than or equal to
                  <=                 Less than or equal to
                  ==                 Logical equality (bit-value 1'bX is removed from comparison)
                  !=                 Logical inequality (bit-value 1'bX is removed from comparison)
                  ===                4-state logical equality (bit-value 1'bX is taken as literal)
                  !==                4-state logical inequality (bit-value 1'bX is taken as literal)
  Shift           >>                 Logical right shift
                  <<                 Logical left shift
                  >>>                Arithmetic right shift (*Verilog-2001)
                  <<<                Arithmetic left shift (*Verilog-2001)
  Concatenation   { , }              Concatenation
  Replication     {n{m}}             Replicate value m for n times
  Conditional     ?:                 Conditional

Four-valued logic

The IEEE 1364 standard defines a four-valued logic with four states: 0, 1, Z (high impedance), and X (unknown logic value). For the competing VHDL, a dedicated standard for multi-valued logic exists as IEEE 1164, with nine levels. A short sketch contrasting == and === on such values follows.
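The following small, self-contained sketch (our own example) shows the practical difference between the == and === operators from the table above when an X bit is involved:

  module four_state_demo;
    reg [3:0] v;
    initial begin
      v = 4'b10x1;                 // one bit is unknown (X)
      if (v == 4'b10x1)
        $display("not reached: == with an x bit evaluates to x");
      if (v === 4'b10x1)
        $display("reached: === compares x and z bits literally");
    end
  endmodule

Only the second $display executes, because == treats the X bit as unknown while === matches it literally.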
Chapter-7 FPGA Implementation

7.1 Introduction to FPGA

An FPGA contains a two-dimensional array of logic blocks and interconnections between the logic blocks. Both the logic blocks and the interconnects are programmable: logic blocks are programmed to implement a desired function, and the interconnections are programmed using the switch boxes to connect the logic blocks. To be more clear, if we want to implement a complex design (a CPU, for instance), then the design is divided into small sub-functions and each sub-function is implemented using one logic block. Now, to get our desired design (CPU), all the sub-functions implemented in logic blocks must be connected, and this is done by programming the internal structure of the FPGA, which is depicted in figure 7.1.

Figure 7.1: FPGA interconnections

FPGAs, an alternative to custom ICs, can be used to implement an entire System On one Chip (SOC). Custom ICs are expensive and take a long time to design, so they are useful when produced in bulk amounts. But FPGAs are easy to implement within a short time with the help of Computer Aided Design (CAD) tools (because there is no physical layout process, no mask making, and no IC manufacturing). The main advantage of an FPGA is the ability to reprogram: the user can reprogram an FPGA to implement a design, and this is done after the FPGA is manufactured. This brings the name "Field Programmable." Some disadvantages of FPGAs are that they are slow compared to custom ICs, they can't handle very complex designs, and they draw more power.

A Xilinx logic block consists of one Look Up Table (LUT) and one Flip-Flop. An LUT is used to implement a number of different functionalities: the input lines to the logic block go into the LUT and enable it, the output of the LUT gives the result of the logic function that it implements, and the output of the logic block is the registered or unregistered output of the LUT. SRAM is used to implement a LUT: a k-input logic function is implemented using a 2^k × 1 size SRAM, where each of the latches holds the value of the function corresponding to one input combination. An n-LUT can thus be seen as a direct implementation of a function truth table. For example, a 2-LUT can be used to implement any of 16 functions, such as AND, OR, A + NOT B, etc.; in general, the number of different possible functions for a k-input LUT is 2^2^k. The advantage of such an architecture is that it supports the implementation of very many logic functions; the disadvantage is the unusually large number of memory cells required to implement such a logic block when the number of inputs is large. A k-input LUT-based logic block can be implemented in a number of different ways, with tradeoffs between performance and logic density. Figure 7.2 shows a 4-input LUT-based implementation of a logic block; LUT-based design provides better logic block utilization. A behavioral sketch of such a LUT follows.
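The following behavioral Verilog sketch models the combinational part of a 2-LUT as described above; the truth_table port stands in for the 2^k SRAM configuration cells, and the names are ours, for illustration only.

  module lut2 (
    input  [1:0] in,          // the logic block inputs select one SRAM cell
    input  [3:0] truth_table, // the 2^2 configuration bits (SRAM contents)
    output       out
  );
    assign out = truth_table[in]; // read out the stored function value
  endmodule

For example, configuring truth_table = 4'b1000 makes this LUT a 2-input AND, while 4'b1110 makes it a 2-input OR, illustrating why one LUT can realize any of the 2^2^k = 16 two-input functions.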
Interconnects

A wire segment can be described as two end points of an interconnection with no programmable switch between them. A sequence of one or more wire segments in an FPGA can be termed a track. Typically an FPGA has logic blocks, interconnects, and switch blocks (Input/Output blocks). Switch blocks lie in the periphery of logic blocks and interconnects, and wire segments are connected to logic blocks through switch blocks. Depending on the required design, one logic block is connected to another, and so on.

7.2 FPGA DESIGN FLOW

In this part of the tutorial we are going to have a short intro on the FPGA design flow. A simplified version of the design flow is given in the following diagram.

Figure 7.3: FPGA design flow

7.2.1 Design Entry

There are different techniques for design entry: schematic based, Hardware Description Language, a combination of both, etc. Selection of a method depends on the design and the designer. If the designer wants to deal more with hardware, then schematic entry is the better choice, since it gives designers much more visibility into the hardware; it is the better choice for those who are hardware oriented. When the design is complex, or the designer thinks about the design in an algorithmic way, then HDL is the better choice; language-based entry is faster but lags in performance and density. HDLs represent a level of abstraction that can isolate the designers from the details of the hardware implementation, and they are the better choice for designers who think of the design as a series of states. Another method, but rarely used, is state machines; however, the tools for state-machine entry are limited. In this documentation we are going to deal with the HDL-based design entry.

7.2.2 Synthesis

Figure 7.4 FPGA Synthesis

The synthesis process translates VHDL/Verilog code into a device netlist format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor, we need a CPU as one design element and RAM as another, and so on), then the synthesis process generates a netlist for each design element. The synthesis process will check the code syntax and analyze the hierarchy of the design, which ensures that the design is optimized for the design architecture the designer has selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis Technology (XST)).

7.2.3 Implementation

This process consists of a sequence of three steps: Translate, Map, and Place and Route.

Translate: This process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file, which can be done using the NGD Build program. Here, defining constraints is nothing but assigning the ports in the design to the physical elements (e.g., pins, switches, buttons, etc.) of the targeted device and specifying the time requirements of the design. This information is stored in a file named UCF (User Constraints File). Tools used to create or modify the UCF are PACE, the Constraint Editor, etc.

Figure 7.5 FPGA Translate

Map: This process divides the whole circuit with logical elements into sub-blocks such that they can be fit into the FPGA logic blocks. That means the map process fits the logic defined by the NGD file into the targeted FPGA elements (Combinational Logic Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.

Figure 7.6 FPGA Map

Place and Route: The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block which is very near to an IO pin, it may save time but may affect some other constraint; the tradeoff between all the constraints is taken into account by the place and route process. The PAR program is used for this process. The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output, which contains the routing information.

Figure 7.7 FPGA Place and Route

7.3 Synthesis Result

To investigate the proposed technique, different versions of the 32-bit adder were implemented and synthesized for a Xilinx XC3S500E device. Once the functional verification is done, the RTL model is taken to the synthesis process using the Xilinx ISE tool. In the synthesis process, the RTL model is converted to a gate-level netlist mapped to a specific technology library. Here, in the Spartan 3E family, many different devices are available in the Xilinx ISE tool; in order to synthesize this design, the device named "XC3S500E" has been chosen, with the package "FG320" and the device speed "-4".

RTL Schematic

The RTL (Register Transfer Logic) schematic can be viewed as a black box after synthesis of the design is made. It shows the inputs and outputs of the system. By double-clicking on the diagram we can see the gates, flip-flops, and MUXes inside.
Source code for Variable Stage Size Carry Skip Adder:

  `timescale 1ns / 1ps
  //////////////////////////////////////////////////////////////////////////////////
  // Company:
  // Engineer:
  //
  // Create Date:    12:40:01 08/13/2016
  // Design Name:
  // Module Name:    PCSKA32_VSS
  // Project Name:
  // Target Devices:
  // Tool versions:
  // Description:
  //
  // Dependencies:
  //
  // Revision:
  // Revision 0.01 - File Created
  // Additional Comments:
  //
  //////////////////////////////////////////////////////////////////////////////////
  module PCSKA32_VSS(a, b, cin, cout, s);

  input  [31:0] a, b;
  input         cin;
  output        cout;
  output [31:0] s;

  wire [8:0] c; // stage-boundary carries

  // Variable stage sizes across the 32 bits: 1, 2, 3, 4, 5, 8, 5, 4,
  // with the 8-bit Brent-Kung block as the nucleus stage.
  fa          u0 (.a(a[0]),     .b(b[0]),     .cin(cin),  .s(s[0]),     .cout(c[0]));
  PADDER2BIT  u1 (.a(a[2:1]),   .b(b[2:1]),   .cin(c[0]), .s(s[2:1]),   .cout(c[1]));
  PADDER3BIT1 u2 (.a(a[5:3]),   .b(b[5:3]),   .cin(c[1]), .s(s[5:3]),   .cout(c[2]));
  PADDER4BIT  u3 (.a(a[9:6]),   .b(b[9:6]),   .cin(c[2]), .s(s[9:6]),   .cout(c[3]));
  PADDER5BIT1 u4 (.a(a[14:10]), .b(b[14:10]), .cin(c[3]), .s(s[14:10]), .cout(c[4]));
  BrentKung8  u5 (.A(a[22:15]), .B(b[22:15]), .Cin(c[4]), .S(s[22:15]), .Cout(c[5]));
  PADDER5BIT  u6 (.a(a[27:23]), .b(b[27:23]), .cin(c[5]), .s(s[27:23]), .cout(c[6]));
  PADDER4BIT1 u7 (.a(a[31:28]), .b(b[31:28]), .cin(c[6]), .s(s[31:28]), .cout(cout));

  endmodule

The corresponding schematics of the adder after synthesis are shown below.

Figure 7.13: RTL schematic of Top-level Variable Stage Size Carry Skip Adder
Figure 7.14: RTL schematic of Internal block Variable Stage Size Carry Skip Adder
Figure 7.15: Technology schematic of Top-level Variable Stage Size Carry Skip Adder
Figure 7.16: Technology schematic of Internal block Variable Stage Size Carry Skip Adder
Figure 7.17: Internal block Variable Stage Size Carry Skip Adder

7.4 Synthesis Report

The device utilization report includes the following: Logic Utilization, Logic Distribution, and the Total Gate count for the Design. Hence, as the result of the synthesis process, the device utilization for the chosen device and package is shown below. The device utilization summary gives the details of the number of devices used from the available devices, also represented in %.

Table 7-1: Synthesis report of Variable Stage Size Carry Skip Adder

Chapter-8 SIMULATION RESULTS

All the synthesis and simulation results are obtained using Verilog HDL. The synthesis and simulation are performed on Xilinx ISE 14. The corresponding simulation results of the variable stage size carry skip adder are shown in the figures below.

Figure 8-1: Test Bench for 16 bit Variable Stage Size Carry Skip Adder
Figure 8-2: Simulated output for Variable Stage Size Carry Skip Adder

CONCLUSION

In this paper, a CSKA structure called CI-CSKA, which exhibits a higher speed and lower energy consumption compared with those of the conventional one, was proposed. The speed enhancement was achieved by modifying the structure through the concatenation and incrementation techniques. In addition, AOI and OAI compound gates were exploited for the carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and KSA structures. The results revealed a considerably lower PDP for the VSS implementation of the CI-CSKA structure over a wide range of voltages, from super-threshold to near-threshold. The results also suggested the CI-CSKA structure as a very good adder for applications where both the speed and the energy consumption are critical. In addition, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, was proposed. The efficacy of this structure was compared versus those of the variable latency RCA, C2SLA, and hybrid C2SLA structures. Again, the suggested structure showed the lowest delay, making it a better candidate for high-speed applications.

REFERENCES

[1] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, Ltd., 2002.
[2] R. Zlatanovici, S. Kao, and B. Nikolic, "Energy–delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 569–583, Feb. 2009.
[3] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.
[4] V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 754–758, Jun. 2005.
[5] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–375, Feb. 2012.
[6] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija, "Low- and ultra low-power arithmetic units: Design and comparison," in Proc. IEEE Int. Conf. Comput. Design, VLSI Comput. Process. (ICCD), Oct. 2005, pp. 249–252.
[7] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 10, pp. 689–702, Oct. 1996.
[8] Y. He and C.-H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.
[9] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18 μm full adder performances for tree structured arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 686–695, Jun. 2005.
[10] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, "Ultralow-power design in near-threshold region," Proc. IEEE, vol. 98, no. 2, pp. 237–252, Feb. 2010.
[11] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[12] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 66–68.
[13] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Dept. Inf. Technol. Elect. Eng., Swiss Federal Inst. Technol. (ETH), Zürich, Switzerland, 1998.
[14] D. Harris, "A taxonomy of parallel prefix networks," in Proc. 37th Asilomar Conf. Signals, Syst., Comput., Nov. 2003, pp. 2213–2217.
[15] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[16] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Energy-delay estimation technique for high-performance microprocessor VLSI adders," in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 272–279.
[17] M. Lehman and N. Burla, "Skip techniques for high-speed carry-propagation in binary arithmetic units," IRE Trans. Electron. Comput., vol. EC-10, no. 4, pp. 691–698, Dec. 1961.
[18] K. Chirca et al., "A static low-power, high-performance 32-bit carry skip adder," in Proc. Euromicro Symp. Digit. Syst. Design (DSD), Aug./Sep. 2004, pp. 615–619.
[19] M. Alioto and G. Palumbo, "A simple strategy for optimized design of one-level carry-skip adders," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 50, no. 1, pp. 141–148, Jan. 2003.
[20] S. Majerski, "On determination of optimal distributions of carry skips in adders," IEEE Trans. Electron. Comput., vol. EC-16, no. 1, pp. 45–58, Feb. 1967.
[21] A. Guyot, B. Hochet, and J.-M. Muller, "A way to build efficient carry-skip adders," IEEE Trans. Comput., vol. C-36, no. 10, pp. 1144–1152, Oct. 1987.
[22] S. Turrini, "Optimal group distribution in carry-skip adders," in Proc. 9th IEEE Symp. Comput. Arithmetic, Sep. 1989, pp. 96–103.
[23] P. K. Chan, M. D. F. Schlag, C. D. Thomborson, and V. G. Oklobdzija, "Delay optimization of carry-skip adders and block carry-lookahead adders using multidimensional dynamic programming," IEEE Trans. Comput., vol. 41, no. 8, pp. 920–930, Aug. 1992.
[24] V. Kantabutra, "Designing optimum one-level carry-skip adders," IEEE Trans. Comput., vol. 42, no. 6, pp. 759–764, Jun. 1993.
[25] V. Kantabutra, "Accelerated two-level carry-skip adders—A type of very fast adders," IEEE Trans. Comput., vol. 42, no. 11, pp. 1389–1393, Nov. 1993.
[26] S. Jia et al., "Static CMOS implementation of logarithmic skip adder," in Proc. IEEE Conf. Electron Devices Solid-State Circuits, Dec. 2003, pp. 509–512.
[27] H. Suzuki, W. Jeong, and K. Roy, "Low power adder with adaptive supply voltage," in Proc. 21st Int. Conf. Comput. Design, Oct. 2003, pp. 103–106.
[28] Y. Chen, H. Li, K. Roy, and C.-K. Koh, "Cascaded carry-select adder (C2SA): A new structure for low-power CSA design," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2005, pp. 115–118.
[29] H. Suzuki, W. Jeong, and K. Roy, "Low-power carry-select adder using adaptive supply voltage based on input vector patterns," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2004, pp. 313–318.
[30] Y. Chen, H. Li, K. Roy, and C.-K. Koh, "Variable-latency adder (VL-adder): New arithmetic circuit design practice to overcome NBTI," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2007, pp. 195–200.
[31] S. Ghosh and K. Roy, "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching," in Proc. Asia South Pacific Design Autom. Conf. (ASPDAC), Mar. 2008, pp. 635–640.
[32] Y. Chen, H. Li, K. Roy, and C.-K. Koh, "Variable-latency adder (VL-adder) designs for low power and NBTI tolerance," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 11, pp. 1621–1624, Nov. 2010.
[33] S. Ghosh, D. Mohapatra, G. Karakonstantis, and K. Roy, "Voltage scalable high-speed robust hybrid arithmetic units using adaptive clocking," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 9, pp. 1301–1309, Sep. 2010.
[34] Y. Zhu et al., "Design methodology of variable latency adders with multistage function speculation," in Proc. IEEE 11th Int. Symp. Quality Electron. Design (ISQED), Mar. 2010, pp. 824–830.
[35] Y.-S. Su, D.-C. Wang, S.-C. Chang, and M. Marek-Sadowska, "Performance optimization using variable-latency design style," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 10, pp. 1874–1883, Oct. 2011.
[36] K. Du, P. Varman, and K. Mohanram, "High performance reliable variable latency carry select addition," in Proc. Design, Autom. Test Eur. Conf. Exhibit. (DATE), Mar. 2012, pp. 1257–1262.
[37] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2003.
[38] NanGate 45 nm Open Cell Library, accessed Dec. 2010. [Online]. Available: http://www.nangate.com/
[39] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[40] Synopsys HSPICE, accessed Sep. 2011. [Online]. Available: http://www.synopsys.com
Report "A novel High-Speed Carry Skip Adder with AOI and OAI Logic using Verilog HDL"