An introduction to FPGA
Christoph Heer
December 2002

Abstract

This document aims to give an overview of the technology of FPGAs (Field-Programmable Gate Arrays). It focuses on aspects of the architecture and gives insights into the design flow. FPGA devices are compatible with standard CMOS technology, with the exception of those FPGAs which use flash or fuse technology. With processes below 0.2 µm, macros with a reasonable capacity of some 10,000 gate equivalents can be embedded on-chip, constituting Configurable Systems-on-Chip (CSoC). Today, chip platforms integrate a standard microprocessor core, SRAM and an FPGA. Future systems will contain specialised cores.

Contents

Chapter 1 - Introduction
    References
Chapter 2 - FPGA Architecture
    2.1 Basic Structure
    2.2 The Configurable Logical Cell
        2.2.1 Simple Transistor/Multiplexer/Gate-Based Cells
        2.2.2 LUT-Based Cells
        2.2.3 PAL/PLA-Based Cells
        2.2.4 ALU-Based Cells
    2.3 Routing Structures
    2.4 FPGA Configuration
    2.5 Distributed SRAM
    2.6 Input / Output Cells
    References
Chapter 3 - FPGA Design Flow
Appendix A - PAL / PLA Architecture
Appendix B - List of Relevant Acronyms

1 Introduction

Digital integrated circuits may be broadly classified into three categories:
1. Programmable logic
2. Application-specific logic
3. Programmable standard architectures

Programmable logic is typically a means of storing amounts of data for quick access¹, in either a volatile or non-volatile manner with respect to the power supply. The programmability of such devices could be one-time only, repeatable, or even dynamic (in the case of RAMs).

Application-specific logic is typically highly optimised in terms of functionality, performance, power and cost. The highest degree of optimisation is obtained with full-custom implementation, while semi-custom devices offer quicker design processes.
Programmable standard architectures are highly flexible, generic devices whose functionality is determined by loaded software (program code). However, the processing time of a function is long because the code is executed sequentially; such devices are therefore typically used in applications which tolerate these longer response times.

Gate arrays provide a highly standardised means of implementing digital integrated circuit designs. They are manufactured as regular arrays of patterned blocks of transistors which can be interconnected to form logic elements such as gates, flip-flops and multiplexers. The advantage is that the manufacturer can pre-produce gate array wafers without interconnections in high volume. These are then configured in an additional process step in the factory: once a customer provides a definition of the logic block interconnections, one or more layers of metal are added to form these connections.

Sea-of-gates structures are slightly different. In regular gate arrays, blank routing space is provided at regular intervals in the transistor array; in a sea-of-gates structure, the added metal interconnects have to be placed over particular transistors, rendering them unusable. The advantage is better area utilisation. These two types of devices are collectively known as MPGAs (Mask-Programmable Gate Arrays). As process technologies advance and feature sizes shrink, it is becoming increasingly expensive to configure such devices.

FPGAs (Field-Programmable Gate Arrays) and CPLDs (Complex Programmable Logic Devices)² are digital devices based on configurable logical cells and configurable interconnect structures. They are manufactured using the latest technologies and reach very high capacities in equivalent ASIC gates. The Altera APEX 20KC, for example, reaches capacities of 1.5 million gates using 0.15 µm technology [1].
Unlike MPGAs, the configuration step does not involve a technological process but is done electrically. Re-configuration is therefore an option, during system boot-up and possibly dynamically during run-time, though one-time programmable FPGAs also exist.

¹ A PLA / PAL may also be considered as a memory device, if the input vector to the array is viewed as an address vector and the output of the array as the contents of the memory location uniquely determined by that input / address.

² As is explained in Section 2.2.3, CPLDs may be considered to be a type of FPGA, and throughout this document, unless otherwise specified, the term FPGA will be used to refer to both FPGAs and CPLDs. The nature and complexity of the two types of devices are similar, even though they differ very much in architecture and possibly in the type of application too.

FPGA devices provide a very high degree of flexibility based on a standard architecture producible in large quantities. They support the implementation of a wide range of circuit types and offer a lot of potential for parallel processing. In this respect they appear superior to DSP architectures.

Because no mask has to be generated to configure an FPGA, the hardware implementation of logic circuits is faster and small quantities may be produced at a reasonable cost. FPGAs can be used for fast functional verification during the development phase, avoiding the long waiting times associated with simulation. The cost of prototyping and the time-to-market of new designs are therefore reduced, as is the cost of small-volume production of particular designs.

Most FPGAs are re-configurable even after the chip has been deployed in its application. In particular, FPGA macros which are embedded together with standardised cores on the same die allow further flexibility.
Thus, for example, if one such embedded FPGA (eFPGA) macro is used in a communications transceiver, changes in the communications protocol may be taken care of simply by re-configuring the eFPGA, rather than re-designing the whole transceiver.

All these advantages, however, come at the cost of increased signal delay and power consumption, and worse utilisation of chip area, when compared to equivalent logic circuits implemented in full-custom or semi-custom.

To summarise, systems implemented using FPGAs offer the following advantages and disadvantages over semi-custom and full-custom devices:

Advantages:
• Fast and cheap procedure for implementing hardware
• Fast functional verification
• Low cost of low-volume production
• Improved time-to-market
• Re-configurability in the field

Disadvantages:
• Non-optimal utilisation of silicon area
• Signal delay and power consumption are higher
• Routing problems could limit flexibility
• Potential clock-skew problems

Despite these disadvantages, the market for stand-alone FPGA devices has in recent years exploded into a billion-dollar business, and further growth is expected as process technologies improve. The main benefit of flexibility without the costs of mask generation will then be even more significant. Since FPGAs are compatible with standard CMOS processes, the embedding of FPGA macros into larger designs will be a common technique in the near future.
The following market models are foreseeable:

1. Programmable once:
• derivatives of standard devices
• low cost of customisation even in low quantities
• protection of intellectual property, as read-outs of programmed gate arrays are harder to obtain than those of full-custom designs

2. Re-programmable:
• prototyping and functional development on standard platforms
• in-field customisation and updating
• multiple-application hardware

In conclusion, although FPGAs are sub-optimal in terms of physical implementation, they offer great potential for producing standard cores which are individually customisable at low cost.

References

[1] Altera, Data Sheet, APEX™ 20KC Programmable Logic Device, ver. 1.1, April 2000.

2 FPGA Architecture

2.1 Basic Structure

Figure 2.1 - Basic FPGA architecture [1].

The basic architecture of an FPGA (Figure 2.1) is an array of identical, configurable logical cells. The periphery of the device consists of a number of configurable input/output cells. The array is interwoven with configurable interconnect resources and switches, which provide connection routes between all these elements. Additionally, FPGAs may have small RAM blocks distributed in the array; these may also be configured to provide one logically lumped memory unit.

The array of configurable logical cells may be structured in several ways, as shown in Figure 2.2:
a) Symmetric matrix
b) Rows of cells
c) Sea of cells: this term refers to the fact that no dedicated routing resources exist between the structured logical cells; instead, connections are switched through the cells themselves.
d) Hierarchical structure

Figure 2.2 - a) Symmetric matrix architecture b) Rows c) Sea of cells d) Hierarchy [2a].

An FPGA device is generally designed to allow the implementation of practically any logic circuit.
This, however, requires an area trade-off between a sufficient number of flexible configurable logical cells and enough interconnect resources to allow all connections between these cells. As the majority of circuits will only utilise a small portion of the routing and logic resources, this results in a loss in speed (incurred by signals passing through redundant routing elements) and in density of logic when compared to the same circuit implemented in dedicated logic.

An interesting concept is the grouping of different FPGA devices with related architecture into a family [3]. Each member of a family would be physically tailored to a certain class of application architecture, for example by replacing the switches in certain routes by hard shorts, or by hard-wiring the logical cells internally in a certain manner. Such a member may implement certain circuits more efficiently, but its reduced flexibility means that some circuits may not fit onto the device at all. Implementation of a circuit then becomes a question of choosing the right device from the FPGA family.

The IEEE Std. 1149.1 Joint Test Action Group (JTAG) standard describes boundary-scan test circuitry which facilitates functional verification and debugging of FPGA cores by allowing the observation of logic nodes without the need to bring these nodes out via an I/O pin. Dynamic configuration of the FPGA may also be done through the JTAG interface.

2.2 The Configurable Logical Cell

The CLC (Configurable Logical Cell) is used to implement a number of logic functions (generally one or two) of a larger number of inputs. A cell may consist of various combinations of the following elements:
• Transistors
• Basic gates (NAND, XOR, ...)
• Flip-flops
• Multiplexers
• Look-up tables (LUTs)
• AND-OR arrays (sum-of-products)

The term granularity refers to a quantification of the complexity of the CLC and can depend on the following:
• Number of logical functions which may be implemented by each CLC
• Number of equivalent NAND2 gates of each CLC
• Total number of transistors that physically constitute the CLC

In the terminology used here, an FPGA device of higher granularity consists of a larger number of less complex CLCs, requiring more complex interconnections. FPGAs can therefore be classified according to the granularity of their array structures. Arrays of gates or transistors represent the highest extreme of the granularity scale, while arrays of microprocessors or ALUs are at the other end, since the CLCs in this case are of very high complexity and require simpler interconnect resources.

2.2.1 Simple Transistor / Multiplexer / Gate-Based Cells

Figure 2.4 - Cell of transistor chains [2a].

The most basic type of configurable logical cell consists of simple groupings of transistors. Programmable devices based on such cells are conceptually very similar to gate arrays and require complex routing to implement large logic circuits. Figure 2.4 shows a logical cell formed of transistor chains.

As a second example of a device with high granularity, Figure 2.5 shows a simple CLC based on multiplexers and a standard OR gate. This is used in the Actel 40MX family. The 8-input, 1-output cell can implement basic logic gates (NAND, AND, OR, NOR) with 2, 3 or 4 inputs. Efficient use of interconnecting resources allows the implementation of any logic function, including flip-flops, by wiring a number of gates together.

Figure 2.5 - Actel 40MX CLC [4].

2.2.2 LUT-Based Cells

Most FPGAs use logical cells which are based on Look-up Tables (LUTs), the largest exception being CPLDs.
An LUT is realised as a number of memory locations (e.g. SRAM) which are set during the configuration phase. During operation, the vector of input signals selects one memory location, the content of which is switched to the output of the LUT. This is implemented by means of pass transistors. In the example LUT shown in Figure 2.6, depending on the inputs A, B and C, a path is switched through a decision tree of depth three. The contents of the memory cell (in this case 1 bit) corresponding to that path then appear at the output. Using this architecture, any combinational function of the three inputs may be implemented.

An LUT with more inputs can implement more logic, thereby reducing the number of logical cells needed and with it the chip area needed to provide the routing between the cells (Figure 2.3). However, LUT complexity grows exponentially with the number of inputs. Previous research [5] has shown that a 4-input LUT is the most efficient in terms of area, and most commercial FPGA vendors in fact use LUTs of this size.

It is also common practice to use two LUTs in parallel. The two outputs could either be dynamically selected using a multiplexer or propagated as two output ports of the logical cell. In the first case a logical cell of 4 inputs, for instance, could be implemented using two 3-input LUTs and one multiplexer which is switched by the fourth cell input. The benefit of splitting the LUT is increased flexibility in configuring the logical cell.

Figure 2.6 - LUT architecture [2b].

Whilst the LUT implements combinational logic circuits, logical cells must also contain flip-flops to be able to implement sequential logic. Figure 2.7 shows a simplified CLC for a typical FPGA.

Figure 2.7 - Basic CLC architecture [6].

Figure 2.8 - CLC configurations [7].

This simple cell can now be configured in several modes to implement various basic types of digital circuit (Figure 2.8).
The most common configurations are:

• Synthesis mode: Any logic function of up to 4 variables in its registered or direct form.
• Arithmetic mode: The LUT is split to provide any two logic functions of the same 3 variables. In the arithmetic mode, the inputs A, B, C are the addends and the Carry-in, whilst the output functions are the Sum and the Carry-out.
• Multiplier mode: This mode also implements an adder, with the addends this time being partial products and the Carry-in from the previous bit position. The partial product of A and B may be implemented with an AND gate. In the case of the Atmel AT40K device [7], from which these configurations were sourced, an AND gate is included in the architecture of the CLC for this purpose, avoiding the wasteful reservation of an LUT input to implement such a simple function.
• Counter mode: The LUT provides two logic functions (counter Output and Carry-out) of the same 2 variables, which are a Carry-in and the previous Output. The feedback loop to use this output as an input is normally provided for within the CLC; this could also be implemented externally by connecting appropriate routes.
• Multiplexer (2:1) mode: The LUT is configured to provide a logic function of 3 variables, where one selects one of the other two inputs. As an example, the case where C is the select line for A and B will be considered. In this case the 1-bit memory cells in the LUT are configured to implement the following truth table:

    A   B   C   D   O/p
    0   0   0   x   0
    0   1   0   x   0
    1   0   0   x   1
    1   1   0   x   1
    0   0   1   x   0
    0   1   1   x   1
    1   0   1   x   0
    1   1   1   x   1

Note that some configurations, namely Arithmetic, Counter and Multiplier modes, require 2 distinct functions of 3 inputs. Both Atmel and Actel in fact provide an architecture with two separate 3-input LUTs. This is equivalent to a 4-input LUT in terms of the number of gates required to implement the LUT.
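The LUT behaviour and two of the modes listed above can be illustrated with a small software model. This is only a sketch in Python, not a description of any vendor's circuit: the class `LUT`, the helper `pass_transistors` and the bit ordering (first input as most significant address bit) are assumptions made for the illustration. The transistor count assumes a full binary tree of pass transistors as in Figure 2.6.

```python
from itertools import product


class LUT:
    """n-input look-up table: 2**n one-bit memory cells written at configuration time."""

    def __init__(self, n_inputs, bits):
        if len(bits) != 2 ** n_inputs:
            raise ValueError("an n-input LUT needs exactly 2**n configuration bits")
        self.n_inputs = n_inputs
        self.bits = list(bits)

    @classmethod
    def from_function(cls, n_inputs, func):
        # "Configuration": tabulate func for every input vector (first input = MSB).
        bits = [int(bool(func(*v))) for v in product((0, 1), repeat=n_inputs)]
        return cls(n_inputs, bits)

    def __call__(self, *inputs):
        # "Operation": the input vector addresses exactly one memory cell.
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return self.bits[address]


# Multiplexer (2:1) mode: C selects A when C = 0 and B when C = 1.
mux = LUT.from_function(3, lambda a, b, c: b if c else a)

# Arithmetic mode: two 3-input LUTs sharing the inputs A, B and Carry-in.
sum_lut = LUT.from_function(3, lambda a, b, cin: a ^ b ^ cin)
carry_lut = LUT.from_function(3, lambda a, b, cin: (a & b) | (a & cin) | (b & cin))


def pass_transistors(n_inputs):
    """Pass transistors in a full binary selection tree of depth n_inputs."""
    return sum(2 ** level for level in range(1, n_inputs + 1))


for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "mux:", mux(a, b, c),
          "sum:", sum_lut(a, b, c), "carry:", carry_lut(a, b, c))

# Two 3-input trees plus a 2-transistor output selector match one 4-input tree.
print(2 * pass_transistors(3) + 2, pass_transistors(4))
```

In this model "configuring" the cell means nothing more than choosing the list of stored bits; the lookup itself never changes, which is why one physical cell can serve every mode.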
Extending the 3-input LUT in Figure 2.6 to a 4-input LUT involves inserting a fourth input line D and increasing the depth of the tree to four, which requires an additional 16 pass transistors. Since each 3-input LUT contains 14 transistors, having two 3-input LUTs and using D to select which of the outputs will be registered by the flip-flop (a second tree of 14 transistors plus 2 for the selector) also results in an additional 16 transistors. Figure 2.9 shows one such example in the Actel VariCore CLC [8] (ignore the Carry In and Carry Out lines at this point). Other devices, like the Altera APEX 20K [9], have 4-input LUTs with a second output specifically providing a Carry-out line.

Additionally, the CLCs contain interfacing logic to the routing resources and in some cases specialised functionality such as fast carry and cascade chains to speed up arithmetic operations. The internal connectivity of each cell is determined by a number of multiplexers which can be used to configure all possible interconnections between LUT, flip-flop and local routing lines.

Figure 2.9 - Actel VariCore CLC architecture [8].

An example of an LUT-based CLC of higher complexity (5 inputs / 2 outputs) is the Xilinx XC3000 CLC [10], which uses a 5-input LUT and 2 flip-flops to implement more complex functions with fewer cells. The obvious penalty is less efficient CLC utilisation.

Figure 2.10 - Xilinx XC3000 CLC [10].

2.2.3 PAL/PLA-Based Cells

Complex Programmable Logic Devices (CPLDs) are also devices with high cell complexity. The CLC of a CPLD is, not surprisingly, called a Simple Programmable Logic Device (SPLD) and is based on sum-of-products (also called AND-OR) logic. Each SPLD is made up of a PAL or PLA³, macrocells and input / output structures. The PLA / PAL produces a number of product terms which are functions of the inputs to the SPLD. The number of macrocells per PLA / PAL determines how many different logic functions may be obtained from a selection of the same set of product terms.
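The way several macrocell outputs draw on one shared set of product terms can be sketched in a few lines of Python. This is a minimal illustration, not a model of a particular device: the function `pla_eval` and the data layout of `AND_PLANE` and `OR_PLANE` are hypothetical names chosen for the example.

```python
def pla_eval(inputs, and_plane, or_plane):
    """Evaluate a sum-of-products array.

    inputs:    dict mapping input name -> 0 or 1
    and_plane: list of product terms; each term maps an input name to the
               value it must have (1 = true literal, 0 = complemented literal)
    or_plane:  one list of product-term indices per output (macrocell)
    """
    products = [
        int(all(inputs[name] == wanted for name, wanted in term.items()))
        for term in and_plane
    ]
    return [int(any(products[i] for i in selected)) for selected in or_plane]


# Three shared product terms of the inputs a and b.
AND_PLANE = [
    {"a": 1, "b": 0},  # a AND (NOT b)
    {"a": 0, "b": 1},  # (NOT a) AND b
    {"a": 1, "b": 1},  # a AND b
]

# Three macrocells, each OR-ing a different selection of the same terms.
OR_PLANE = [
    [0, 1],      # XOR
    [2],         # AND
    [0, 1, 2],   # OR
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, pla_eval({"a": a, "b": b}, AND_PLANE, OR_PLANE))
```

Adding a macrocell here only adds a row to `OR_PLANE`; the product terms are computed once and reused, which is the sharing the text describes.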
Whether the OR logic is lumped in the PLA / PAL cell or the macrocell block is simply a question of labelling and is manufacturer-dependent.

³ The difference between a PAL and a PLA is explained in Appendix A.

Figure 2.11 - Generic CPLD architecture [11].

As with other FPGAs, all the logical cells can be interconnected using routing resources, though in the case of the CPLD these tend to be simpler and based on signal lines running through the whole device, a characteristic of a low-granularity device. This also means that delays between cells are predictable.

Figure 2.12 shows the CLC of Altera CPLDs, consisting of AND gates with high fan-in (gates with more than 20 inputs) which converge in OR gates of 3 to 8 inputs. This structure allows the implementation of complex logic functions using a minimal number of CLCs, reducing the required number of interconnections. In practice, though, it is very difficult to use the array to its maximum complexity, so density is wasted.

Figure 2.12 - Altera CLC architecture [2c].

2.2.4 ALU-Based Cells

FPGAs based on arrays of ALUs have recently appeared on the market as very low-granularity programmable devices. Companies offering such solutions, or in the process of developing them, include Adaptive Silicon, LSI (architecture licensed from Adaptive Silicon), PACT corporation and Elixent. Arrays of statically programmed ALUs can be configured into synchronous DSP pipelines yielding powerful instruction-level parallelism.

Figure 2.13 - Array of 4-bit ALUs [12].

2.3 Routing Structures

Four types of routing networks are needed in an FPGA device:
• Power feeding network
• Reset and multiple clock networks (local / global)
• Signal network interconnecting all cells
• Configuration lines

A strategy adopted by most manufacturers to different extents is the structuring of the device into some sort of hierarchy, by segmenting the array into groups of CLCs.
Routing lines interconnecting the cells could then be broadly classified into three different types:
• Local routing lines directly interconnecting neighbours
• Interconnects to route signals within a cluster of cells
• Global interconnects to transmit signals throughout the whole array

Local routing lines are of low fan-out and limited length. The switching in this case is done from within the CLC, to create fast point-to-point interconnections useful, for instance, for fast arithmetic operations. These connections allow the most efficient implementation of standard structures (such as multiplier elements, shift registers, etc.) in terms of utilisation and speed.

Figure 2.14 - Example of a routing resource using programmable switches [2a].

Routing within a cluster of cells is done by means of a matrix of interconnection lines, which may be configured to realise connections between any two CLCs or between one CLC and an I/O cell. Different routes are made using routing resources which consist of configurable pass-transistor switches. An example of such a resource is shown in Figure 2.14. Emphasis must be placed on the importance of having efficient CAD tools which make good use of the CLCs and place them for minimal distance.

Each of the switches in such programmable routing resources is equivalent to an RC element, meaning that it introduces a propagation delay to the signal. Figure 2.15 shows how the route between two CLCs, passing through a switching matrix and two programmable interconnection points (PIPs in Xilinx terminology) which connect the cell to a line, may be represented by an equivalent RC model. With FPGA devices of high granularity, the routing resources are more complex, meaning that there is a large number of very different routes between two cells, each of which has a very different associated delay. For this reason, low-granularity devices have more easily predictable delays between cells.
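To give a feel for how the number of switches on a route affects its delay, the RC chain described above can be evaluated with the Elmore approximation. The choice of the Elmore model is an assumption made for this sketch (the text only states that each switch behaves as an RC element), and the resistance and capacitance values below are hypothetical, not taken from any data sheet.

```python
import math


def elmore_delay(stages):
    """Elmore delay of an RC ladder.

    stages: list of (resistance_ohm, capacitance_farad) pairs, ordered from
            the driving cell to the receiving cell.
    Each capacitor is charged through all the resistance upstream of it.
    """
    delay = 0.0
    r_upstream = 0.0
    for r, c in stages:
        r_upstream += r
        delay += r_upstream * c
    return delay


# Hypothetical values for one programmable switch: 1 kOhm, 50 fF.
R_SWITCH = 1_000.0
C_SWITCH = 50e-15

# Route of Figure 2.15: PIP -> switching matrix -> PIP (three RC elements).
short_route = [(R_SWITCH, C_SWITCH)] * 3
# A detour through three further switches.
long_route = [(R_SWITCH, C_SWITCH)] * 6

print(f"3 switches: {elmore_delay(short_route) * 1e12:.0f} ps")
print(f"6 switches: {elmore_delay(long_route) * 1e12:.0f} ps")
```

Under this model, doubling the number of identical switches more than triples the delay, because the delay of n equal stages is R·C·n(n+1)/2. This illustrates why two routes between the same pair of cells can differ widely in timing.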
Global interconnects require strong signal driving and do not use the above-mentioned routing matrices. They enable the transmission of global signals to all CLCs with minimal delay and attenuation of logic levels. Because of the large distances, there could be the need for signal refresh using tri-state buffers.

Figure 2.15 - Breakdown of route into equivalent electrical model [2a].

The level of connectivity between cells in the FPGA has a direct effect on the total area of the circuit. Recent advancements in semiconductor process technology have increased the number of metal layers available for interconnection (from 2 to 7 layers), albeit at a cost. Extra layers can be used to reduce the amount of area required for more complex interconnectivity and allow the allocation of specific layers to particular functions such as power supply and clock signals.

Different FPGA manufacturers have adopted very different solutions to the complex question of routing between cells in an FPGA device. Therefore the routing architectures of the different devices will be addressed in more detail in the chapters concerning the particular devices.

2.4 FPGA Configuration

FPGA devices allow the configuration of all CLCs, I/O cells and interconnect resources. The gate of each configurable transistor is controlled by the contents of a 1-bit memory cell, with a logic '0' or logic '1' determining whether the gate is off or on. To reduce the wiring required for configuration, the memory cells can be connected in a chain and the configuration is then loaded using a shift operation.

Depending on the physical configuration mechanism, it is possible to classify FPGAs into three classes:
• One-time configurable devices
• Non-volatile re-configurable devices
• Volatile re-configurable devices

One-time programmable devices store configuration using fuses or anti-fuses. The former are normally closed structures, while the latter are normally open.
A device based on fuse technology is programmed by physically breaking the connections between appropriate structures. On the other hand, a device based on anti-fuses is programmed by melting interconnections between particular cells to generate contacts. The Actel eX [13], MX [4] and SX [14] families are based on anti-fuse structures.

In the case of re-programmable devices, activation or deactivation of interconnects is implemented by means of pass transistors or tri-state buffers (Figure 2.16). Memory units also store the configuration of LUTs and static multiplexers in the CLC. If the type of memory used is EEPROM, the device is non-volatile, but the difficult mechanism of re-configuration imposes limitations on the application of the system. SRAM memory, on the other hand, loses the configuration once power is removed from the device (volatile), but it is simple and quick to configure. The use of SRAM allows for dynamic re-configuration of the device even during real-time operation. Small local SRAM blocks may also be used to store several configuration bits. In this case, unlike in the application of SRAM blocks for ordinary data storage, there is no need to select the read lines.

Figure 2.16 - Configuration of FPGA devices [15].

In commercial applications, a separate PROM device is used to store the configuration, which is then loaded into the FPGA SRAM at system start-up via a special configuration interface which usually allows both serial and parallel configuration modes. In systems which combine eFPGA cores with microprocessor cores, the processor could load new configurations into the FPGA. To facilitate system testing and debugging, many devices support read-out of the configuration. The IEEE Std. 1149.1 JTAG standard describes boundary-scan circuitry which allows the observation and configuration of individual elements for such purposes.

2.5 Distributed SRAM

Several applications require the use of local memory units.
For this purpose, many FPGAs include small SRAM blocks, which are distributed in an array-like structure throughout the device. This is known as distributed RAM and could be configured as one logical RAM unit. Compared to a lumped memory block external to the FPGA core, this type of RAM offers faster access by the FPGA and more flexibility in configuring both the memory itself and the communication between different processes and memory blocks. In most cases these distributed memory blocks can be configured as multiple independent synchronous / asynchronous, single-port / dual-port RAM blocks, often offering a trade-off between the widths of the address and data buses. For example, Altera's FLEX 10K [16] allows the following configurations of the same 2,048 bits: 256x8, 512x4, 1024x2, 2048x1.

The LUT in an LUT-based CLC could be looked at as a small memory unit with the flip-flop used to latch the output. Some FPGA devices, like the Xilinx XC4000 Series [17], also allow the configuration of several CLCs into distributed RAM, though of course this implies a loss in logic resources. In the survey carried out on commercially available FPGAs, the only type of distributed RAM described was that implemented as SRAM blocks distributed throughout the device.

2.6 Input / Output Cells

An important aspect of flexibility on an architectural level is the interface between an IC and external circuitry. There may be the need to support different bus standards with the same core logic, or to allow different IC pin-outs as required by different board layouts. The input / output cells on an FPGA device are programmable blocks situated on the periphery of the circuit. As an example, the basic structure of the I/O cell of the Xilinx XC4000 Series [17] will be examined, as shown in the simplified block diagram in Figure 2.17.
In general, it may be assumed that other manufacturers use a similar architecture in the I/O cells of their devices; if however there are large differences, then these are explained in the respective sections.

Figure 2.17 - Simplified block diagram of XC4000E Series IOC [17].

The structure incorporates the following features:
• D flip-flops which could be used to provide sequential buffering of the input or output line.
• The tri-state output buffer may be put in a state of high impedance by means of an activate signal, implementing tri-state outputs or bi-directional I/O.
• The output slew rate may be controlled at the configuration stage.
• The output pull-up device may be configured as either an n-channel transistor, pulling to one threshold level below Vcc, or a p-channel transistor, pulling up to Vcc.
• The input thresholds can be configured for either TTL or CMOS logic levels.
• Programmable pull-up and pull-down resistors are used to tie floating pins to Vcc or ground respectively.

References

[1] J. Carrabina, F. Lisa and A. J. Velasco, Implementación con FPGAs, Chapter 11 from the book Sistemas Digitales, 2000.
[2] S.A. Bota Ferragut, FPGAs, Internal Communication, Universitat de Barcelona.
    [2a] Chapter 1. Introducción
    [2b] Chapter 3. Arquitectura Logic Cell Array (LCA) de Xilinx
    [2c] Chapter 5. Arquitectura Multiple Array Matrix (Max-plus) de Altera
[3] V. Betz and J. Rose, Using Architectural "Families" to Increase FPGA Speed and Density, University of Toronto.
[4] Actel, Data Sheet, 40MX and 42MX FPGA Families, ver. 5.0, February 2001.
[5] J. Rose, R. J. Francis, D. Lewis and P. Chow, Architecture of Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency, IEEE Journal of Solid State Circuits, Oct. 1990, pp. 1217 - 1225.
[6] V. Betz and J. Rose, How Much Logic Should Go in an FPGA Logic Block?, University of Toronto.
[7] Atmel, Data Sheet, AT40K FPGAs, January 1999.
[8] Actel, Data Sheet, VariCore™ EPGA™ Family, rel. 1.0, February 2001.
[9] Altera, Data Sheet, APEX™ 20K PLD Family, ver. 4.0, August 2001.
[10] Xilinx, Data Sheet, XC3000 Series FPGAs, ver. 3.1, November 1998.
[11] A. Dhir, Introducing Xilinx and Programmable Logic Solutions for Home Networking, ver. 1.0, March 2001.
[12] www.elixent.com
[13] Actel, Data Sheet, eX Family FPGAs, ver. 0.3, March 2001.
[14] Actel, Data Sheet, 54SX Family FPGAs, ver. 3.0.1, May 2000.
[15] V. Betz and J. Rose, FPGA Routing Architecture: Segmentation and Buffering to Optimise Speed and Density, University of Toronto.
[16] Altera, Data Sheet, FLEX® 10K Embedded PLD Family, ver. 4.1, March 2001.
[17] Xilinx, Data Sheet, XC4000E and XC4000X Series FPGAs, ver. 3.1, ver 1.6, May 1999.

3 FPGA Design Flow

The process of circuit design on FPGA devices is highly automated and involves the use of flexible and powerful CAD tools. The efficiency of the tools used has a direct impact on the overall design time and the efficiency of the FPGA implementation:

• Design Entry. This is the starting point of the design process and involves capturing the design using a high-level description language like Verilog or VHDL. Alternatively, a schematic editor is used to enter the design at basic logic level, or by making use of generic blocks which in turn are described in high-level languages. Other possibilities include entry of the design using state diagrams. The CAD software provided by FPGA manufacturers includes libraries of standard circuits or macro-functions to quickly implement common circuits of varying complexity. The schematic or HDL description is then translated into a netlist describing the circuit in terms of logic gates and sequential elements.

• Logic Synthesis. This tool optimises the circuit by regrouping logic functions and/or removing redundancies. Such optimisation is carried out according to design constraints or rules, which could be minimising area or maximising speed.
Once the optimised netlist is obtained, it has to be mapped onto the logical cell of the FPGA (LUT / flip-flop, PLA ...). The aim of this is to minimise the total number of CLCs to be used.

• Floorplanning. The circuit to be designed is now divided into partitions, each of which is adjusted to be implemented in a particular area on an FPGA device. A partition usually corresponds to a large section of the circuit which has a particular functionality, e.g. a multiplier, filter bank etc. In this step, the total number of FPGA devices required is also determined.

• Place and Route. A logic partition is now mapped onto an FPGA device by means of the placement tool, which assigns a physical place in the array of CLCs to each function (LUT / flip-flop, PLA ...). Typical placement algorithms aim to minimise the total length of the interconnections in the final design, with the objective of maximising the speed of the device. Routing algorithms configure the routing elements to provide the required connections between logic elements. The primary aim of any routing algorithm is to ensure that 100% of the required routes can be realised. Other goals of routing algorithms include finding the shortest possible paths between elements. Because of restricted interconnection resources, this step is the most restrictive.

• Layout Verification. This step involves extracting the physical layout of the design and simulating it using commercial simulators to obtain timing data, as well as checking design rules (DRC). If the delays associated with the interconnections within the prototype fulfil the delay constraints imposed by the design specifications, then the device may be programmed; otherwise the placement and routing steps have to be repeated until a satisfactory configuration is found.

• Macro Integration. This involves the provision of all the necessary files and data formats for integrating the macro in the design flow of the whole chip.
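The placement goal named in the Place and Route step, minimising total interconnection length, can be illustrated with a toy example. The sketch below is not the algorithm of any vendor tool: the half-perimeter wirelength estimate and the greedy pairwise-swap strategy are assumptions chosen for brevity (production placers typically use more elaborate optimisation), and all cell names and coordinates are made up.

```python
from itertools import combinations


def wirelength(placement, nets):
    """Total half-perimeter of the bounding box of every net.

    placement: dict mapping cell name -> (x, y) position in the CLC array
    nets:      list of tuples of cell names that must be connected
    """
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def improve_by_swaps(placement, nets):
    """Greedy placement: keep swapping two cells while that shortens the wiring."""
    placement = dict(placement)
    best = wirelength(placement, nets)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(placement), 2):
            placement[a], placement[b] = placement[b], placement[a]
            cost = wirelength(placement, nets)
            if cost < best:
                best = cost
                improved = True
            else:
                # Undo a swap that did not help.
                placement[a], placement[b] = placement[b], placement[a]
    return placement, best


# Four cells in one row of the array, connected as a chain A-B-C-D.
nets = [("A", "B"), ("B", "C"), ("C", "D")]
initial = {"A": (0, 0), "C": (1, 0), "B": (2, 0), "D": (3, 0)}

final, final_cost = improve_by_swaps(initial, nets)
print("initial wirelength:", wirelength(initial, nets))
print("final wirelength:  ", final_cost)
print("final placement:   ", final)
```

A greedy search like this can stall in a local minimum on larger designs, which is one reason placement and routing may have to be repeated when timing constraints are not met.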
Once the circuit has been verified, the design configuration is output in a format which is readable as an input to the FPGA device which is to be programmed. Programming the device may take only a matter of minutes.

Appendix A - PAL / PLA Structure

Figure A.1 - PAL / PLA structure [1].

A PLA (Programmable Logic Array) provides a structured form of implementing combinational functions which are in the form of sum-of-products of a number of input lines to the device. As shown in Figure A.1, PLAs are built of two distributed-gate arrays. These two arrays are programmed by forming a connection between the array input lines and the logic gate (AND, OR) inputs. The first array provides the products (and is therefore known as the AND plane) and the second provides the desired sum of these products (and is known as the OR plane). A PAL device is a variation in which the OR plane is fixed.

References

[1] Xilinx, Data Sheet, CoolRunner XPLA3 CPLD, ver. 1.4, April 2001.

Appendix B - List of Relevant Acronyms

ASIC        Application-Specific Integrated Circuit
CLC         Configurable Logical Cell
CPLD        Complex PLD
CSoC        Configurable SoC
DRC         Design Rule Check
eFPGA       embedded FPGA
FPGA        Field-Programmable Gate Array
IOC         Input / Output Cell
JTAG        Joint Test Action Group
LUT         Look-Up Table
MPGA        Mask-Programmable Gate Array
PLA / PAL   Programmable Logic Array
PLD         Programmable Logic Device
SoC         System-on-Chip
SPLD        Simple PLD