Technical POC:
Keshab K. Parhi
Professor, Dept. of Electrical & Computer Engineering
Univ. of Minnesota
Minneapolis, MN 55455
Tel: 612-624-4116
Fax: 612-625-4583
Email: parhi@ee.umn.edu
1. Significant Accomplishments:
------------------------
A. Digit-Serial Building Blocks
During the last few months, the novel digit-serial architectures designed earlier have been modified to incorporate Booth recoding. Based on this, a new modified Booth recoded type-II transformed digit-serial multiplier has been designed. This design has fewer adders and a shorter critical path than the previous architectures. In addition, the latency of these modified Booth recoded digit-serial multipliers is half that of the multipliers without recoding. The latency is reduced further by using Wallace tree based carry-save adders.
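The effect of recoding on the number of partial products can be illustrated with a minimal Python sketch. It assumes standard radix-4 (modified) Booth recoding of a two's-complement multiplier and does not model the type-II transformed architecture itself; the function names booth_radix4_digits and booth_multiply are illustrative only.

```python
def booth_radix4_digits(y, n_bits):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first.
    y must fit in n_bits as a signed value (assumption of this sketch).
    """
    if n_bits % 2:
        n_bits += 1              # sign-extend to an even number of bits
    digits = []
    prev = 0                     # implicit bit to the right of the LSB
    for i in range(0, n_bits, 2):
        b0 = (y >> i) & 1        # Python ints sign-extend on right shift
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, n_bits):
    """Multiply by summing one partial product per radix-4 Booth digit."""
    digits = booth_radix4_digits(y, n_bits)
    return sum(d * x * 4 ** k for k, d in enumerate(digits))
```

For an 8-bit multiplier the recoding yields four digits instead of eight bits, so only half as many partial products have to be accumulated.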
B. Programmable DSP:
B-1: Architecture and Instruction Set:
We have defined the architecture for the programmable DSP chip. The architecture consists of a program counter, instruction cache, VLIW register, register files, crossbar network, DSP cores, and data cache.
The VLIW register holds eight 64-bit instructions packed together. Four register files are shared by two DSP cores. Each register file has 8 read ports, 6 write ports, and 64 32-bit registers. The last 4 registers of each register file are connected to a crossbar network, which has 16 32-bit read ports. Each of the four register files can therefore access the last four registers of any other register file, which facilitates communication between the register files.
The crossbar network is designed using decoders and simple transmission gates. Each DSP core has two multiply-accumulate (MAC) units, one shifter, and two ALUs. Clock gating has been employed in each DSP core to reduce power consumption. As a result, only the functional units that are active in the DSP core consume power, and the others remain idle. In addition, a local data cache and a forwarding unit have been included in each DSP core. The forwarding unit helps to avoid data hazards within the DSP core.
The three write ports of the register file are distributed as follows. One shifter and one ALU share one write port. The second write port is exclusive to the other ALU. The third write port is shared by the two MACs. As a result, the DSP core can initiate only one MAC operation in any given clock cycle. However, because the MAC is based on a digit-serial design, it takes four clock cycles to generate a result, so two MAC operations can be in progress at once provided they are issued in different clock cycles. They will then not try to write to the register file in the same clock cycle, which avoids any data hazard.
We have finalized the instruction set of the proposed programmable DSP chip. We plan to have two types of instructions: (a) two short RISC sub-instructions packed into one word, and (b) one long RISC sub-instruction. The short sub-instructions are 29 bits each and the long instruction is 61 bits wide. The advantage of this type of instruction set is that it enables operations on 32-bit immediate data. Each short sub-instruction uses 9 bits for the opcode, 14 bits for the two source registers, and 6 bits for the destination register. One source register field is 6 bits wide and the other is 8 bits wide. The two extra bits of the second source register field are used for accessing the crossbar network. The long instruction uses 9 bits for the opcode, 14 bits for the two source registers, 6 bits for the destination register, and 32 bits for the immediate data. The instruction set can be classified into 6 types: load/store, logic, branch, arithmetic w/o saturation, MMX (multi-media extensions), and other DSP-specific operations. The total number of instructions in the instruction set is 115. Examples of short sub-instructions are ADD (add), SHL (shift left), and ROL (rotate left). Examples of long instructions are JMP (jump), LDBI (load byte immediate), and ADDI (add immediate).
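The field widths above can be checked with a short Python sketch. The widths are taken from the text; the ordering of the fields within a word, the placement of the two short sub-instructions, and all function names are assumptions made for illustration, and the use of the remaining bits of the 64-bit word is not specified here and is left open.

```python
OPCODE_BITS = 9
SRC1_BITS = 6    # plain register index (64 registers per file)
SRC2_BITS = 8    # register index plus 2 bits for the crossbar network
DST_BITS = 6
IMM_BITS = 32
SHORT_BITS = OPCODE_BITS + SRC1_BITS + SRC2_BITS + DST_BITS   # 29
LONG_BITS = SHORT_BITS + IMM_BITS                             # 61


def _field(value, width, name):
    """Return value unchanged if it fits in an unsigned field of this width."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")
    return value


def encode_short(opcode, src1, src2, dst):
    """Pack one short sub-instruction (assumed order: opcode, src1, src2, dst)."""
    word = _field(opcode, OPCODE_BITS, "opcode")
    word = (word << SRC1_BITS) | _field(src1, SRC1_BITS, "src1")
    word = (word << SRC2_BITS) | _field(src2, SRC2_BITS, "src2")
    word = (word << DST_BITS) | _field(dst, DST_BITS, "dst")
    return word


def decode_short(word):
    """Inverse of encode_short; returns (opcode, src1, src2, dst)."""
    dst = word & ((1 << DST_BITS) - 1)
    word >>= DST_BITS
    src2 = word & ((1 << SRC2_BITS) - 1)
    word >>= SRC2_BITS
    src1 = word & ((1 << SRC1_BITS) - 1)
    word >>= SRC1_BITS
    return word, src1, src2, dst


def encode_long(opcode, src1, src2, dst, imm):
    """Pack one long instruction: short fields followed by a 32-bit immediate."""
    short = encode_short(opcode, src1, src2, dst)
    return (short << IMM_BITS) | _field(imm, IMM_BITS, "imm")


def pack_pair(first, second):
    """Place two short sub-instructions side by side (58 bits in total)."""
    high = _field(first, SHORT_BITS, "first")
    return (high << SHORT_BITS) | _field(second, SHORT_BITS, "second")
```

Under these assumptions a pair of short sub-instructions occupies 58 bits and a long instruction 61 bits, so both fit in one 64-bit instruction word.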
B-2: Prototype of DSP Core:
As stated in the previous report, a working prototype of the DSP core has been developed. This prototype is written in VHDL and supports all the basic functionality that we intend to support in the final version of the DSP core. The prototype core has one ALU, one shifter, and one multiplier. The final version will have 4 ALUs, 2 multipliers, and 2 shifters. The prototype was synthesized using Mentor Graphics tools and works at 100 MHz in 1.2u CMOS technology. With technology scaling we expect much higher performance. The layout of this prototype has been completed using Mentor Graphics' layout tool. The resulting area is 6.4 mm^2. This core supports data hazard detection and forwarding as intended for the final version.
B-3: Low Power and High Performance Memory Hierarchy Design:
Memory performance is a critical issue in any processor design. With the increasing gap between processor performance and memory performance, a good amount of research effort is directed at improving the performance of memory. Our architecture demands a high-performance, low-power memory hierarchy. We are approaching its design in two steps: first, a high-performance, low-power memory needs to be designed, and second, a good interconnection protocol is needed.
Low-Power SRAM Design:
SRAMs are the basic building blocks of any cache memory. Hence, designing a low-power cache essentially means designing a low-power SRAM. The major source of power dissipation in any SRAM is active power, i.e., the power consumed while reading and writing a cell. This power consumption is mainly due to the charging and discharging of large bit line capacitances. In addition, a large memory has long word lines, and the capacitances associated with them are large. These large word line and bit line capacitances are a major source of power consumption and delay. In the literature, various techniques, such as the multi-divided word line, have been proposed to reduce the word line capacitance. To reduce the bit line capacitance, we have tried a novel idea: a hierarchical combination of SRAM cells. Since the bit line capacitance is mainly contributed by the drain capacitance of the word-line select transistors of the SRAM cells, combining these cells hierarchically reduces the number of select transistors on each bit line and hence its capacitance.
We have connected two SRAM cells to the bit lines using one word-line transistor. To perform simulations, we have modeled one 16K x 4-bit SRAM in 1u CMOS technology. We have used the divided word line scheme and arranged the memory in 4 blocks, each block with 8 sub-blocks of 512 rows and 4 columns. Because we have combined 2 SRAM cells, we have 256 rows instead. Hence, our bit line capacitance is approximately half of that with 512 rows. In the performance comparison with the standard one-cell architecture, we obtain a 25% power reduction and approximately half the access time for the SRAM cell.
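The row counts quoted above follow from simple arithmetic, sketched below in Python. The sketch assumes that every bit line sees one select-transistor drain per group of combined cells; the function name bitline_rows and its parameters are illustrative only.

```python
def bitline_rows(words, bits_per_word, blocks, sub_blocks, columns,
                 cells_per_select=1):
    """Number of select-transistor drains loading one bit line.

    cells_per_select is the number of SRAM cells sharing one word-line
    select transistor (1 for the standard cell, 2 for the combined cell).
    """
    total_bits = words * bits_per_word
    cells_per_column, remainder = divmod(total_bits,
                                         blocks * sub_blocks * columns)
    if remainder or cells_per_column % cells_per_select:
        raise ValueError("organization does not divide evenly")
    return cells_per_column // cells_per_select


standard = bitline_rows(16 * 1024, 4, blocks=4, sub_blocks=8, columns=4)
combined = bitline_rows(16 * 1024, 4, blocks=4, sub_blocks=8, columns=4,
                        cells_per_select=2)
print(standard, combined, combined / standard)
```

The 65536 bits spread over 4 x 8 x 4 columns give 512 cells per column, and pairing the cells halves the number of drains on each bit line.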
Cache Hierarchy Design:
The proposed DSP processor consists of 8 DSP cores operating in parallel. Each of these DSP cores should be able to access a data cache. Our architecture will have 8 caches connected to the 8 DSP cores via a fast interconnecting switching fabric. This fabric will be managed by a controller, which will decide the port allocation and resolve any conflicts. This approach increases the hit time relative to caches local to each core, but it solves the cache coherency problem, increases the cache memory size, and reduces bus activity. Reduced bus activity lowers the power consumption, and the larger memory size increases the hit ratio.
C. Division/Square-Root Coprocessor:
A fully functional 8-bit prototype of a shared division and square root chip has been developed using Mentor tools. The chip uses the radix-2 SRT division and square root algorithms. Two different division algorithms, namely the Svoboda-Tung (ST) and Sweeney-Robertson-Tocher (SRT) algorithms, were evaluated in the course of the design. The SRT division and square root algorithms were chosen because of the close similarity between the two, which leads to maximum hardware sharing. Moreover, HEAT simulations indicate that the ST and SRT dividers consume similar amounts of power. The apparent advantage of the ST algorithm is speed, because it allows signed-digit arithmetic and thus reduces the delay of the adder chain to a single full adder (FA). However, a novel implementation of the divider using carry-save adders is being explored, which would allow the SRT dividers to have only a single FA delay in their adder chains as well. In view of the above and the ease of coding the SRT algorithms, the radix-2 SRT division and square root algorithms stand out as the clear choice for implementation in the proposed chip.
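For reference, the radix-2 SRT division recurrence can be sketched in a few lines of Python. This is the textbook recurrence with exact rational arithmetic, not the chip's VHDL; the operand ranges (divisor in [1/2, 1), dividend no larger in magnitude than the divisor) and the names srt_radix2_divide and digits_to_value are assumptions of the sketch.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def srt_radix2_divide(x, d, n):
    """Radix-2 SRT division producing n signed quotient digits.

    Returns (digits, w) where digits are in {-1, 0, 1}, most significant
    first, and w is the final partial remainder, so that
    x == d * sum(q_j * 2**-j) + w * 2**-n.
    """
    x, d = Fraction(x), Fraction(d)
    if not HALF <= d < 1:
        raise ValueError("divisor must be normalized to [1/2, 1)")
    if abs(x) > d:
        raise ValueError("dividend magnitude must not exceed the divisor")
    w = x
    digits = []
    for _ in range(n):
        shifted = 2 * w
        # Digit selection compares the shifted remainder with +/- 1/2 only;
        # it never needs a full comparison against the divisor.
        if shifted >= HALF:
            q = 1
        elif shifted < -HALF:
            q = -1
        else:
            q = 0
        w = shifted - q * d
        digits.append(q)
    return digits, w


def digits_to_value(digits):
    """Value of a signed-digit fraction 0.q1 q2 ... qn."""
    return sum(Fraction(q, 2 ** (j + 1)) for j, q in enumerate(digits))
```

Because the selection depends only on a coarse estimate of the partial remainder, the same loop structure is used for square root, which is what makes the hardware sharing attractive.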
The chip also uses various error status signals to indicate the status and type of any error occurring in the operations. Handshaking signals are used for error acknowledgement (using a single bidirectional ACK pin). Other standard signals such as Request and Reset are also provided on the chip. A novel idea in the implementation is to carry out the operations asynchronously while keeping the control part of the chip synchronized to the global clock. As a result, a change in the quotient digit size has minimal impact on the clock frequency (though the latency of operations in terms of the number of clock cycles will of course change).
The chip has been synthesized from VHDL code, which has been written in generic terms. The implementation uses an on-the-fly converter that changes the signed digits of the quotient to unsigned digit form. The critical path on the clock signal in the control part of the chip is ~12 ns, which implies a clock frequency of ~70 MHz (independent of the quotient digit size). The latency of operations for the 8-bit implementation is 50 ns.
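On-the-fly conversion can be illustrated with a short Python sketch. It follows the standard radix-2 scheme that keeps two registers, Q and QM = Q - 1, so that every update is a shift with one appended bit and no carry propagation; it is not derived from the chip's VHDL, and the name on_the_fly_convert is illustrative.

```python
def on_the_fly_convert(digits):
    """Convert signed digits in {-1, 0, 1} (most significant first) to an
    ordinary integer without any carry-propagate subtraction.

    q holds the converted value so far and qm always equals q - 1.
    Each step only shifts one register and appends a single bit.
    """
    q, qm = 0, -1
    for d in digits:
        if d == 1:
            q, qm = 2 * q + 1, 2 * q        # append 1 to Q; QM is Q with 0
        elif d == 0:
            q, qm = 2 * q, 2 * qm + 1       # append 0 to Q; QM is QM with 1
        elif d == -1:
            q, qm = 2 * qm + 1, 2 * qm      # Q is QM with 1; QM is QM with 0
        else:
            raise ValueError("digit must be -1, 0 or 1")
    return q
```

In hardware the two registers are loaded in parallel through multiplexers, so the converted quotient is available as soon as the last digit is produced.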
D. CORDIC Coprocessor:
Improved power estimation of the CORDIC unit has been achieved through a multiple-step capacitance-extraction process that is recognized for its high reliability. This rather involved technique was undertaken with the expectation that enhanced power estimates based on refined capacitance values would not deviate significantly from the preliminary predictions. This expectation was confirmed. Nevertheless, the capacitance-extraction method provided a necessary check on the previous calculations. This work has thus prepared a technical foundation for the next period's effort.
The CORDIC algorithm is still an area of intensely active research. Novel variants are proposed regularly by a variety of sources. Therefore, new modifications are continuously evaluated for suitability to this project as part of the ongoing effort to develop a product that will be at the forefront of current technology both now and at the time of delivery.
E. Digit-Serial FPGAs:
Several performance benchmarks for FIR filters and RSA cryptography were completed using digit-serial implementations on Xilinx 4000-series FPGAs. In the case of the FIR filters, both unsigned and two's complement designs were implemented. Digit sizes of 1, 2 and 4 were evaluated, using unpipelined, digit-level pipelined and bit-level pipelined schemes. For a 5-tap filter, the results showed that the area-time product was optimized for a digit size of 2 with digit-level pipelining. For RSA cryptography, a modified Montgomery modular multiplication design was implemented for a variety of word and digit sizes.
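As background for the RSA benchmark, the following Python sketch shows textbook Montgomery modular multiplication processed one digit of the multiplier per iteration. It is not the modified design mentioned above; the function name, the parameter names, and the requirement that the word size be a multiple of the digit size are assumptions of the sketch. It needs Python 3.8 or later for pow with a negative exponent.

```python
def montgomery_digit_serial(a, b, m, word_bits, digit_bits):
    """Return a * b * R**-1 mod m with R = 2**word_bits.

    The multiplier a is consumed digit_bits at a time, least significant
    digit first. Requires m odd, m < 2**word_bits and 0 <= a, b < m.
    """
    if word_bits % digit_bits:
        raise ValueError("word size must be a multiple of the digit size")
    if m % 2 == 0 or not 0 <= a < m or not 0 <= b < m or m >= 1 << word_bits:
        raise ValueError("invalid operands")
    radix = 1 << digit_bits
    mask = radix - 1
    m_prime = (-pow(m, -1, radix)) % radix     # -m**-1 mod 2**digit_bits
    s = 0
    for i in range(word_bits // digit_bits):
        a_i = (a >> (i * digit_bits)) & mask   # next digit of the multiplier
        s += a_i * b
        q = ((s & mask) * m_prime) & mask      # makes s + q*m divisible by radix
        s = (s + q * m) >> digit_bits          # exact division by the radix
    return s - m if s >= m else s              # s < 2*m, one correction suffices
```

Changing digit_bits trades the number of iterations against the amount of work per iteration, which is the kind of trade-off the word and digit size experiments explore.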
The major components of the basic cell for the digit-serial FPGA were designed. The cell consists of the following elements: bit-level add/logic units, carry-save and fast-carry logic and D-type flip-flops. The cell is being optimized for digit sizes up to 4, but larger digit-sizes can be accommodated by chaining two or more cells together.
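The way a digit-serial cell processes a word can be sketched in Python as follows. This is a behavioral illustration of digit-serial addition with a carry register between cycles, not a model of the FPGA cell; the name digit_serial_add and its parameters are illustrative.

```python
def digit_serial_add(a, b, word_bits, digit_bits):
    """Add two unsigned word_bits-wide operands digit_bits at a time.

    Returns (sum modulo 2**word_bits, carry out, number of cycles).
    The carry variable plays the role of the flip-flop that holds the
    carry between consecutive digit cycles.
    """
    if word_bits % digit_bits:
        raise ValueError("word size must be a multiple of the digit size")
    mask = (1 << digit_bits) - 1
    carry = 0
    result = 0
    cycles = 0
    for shift in range(0, word_bits, digit_bits):
        digit_a = (a >> shift) & mask
        digit_b = (b >> shift) & mask
        total = digit_a + digit_b + carry      # one digit-wide addition
        result |= (total & mask) << shift
        carry = total >> digit_bits            # saved for the next cycle
        cycles += 1
    return result, carry, cycles
```

A 16-bit addition takes 4 cycles with a digit size of 4 and 16 cycles with a digit size of 1, while the adder hardware shrinks accordingly.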
2. Next Period Activities
--------------------
A: Programmable DSP:
In the next quarter, we plan to implement the proposed architecture using VHDL and Mentor Graphics tools to verify the functionality of the design. We also plan to run some benchmark programs to test the performance of the proposed architecture.
The SRAM design will be completed and its performance will be compared with other existing memory architectures. In addition, benchmarking for the overall memory hierarchy design will be undertaken.
B: Division/Square-Root Coprocessor:
The SRT algorithms as currently implemented incur the delay of a full carry ripple chain in each adder chain. One way to get around this is to use carry look-ahead logic or multiplexer-based adders. These implementations speed up the operations by a factor of ~8 but are costly to implement. A method that is as effective but less costly needs to be explored. A novel approach based on using carry-save adders in the implementation of the divider is being explored. In the best case (~95% of cases) this would reduce the delay of the adder chain to one FA, and in the worst case (~5%) the delay would be that of a full carry ripple adder. The chip will also need to be made compatible with IEEE-754 numbers, which is straightforward. Another idea worth exploring is an implementation in digit-serial form instead of the current digit-parallel form.
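The reason a carry-save adder has a delay of a single FA can be seen from the following Python sketch of a 3:2 compressor. It illustrates the general technique only and does not model the divider under study; the function name carry_save_add is illustrative.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to a sum word and a carry word.

    Every bit position is an independent full adder, so no carry ripples
    along the word. The identity a + b + c == sum_word + carry_word holds.
    """
    sum_word = a ^ b ^ c                              # per-bit sum outputs
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry outputs
    return sum_word, carry_word


s, cy = carry_save_add(0b1011, 0b0110, 0b1101)
print(s + cy == 0b1011 + 0b0110 + 0b1101)
```

A carry-propagate addition is only needed when the sum and carry words finally have to be merged into a single value.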
C: CORDIC Coprocessor:
Algorithmic and architectural modifications presented in published and yet-to-be-published sources that are deemed beneficial to this project will be applied to the current design. Each alteration will be verified against several interdependent criteria including power dissipation, area consumption, intra-iteration speed, and iteration depth.
As the design progresses through several adaptations, constant re-simulation will be required to ensure an accurate prediction of power consumption. Thus, until the architecture is finalized, power estimation will continue to command a great deal of labor and attention.
D. Digit-Serial FPGA:
Domain-specific digit-serial libraries for COTS FPGAs will be constructed in the areas of signal processing and cryptography. A wider range of filters and transforms will be considered for the signal processing library, and elliptic curve cryptosystems will be added to the cryptography library.
The basic cell of the digit-serial FPGA will be optimized and simulation benchmark studies will be carried out. Also, investigations into routing structures for the digit-serial FPGA will be initiated.
3. Documentation
--------------
Y-N. Chang, J. H. Satyanarayana, and K. K. Parhi, "Design and Implementation of Low-Power Digit-Serial Multipliers", Proc. IEEE Int. Conf. on Computer Design (ICCD), Austin, TX, October 1997.
Y-N. Chang, J. H. Satyanarayana, and K. K. Parhi, "Systematic Design of High-Speed and Low-Power Digit-Serial Multipliers", submitted to IEEE Trans. on Circuits and Systems-II, June 1997.
The paper "FPGA-Based FIR Filters Using Digit-Serial Arithmetic" was accepted for presentation at the 1997 IEEE ASIC Conference (ASIC-97).