Multiplier Project

CMPE222 VLSI design
Professor Matthew Guthaus
Fall 2008
From: Marcelo Siero

This project involved developing a simple 16-bit signed multiplier in a scalable 180nm process. This report describes the approach to the design and some of the tools I created to put the project together.

The project was designed using a Cadence-based toolset:

- Virtuoso: schematic capture
- Diva: DRC, LVS
- Spectre: circuit simulation
- MDL: Measurement Description Language
- Viva: simulation display

This was supplemented with:

- iVerilog/Perl: golden data model and early verification for the specific multiplier model
- Perl: automation of verification and timing extraction

The project was implemented in a manner leading to the development of a tool kit for automation related to the design of array structures. One of my personal goals was to use this experience to build a tool kit that makes it easier to design any array structure (multipliers, RAMs, etc.) from the standpoint of:

1. Automatic verification based on the use of co-simulation.
2. Automatic extraction of timing elements: rise time, prop delay, desired clock frequency.
3. Automatic optimization for Vth and Vdd at various temperatures.

The report first shows a Virtuoso schematic of the resulting multiplier. The top level view of this schematic can be summarized in 3 pages, which show the various buffers, the negation system, etc.

Additional pictures show the heart of the project, i.e. the multiplier array. Both the schematic and the layout are shown.

The top level in the hierarchy is the ckt called "full_mult". It contains 3 main subsystems:

in_reg16_neg_buf There are two instances of this. Between them they contain the 16-bit registers that house the incoming X and Y data, the two 16-bit negators, and the buffers that drive the Xn and Yn lines across all the AND product functions of the multiplier (actually 2-input NOR gates).

out_neg_reg32 the output negator and storage register.

mult_array - The array multiplier, which does all the unsigned multiplications.

full_mult also contains the logic that determines whether the 32-bit output result requires negating, and activates that negation.
full_mult is the top of the ckt, and was used to generate a netlist for simulation. Simulation is covered in the verification section later in this report.
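The data path described above (negate negative inputs, multiply unsigned, negate the output when the input signs differ) can be sketched in a few lines of Python. This is only a behavioral illustration, not the netlist: the function names are made up for this sketch, and Python's `*` stands in for mult_array.

```python
def negate(value, bits):
    """Two's complement negation: invert every bit, add one, keep `bits` bits."""
    mask = (1 << bits) - 1
    return ((value ^ mask) + 1) & mask


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def signed_multiply(x, y, bits=16):
    """Behavioral model of the full_mult data path on raw bit patterns.

    x and y are `bits`-wide two's complement patterns; the result is a
    2*bits-wide two's complement pattern.
    """
    sign_x = (x >> (bits - 1)) & 1
    sign_y = (y >> (bits - 1)) & 1
    mag_x = negate(x, bits) if sign_x else x   # input negator for X
    mag_y = negate(y, bits) if sign_y else y   # input negator for Y
    product = mag_x * mag_y                    # stands in for mult_array
    if sign_x ^ sign_y:                        # signs differ: negate the output
        product = negate(product, 2 * bits)
    return product
```

For example, `signed_multiply(0b0000000011111111, 0b1111111100000000)` models 255 * -256, and `to_signed(result, 32)` converts the returned pattern back to an ordinary integer.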

Full Mult

in_reg16_neg_buf Circuit.

out_neg_reg32 Circuit.

Now we go into more detail on the array multiplier, starting with a picture of it. The multiplier implements a simple sum of subproducts without much optimization (no carry save, no look-ahead carry ckts). The layout is fairly tight and uses only 3 layers of metal.

The array multiplier.
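As a rough bit-level picture of "a simple sum of subproducts", the sketch below forms each subproduct with a NOR gate on the inverted Xn/Yn lines (NOR of two inverted inputs is the AND of the originals) and adds one row at a time with ripple-carry full adders. The row-by-row arrangement and all names here are assumptions made for illustration; the actual cell wiring in mult_array may differ.

```python
def nor(a, b):
    """2-input NOR on single bits."""
    return 1 - (a | b)


def full_adder(a, b, cin):
    """Single-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(x, y, bits=16):
    """Unsigned multiply as a plain sum of subproducts, no carry save."""
    xn = [1 - ((x >> i) & 1) for i in range(bits)]   # inverted X lines
    yn = [1 - ((y >> j) & 1) for j in range(bits)]   # inverted Y lines
    acc = [0] * (2 * bits)                           # running sum, LSB first
    for j in range(bits):                            # one row per Y bit
        carry = 0
        for i in range(bits):                        # ripple along the row
            subproduct = nor(xn[i], yn[j])           # equals x[i] AND y[j]
            acc[i + j], carry = full_adder(acc[i + j], subproduct, carry)
        acc[j + bits] = carry                        # row carry-out
    return sum(bit << k for k, bit in enumerate(acc))
```

A model like this is mainly useful as a cross-check against a behavioral `x * y` when debugging the structural version.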

The following shows the layout of the array multiplier. A lot of time was spent trying to get this layout to be LVS clean, but the array fell short. Even though all the submodules were LVS clean, LVS found errors in the middle of the array, which at this point is a mystery that requires more troubleshooting. Power and ground are shared between mirrored cells along the horizontal axis to create a tighter layout.

The array's size is 219u x 100.93u. The step size of each basic cell of the array (i.e. a full adder and an undersized NOR gate) is 13.3u by 6.66u.

Next we can see the array in schematic form. A variety of Virtuoso schematic features were used to implement this schematic. Showing all the cells by brute force turned out to be the easiest approach.

mult_arr_Schematic

Here we see a closeup of the same schematic, which indicates the detailed names. Throughout this project we worked at finding an approach to the various aspects of design that minimized manual tasks related to the size of the array. Using bussed schemes with Virtuoso made many tasks a lot simpler. That is to say, using buses for schematics, and iteration of pin placement in the schematics and the layout simplified that task considerably.

mult_arr Schematic up-close

The mult_quad_cell was built first to resolve the interface between cells. Good advantage was taken of "Edit-in-Place" to make a tight design using only 3 metal layers for this cell. That leaves plenty of metal layers for power routing and other uses.

The quad cell is essentially a repeat of the very simple cell basic_mult_cell consisting of a full-adder and a small nor-gate, which is also mirrored around the gnd line.

The entire array gets summarized by this symbol in the full_mult top level diagram, taking advantage of bus notation.

mult_array_cell

The earlier symbol is actually a simplification of the symbol below, which illustrates how the pins were actually placed in the layout. We were trying to do this in a manner that avoided manual placement of pins (e.g. as if we were doing a 54-bit multiplier). Because of the sharing of power and gnd, it was not possible to get an even span for sequential pins, so we divided the pins into even and odd groups. This diagram shows how patchcords can be used cleverly in Virtuoso to merge multiple buses and get around this problem. The symbol given previously corresponds to this schematic.

mult_array_simplified

About Buffer Sizing

Buffers are a very important part of making the multiplier run fast. Driving large capacitive loads with small inverters doesn't cut it, so let me go over the approach used to size our buffers.

Unfortunately, getting an LVS clean array and getting to the point where we are ready to simulate took a long time. We have created some tools to make multiple simulations and optimization efficient, and we are ready to use them, but we haven't done so yet. So stay tuned, since we will be running these simulations.

The buffers were sized to scale up to the large capacitance presented by the array, using the following method. We estimate the total W (gate width) driven by the signal, assuming that L is the same for all devices and won't affect the load. If we take a fan-out of 3x the input capacitance per stage as an efficient scale-up, each stage can drive this:

           Stage      W of Load:
           ---------------------
	     1: 	3.24
	     2: 	9.72
	     3: 	29.16
	     4: 	87.48
	     5: 	262.44
	     6: 	787.32
	     7: 	2361.96
	     8: 	7085.88
	     9: 	21257.64
	     10: 	63772.92
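The table above is a geometric series, so it can be regenerated and used to pick a stage count with a short script. A minimal sketch: the 1.08u base input width is not stated with the table but is inferred from its first entry (3.24 / 3), and the function names are made up here.

```python
BASE_W_UM = 1.08   # assumed input width of the first stage, inferred as 3.24 / 3
FANOUT = 3         # each stage drives 3x its own input load


def drivable_width(stage, base_w=BASE_W_UM, fanout=FANOUT):
    """Total gate width (in u) that a chain of `stage` stages can drive."""
    return base_w * fanout ** stage


def stages_needed(load_w, base_w=BASE_W_UM, fanout=FANOUT):
    """Smallest stage count whose table entry covers the given load width."""
    stage = 1
    while drivable_width(stage, base_w, fanout) < load_w:
        stage += 1
    return stage
```

With these defaults, `stages_needed(138.24)` returns 5 and `stages_needed(17.28)` returns 3. Rounding up is a conservative choice; a load that only slightly exceeds a table entry could reasonably use one stage fewer.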

We are using 3 different buffers in this design for scaling up Fan out:

1. d_ibuf
These buffers drive the NOR2 gates that make up each subproduct. These NOR gates were made undersized to present smaller loads. Each input thus consists of a W of 1.08u for the P device and a W of .270u for the N device, i.e. a total of 1.35u. Each _Xn or _Yn signal drives 16 of these, that is, 21.6u worth of width. Using the table above, this indicates a scale-up of about 3 stages.

2. clk_ibuf
These buffers drive all the registers. The system currently has to drive 64 master-slave FFs. I am using high leakage FFs with two half-transmission gates per latch (very low load). Each FF puts a total W load of 1.08 * 2 = 2.16u on its clock, so 64 * 2.16 => 138.24u of W. This translates to 4 or 5 stages. We could experiment a bit here if we had time to optimize this; we will use 5 stages for the time being. We could let the NOR gate be the first stage, but for simplicity's sake we will ignore it and start from the first inverter.

3. ibuf_mux_ctl
These buffers are each used to drive the pos or neg select line of 16 Muxes. Each Mux presents a PMOS and an NMOS device, for a total W of 1.08u. Driving 16 Muxes thus gives a total W of 17.28u, so we will choose 3 stages for these buffers.

Below is a picture of the schematic for the 5 stage clock buffer.

clk_ibuf

We use instance iteration to make the schematic appear simple. Here we have a 16-plex symbol for the inverting d_ibuf. We only have to change the d_ibuf element to change all of them.

data_buf

Other Peripheral Logic Needed.

We basically implemented the half-adders for the negator by tying one leg of the full-adder to gnd, in order to save time. Implementing real half-adders in addition to the full-adders that were already designed is an optimization that could eventually be done, but it wasn't high on the priority list for this toy project.

Since the adder cannot be iterated because of the carry chain, we implemented this with a 4-bit adder, instantiated 4 times hierarchically.

half_adder 16-bit array

As can be seen all that was necessary was to tie one of the legs to ground.

half_adder
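To illustrate the idea, here is a small bit-level sketch of a 16-bit negator built from full adders with one leg grounded, grouped into four 4-bit blocks so the carry chain is explicit. It assumes the negator works by inverting the input and adding 1 through the carry-in, which is the usual two's complement scheme; the function names are made up for this sketch.

```python
def full_adder(a, b, cin):
    """Single-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def half_adder(a, cin):
    """Half adder made from a full adder with one leg tied to ground."""
    return full_adder(a, 0, cin)


def increment4(bits4, cin):
    """4-bit block: ripple the carry through four grounded-leg full adders."""
    out = []
    carry = cin
    for bit in bits4:                 # bits4 is LSB first
        s, carry = half_adder(bit, carry)
        out.append(s)
    return out, carry


def negate16(value):
    """Negate a 16-bit pattern: invert, then add 1 via the carry-in."""
    inverted = [1 - ((value >> i) & 1) for i in range(16)]
    carry = 1                         # the "+1" enters as the first carry-in
    result = []
    for block in range(4):            # four 4-bit blocks chained by carry
        out, carry = increment4(inverted[4 * block:4 * block + 4], carry)
        result.extend(out)
    return sum(bit << i for i, bit in enumerate(result))
```

The only signal that crosses a block boundary is the carry, which is why the blocks have to be chained hierarchically instead of being iterated as one bused instance.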

For registers we used a pass-gate implementation. We don't expect that this is the best choice, but it creates a very small FF, and it is fun to play with an alternate approach. The main problem with this approach is that it will of course be slower and more leaky.

Pass-gate M/S FF

simple MS FF

The two-way 1-bit MUX shown here is of course a very useful and flexible ckt. It was used to implement the negator, and also to implement an XOR gate that helps determine whether the output needs to be negated. We bring both select and select-bar out as pins, in case both signals already exist.

Here we show how we turn this into a 16-bit 2-way Mux with an iterated symbol.

mux_16x
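The trick of getting an XOR out of a 2-way mux is easy to check in a few lines. In this sketch the select polarity (0 passes the first input) is an assumption, and the function names are made up for illustration.

```python
def mux2(sel, a, b):
    """2-way 1-bit mux: pass a when sel is 0, b when sel is 1."""
    return b if sel else a


def xor_from_mux(a, b):
    """XOR built from one mux: a selects between b and its complement."""
    return mux2(a, b, 1 - b)


def mux16(sel, a_bus, b_bus):
    """16-bit 2-way mux: one shared select, one mux2 per bit."""
    return [mux2(sel, a, b) for a, b in zip(a_bus, b_bus)]
```

Feeding the two input sign bits to `xor_from_mux` gives the "output needs negating" signal, and `mux16` with a value and its negation on the two buses gives a conditional negator.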

Note that a lot of the schematic elements were implemented using the iterated symbols feature of Virtuoso, which simplifies the schematics considerably. This feature also provides a clever scheme for connecting an array of modules to buses. A single wire going to a bus or array of pins connects to ALL the pins of that bus or array (and vice versa). Otherwise, the bus must have a width equal to the number of iterated instances times the width of the port on each instance. Typically this is a 1-bit port of an iterated object, so that the bus width corresponds to the number of objects iterated.

Arrays of 16 NOR gates, 16 Muxes, or 16 buffers can be described using a single device. Thus the notation U0<15:0> indicates 16 instances of that device, and the interconnect strategy described above does the rest. Other Virtuoso features, like patchcords, allow you to have two aliases for the same bus. I used these at times to help create the schematic.

For complex elements like the multiplier array, where you try to implement a 2-D structure with somewhat complex interconnect, the approach of using iterated instances and patchcords seems to break down. I tried for a while to implement it this way and ended up very frustrated after a lot of effort, so I used the brute force schematic approach instead. This has the benefit of graphically showing what is going on, but could create a very big schematic depending on the size of the multiplier. For a very large multiplier we might want to make the schematic hierarchical so that it stays readable. There may be bugs in the iterated/bus connect approach for very complex cases.
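The connection rule for iterated instances can be written down as a small function, which helps when checking bus widths by hand. This is only a model of the rule as described above, not of Virtuoso itself; in particular the bit ordering (instance 0 takes the lowest bus indices) is an assumption of the sketch.

```python
def connect_iterated(net_width, instances, port_width=1):
    """Return, for each iterated instance, the net bit indices on its port.

    A 1-bit net fans out to every pin of every instance. Otherwise the net
    must be exactly instances * port_width bits wide and is split evenly.
    """
    if net_width == 1:
        return [[0] * port_width for _ in range(instances)]
    if net_width != instances * port_width:
        raise ValueError(
            "net is %d bits, expected 1 or %d"
            % (net_width, instances * port_width)
        )
    return [
        list(range(i * port_width, (i + 1) * port_width))
        for i in range(instances)
    ]
```

For U0<15:0> with a 1-bit port, `connect_iterated(16, 16)` gives one bus bit per instance, while `connect_iterated(1, 16)` ties all 16 pins to the same wire.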

Results

The following were the resulting specifications achieved by this project with a non-carry save straight-forward multiplier and no carry-look-ahead anywhere:

Clock speed that works with the non-backannotated design for the patterns given: 146.7 MHz.
Period of clock: 6.8164 ns.
Power with enable set: 1.952460 mW.
Power with enable not set: 1.379626e-04 mW.
Projected area of finished project:
27817 sq. u = 27817 / 1000000 mm**2 = .028 mm**2

So as a figure of merit for this design we can calculate the following:
power * area * delay **2 = 1.9524 * .028 * (6.82 ** 2) => 2.54 mW*mm^2*ns^2
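The same arithmetic as a small script, handy for recomputing the figure of merit when any of the three inputs changes. The function names are made up for this sketch; the numbers in the usage note are the ones reported above.

```python
def figure_of_merit(power_mw, area_mm2, delay_ns):
    """power * area * delay**2, in mW * mm^2 * ns^2."""
    return power_mw * area_mm2 * delay_ns ** 2


def clock_mhz(period_ns):
    """Clock frequency in MHz for a period given in ns."""
    return 1000.0 / period_ns
```

`figure_of_merit(1.9524, 0.028, 6.82)` rounds to 2.54, and `clock_mhz(6.8164)` rounds to 146.7, matching the values listed in the results.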

The clock speed/delay was extracted as explained below, based on a set of test vectors working as described in the next section.

Transient Power Curve

Below is a VIVA (Cadence Display System) plot showing the transient power when running the test vectors for this design.

Verification and Time Extraction Methodology.

The verification and timing were performed/extracted as follows. First we prepared two models:

A Verilog model, created from a parameterized multiplier PERL program.
A SPICE netlist, derived via SPECTRE from Virtuoso.

From these, an internally developed PERL/MDL program extracts the digital values generated on the SPICE model at a safe time (currently the following rising edge of the clock for sampled data). The PERL code adjusts for the pipeline effect and compares the digital multiply result extracted from SPICE against the golden result (based on a PERL/Verilog golden model). This results in a PASS/NO PASS test that establishes proper functionality.
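The comparison step can be sketched as follows. This is a Python illustration of the idea, not the PERL/MDL program itself: it assumes the node voltages have already been sampled at each safe time, a 1.8 V supply with a Vdd/2 threshold, and a pipeline latency of one sample, none of which are stated in this report.

```python
def to_bits(voltages, vdd=1.8):
    """Threshold sampled node voltages (MSB first) at vdd/2 into a bit string."""
    return "".join("1" if v > vdd / 2 else "0" for v in voltages)


def check_vectors(sampled, golden, latency=1):
    """Compare sampled results against golden ones, allowing for the pipeline.

    sampled[k] is the bit string read at the k-th sampling edge and golden[i]
    is the expected result for input vector i, which shows up `latency`
    samples later. Returns a list of (vector, got, expected) mismatches, so
    an empty list means PASS.
    """
    failures = []
    for i, expected in enumerate(golden):
        k = i + latency
        got = sampled[k] if k < len(sampled) else None   # missing sample fails
        if got != expected:
            failures.append((i, got, expected))
    return failures
```

Returning the mismatching vectors, and not just a boolean, is what lets the flow report which vector is the bottleneck.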

Based on the PASS/NO PASS test, a binary search is performed to find the shortest clock period at which the test still passes, to a resolution of 25 ps.
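The search itself is an ordinary bisection over the clock period. A minimal sketch, assuming a `passes(period_ns)` callback that runs one simulation and returns True on PASS; the callback and the function name are hypothetical, and the two seed periods are trusted without being re-simulated.

```python
def find_min_period(passes, bad_ns, good_ns, resolution_ns=0.025):
    """Bisect between a known failing and a known passing clock period.

    Returns (last_bad, last_good, runs), where runs is the number of times
    passes() was called. The loop stops once the bracket is no wider than
    resolution_ns (25 ps by default).
    """
    runs = 0
    while good_ns - bad_ns > resolution_ns:
        mid = (bad_ns + good_ns) / 2.0
        runs += 1
        if passes(mid):
            good_ns = mid      # still works: try a shorter period
        else:
            bad_ns = mid       # fails: back off to a longer period
    return bad_ns, good_ns, runs
```

Each halving costs one full transient simulation, so tight seed bounds matter: a 2 ns starting bracket needs 7 runs to get under 25 ps, while a bracket narrower than 1.6 ns needs at most 6.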

The timing was based on testing against a set of test vectors. Note that the clock specified is not necessarily the worst case clock. We need to find the worst possible test vector, and then rerun the timing extraction shown.

The system above identifies the worst case test vector among the ones given. The binary search is seeded with two known initial boundary times (a failing time and a success time). After 6 iterations the following vector from the set was shown to be the bottleneck; the failure and success results are taken from the program's log:
LAST BAD RESULT:
Vec 5: 0000000011111111 1111111100000000 11111111111111100000000100000000
Should have been:
Vec 5: 0000000011111111 1111111100000000 11111111111111110000000100000000
LAST GOOD CLOCK PERIOD: 6.81640625 ns,
LAST BAD CLOCK PERIOD: 6.8046875 ns,

This identifies an operational clock period, given this estimate of the worst case path.

Other Related Accomplishments.

Design work accomplished in relation to this project:

A schematic for a full signed multiplier was implemented.
A DRC clean layout was made for the array, and most of the multiplier. The array was made 98% LVS clean.
The resulting schematic was verified, and timing and power were determined.
Area for the design was estimated.

CAD software developed in relation to this project:

Successfully wrote a Verilog code generator for the simulation of basic and carry-save multipliers.
This program was written to generate simulation code for multipliers of arbitrary size. Next we will implement similar code for a Booth multiplier, which clearly performs better.
Wrote some Perl programs to automate the simulation process. The program uses AMS's MDL (Measurement Description Language) for some of its tasks, and parses some of MDL's output. The program will next be adapted to work with any simulator by using standard ASCII output formats for SPICE instead. This program does the following:
- Generates Piece-Wise-Linear stimulus with estimated rise and fall times.
- Extracts the binary values of signals at given points in time in the simulation.
- Does a binary search to find prop delay based on a decision (True or False function).
- Finds rise/fall-time at various locations in the ckt.
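As an illustration of the first item, piece-wise-linear stimulus for one input bit can be generated from the bit sequence like this. This is a Python sketch, not the Perl program: the default rise and fall times and the 1.8 V supply are placeholder assumptions, and formatting the breakpoints into a particular simulator's source statement is left out.

```python
def pwl_points(bits, period_ns, rise_ns=0.1, fall_ns=0.1, vdd=1.8):
    """Return (time_ns, volts) breakpoints for a bit sequence, one bit per period.

    Each change of value starts at its period boundary and ramps linearly
    over rise_ns or fall_ns. Bits that repeat add no breakpoints.
    """
    level = vdd if bits[0] else 0.0
    points = [(0.0, level)]
    for k, bit in enumerate(bits[1:], start=1):
        target = vdd if bit else 0.0
        if target != level:
            start = k * period_ns
            ramp = rise_ns if target > level else fall_ns
            points.append((start, level))          # hold until the boundary
            points.append((start + ramp, target))  # end of the linear ramp
            level = target
    return points
```

Calling this once per input bit across the whole vector set gives the time/value pairs for every stimulus source in the test bench.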

The programs were integrated into a system that automatically verified the implemented multiplier and determined the resulting timing.

Overall UCSC's CMPE222 was a great class.