In digital signal processing applications such as graphics and computation systems, multiply-accumulate (MAC) is one of the primary operations. A MAC unit is a vital component of digital systems that implement Fast Fourier Transform (FFT) algorithms, convolution, image processing algorithms, and so on. Normalization architectures are widely used in digital signal processing; the main purpose of normalization is to perform comparison and shift operations. In this paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logic blocks such as multiplexers and adders. The proposed normalization algorithm is then used to design an 8 × 8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC accepts an 8-bit significand and a 3-bit exponent, its inputs can range from −(7.96875)_{10} to +(7.96875)_{10}. The proposed architecture is designed and implemented in Cadence Virtuoso using 90 nm and 130 nm technologies (Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used extensively. According to the analysis performed in Cadence, the proposed architecture consumes less power than its recent predecessors.

In digital signal processing, the MAC operation is considered a significant and critical operation. Digital Signal Processing (DSP) algorithms execute many mathematical calculations repeatedly and rapidly on various data sets. DSP algorithms can be executed effectively by the majority of operating systems and general-purpose microprocessors. Unfortunately, they raise energy-efficiency issues when run on portable devices such as Personal Digital Assistants (PDAs) and mobile phones. With respect to delay and power optimization, the exponential growth of portable electronics has posed a major challenge to Very Large-Scale Integration (VLSI) design engineers. A MAC unit is a vital component of any digital system that implements FFT algorithms, convolution, and similar operations. The MAC block is not limited to the fixed-point number system: for audio and image processing applications, a floating-point MAC architecture is much needed. The basic MAC operation multiplies two variables and adds the product to an accumulated result.

The popularity of portable devices and the need to limit power consumption (and therefore heat dissipation) in very dense VLSI chips have resulted in rapid advances in low-power design over the past few years. Mobile applications that require low power dissipation and high throughput, such as notebook Personal Computers (PCs), mobile communication devices, and PDAs, are the driving forces behind these innovations. In most cases, low-power requirements have to be met along with the equally challenging targets of high chip density and high speed. Low-power IC design has therefore emerged as a valuable and fast-developing area of Complementary Metal Oxide Semiconductor (CMOS) circuit design. The limited battery life usually places very stringent demands on a portable system's overall power budget. New types of rechargeable batteries, such as “Nickel-Metal Hydride (NiMH)”, are being produced with better energy storage capacity than the traditional “Nickel-Cadmium (NiCd)” batteries. Still, there is no prospect of a significant increase in energy capacity in the foreseeable future. The energy density (the energy stored per unit weight) provided by newer technologies such as NiMH is approximately 30 Watt-hour/pound, which is still low considering the growing applications of portable systems. Reducing the energy dissipation of Integrated Circuits (ICs) through design improvements is therefore a significant task in developing portable devices.

In high-performance digital systems such as microprocessors, microcontrollers, and DSPs, low-power circuit design is also becoming a significant concern. Targeting higher chip density and higher processing speed leads to very complex circuits running at high clock rates. As the chip's clock frequency rises, its power dissipation, and thus its temperature, increases linearly. Because the dissipated heat has to be removed efficiently to keep the chip's temperature at an acceptable level, packaging cost, cooling, and heat extraction become important considerations. A few high-end microprocessors designed in the mid-1990s (such as the Intel Pentium, Digital Equipment Corporation (DEC) Alpha, and PowerPC) operate at frequencies ranging from 100 to 300 MHz, with total average power ranging from 20 to 50 W. Reliability is one more critical factor for VLSI design engineers, and it adds to the demand for energy-efficient design. There is a close connection between the peak power dissipation of an electronic circuit and reliability concerns such as electromigration and hot-carrier-induced device degradation. Additionally, the thermal stress caused by chip heat dissipation is a significant reliability issue. As a consequence, reducing power consumption is also critical for improving reliability. The techniques used to achieve low power consumption in digital systems range from the device and technology level to the algorithm level. Device characteristics (such as threshold voltage), device dimensions, and interconnect properties are essential factors in reducing power consumption. Circuit-level approaches such as careful selection of the logic family, reduction of the total number of voltage transitions, and clocking strategies can be used to minimize transistor-level energy dissipation. Measures at the architecture level include intelligent power management of different system components, the use of pipelining and parallelism, and bus layout design.

In recent years, different researchers have done several works [

As shown in

The rest of this manuscript is organized as follows: Section 2 explains the Exponent-Comparator-Circuit (ECC) and its operation. Section 3 describes the Exponent-Shifter-Circuit (ESC) and its operation. Section 4 describes the proposed SFMAC architecture built from the ECC and ESC blocks. Section 5 compares the proposed SFMAC with existing architectures. Finally, conclusions and future work are presented in Section 6.

The exponent of the product of the inputs and the previous cycle's output exponent are the inputs to the ECC (Exponent-Comparator-Circuit). The key point is that the difference between the two ECC inputs is computed as an arithmetic difference of their magnitudes if both inputs have the same sign. If the inputs have opposite signs, the difference between them equals the arithmetic sum of their magnitudes.

Multiplexers are used in the architecture to compare the inputs. The ECC operation generates a 5-bit output used to execute binary shifts (as shown in

The ECC's inputs are expressed in 2's complement form depending on the input sign bits.

The operation of the ECC is further segregated based on the sign bits of the inputs as follows:

If the sign bits are different, add the inputs of the ECC to produce a 4-bit output (i.e., discard the carry bit). Set the 5th bit to ‘1’ if the exponent of the product of the inputs is negative and the previous exponent is positive; set the 5th bit to ‘0’ otherwise.

If both the sign bits of the inputs to the ECC are the same, then find out the input which is higher among the two and find the difference between the inputs as per the following procedure:

To find the higher number, compare both the numbers bit by bit, i.e., start comparing MSB to LSB, as shown in

To find the difference, use the 2's complement approach. The difference produces a 4-bit output (i.e., discard the borrow bit). Set the 5th bit to ‘0’ if the exponent of the product of the inputs is higher than the previous cycle's exponent; set the 5th bit to ‘1’ otherwise.
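The two sub-steps of the same-sign case can be sketched in software as follows. This is only an illustrative behavioral model, not the authors' multiplexer-level circuit: the function names `is_greater_msb_first` and `twos_complement_difference` and the 4-bit default width are assumptions introduced here.

```python
def is_greater_msb_first(a: int, b: int, width: int = 4) -> bool:
    """Return True if a > b, scanning the bits from MSB down to LSB.

    Hypothetical helper: models a bit-by-bit magnitude comparison.
    """
    for i in range(width - 1, -1, -1):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        if bit_a != bit_b:
            # The first position where the bits differ decides the result.
            return bit_a == 1
    return False  # all bits equal, so a is not greater than b


def twos_complement_difference(larger: int, smaller: int, width: int = 4) -> int:
    """Compute larger - smaller by adding the 2's complement of smaller.

    Masking to `width` bits discards the carry/borrow bit.
    """
    mask = (1 << width) - 1
    neg_smaller = (~smaller + 1) & mask  # 2's complement of the subtrahend
    return (larger + neg_smaller) & mask


if __name__ == "__main__":
    a, b = 0b0110, 0b0010
    if is_greater_msb_first(a, b):
        print(twos_complement_difference(a, b))
    else:
        print(twos_complement_difference(b, a))
```

Ordering the operands first means the subtraction never produces a negative result, so the discarded borrow carries no information.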

In this architecture, multiplexers are used to compare the inputs.

This method yields a 5-bit output that is utilized to do binary shifts in the ESC block.
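The ECC case analysis described in this section can be summarized by a short behavioral sketch. It rests on several assumptions that are not stated in the text: exponents are modeled as signed Python integers, zero is treated as positive, “higher” in the same-sign case is read as an ordinary signed comparison, and the name `ecc` and its packed return value are purely illustrative.

```python
def ecc(prod_exp: int, prev_exp: int) -> int:
    """Behavioral model of the ECC: return a 5-bit word.

    Bit 4 (MSB) is the flag, bits 3..0 are the shift amount.
    prod_exp: exponent of the product of the inputs (assumed signed int).
    prev_exp: exponent of the previous cycle's output (assumed signed int).
    """
    prod_neg = prod_exp < 0   # assumption: zero counts as positive
    prev_neg = prev_exp < 0

    if prod_neg != prev_neg:
        # Opposite signs: the difference is the sum of the magnitudes,
        # truncated to 4 bits (carry discarded).
        amount = (abs(prod_exp) + abs(prev_exp)) & 0xF
        flag = 1 if (prod_neg and not prev_neg) else 0
    else:
        # Same sign: difference of the two values, truncated to 4 bits.
        amount = abs(prod_exp - prev_exp) & 0xF
        flag = 0 if prod_exp > prev_exp else 1

    return (flag << 4) | amount


if __name__ == "__main__":
    for pair in [(3, -2), (-2, 3), (5, 2), (2, 5), (2, 2)]:
        print(pair, format(ecc(*pair), "05b"))
```

Under these assumptions the flag is 1 exactly when the product exponent is not larger than the previous exponent, which is the case in which the product is the operand that gets shifted.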

The ESC (Exponent-Shifter-Circuit) block shifts the smaller number by the difference between the exponent of the product of the 8-bit inputs and the exponent of the previous cycle's MAC output (the preceding output). The inputs to the ESC block are the 5-bit output of the ECC block, the 16-bit product of the inputs, and the previous cycle's 16-bit output. The multiplexer-based design of the ESC block is shown in

The smaller number is identified from the 5-bit ECC result. If the MSB of the ECC output is 1, the product of the inputs is shifted right by the decimal value of the remaining 4 bits of the ECC output. If the MSB of the ECC output is 0, the preceding output is shifted right by the decimal value of the remaining 4 bits of the ECC output.

The MSB of the ECC block output also identifies the input to the ESC block, which does not need shifting. If the MSB of the ECC block output is 1, the previous output is retained (not shifted). If, on the other hand, the MSB of the ECC block output is 0, the product of the inputs is passed in its entirety (not shifted).
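The selection and shift behavior of the ESC can be sketched as below. This is a behavioral model under stated assumptions, not the multiplexer network itself: both 16-bit operands are treated as unsigned magnitudes and shifted logically, and the function name `esc` is hypothetical.

```python
def esc(ecc_out: int, product: int, previous: int) -> tuple[int, int]:
    """Behavioral model of the ESC.

    ecc_out : 5-bit ECC word (MSB = flag, low 4 bits = shift amount).
    product : 16-bit product of the inputs (assumed unsigned magnitude).
    previous: 16-bit previous-cycle output (assumed unsigned magnitude).
    Returns (aligned_product, aligned_previous).
    """
    flag = (ecc_out >> 4) & 1
    shift = ecc_out & 0xF
    product &= 0xFFFF
    previous &= 0xFFFF

    if flag == 1:
        # Product is the smaller operand: shift it, pass previous unchanged.
        return product >> shift, previous
    # Previous output is the smaller operand: shift it, pass product unchanged.
    return product, previous >> shift


if __name__ == "__main__":
    p, q = esc(0b10011, 0x8000, 0x4000)
    print(hex(p), hex(q))
```

After this alignment step both operands refer to the same exponent, so they can be fed to the 16-bit adder.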

To represent positive and negative numbers, the architecture employs sign-magnitude and 2's complement representations. Signed magnitude form is used to describe SFMAC input-output, but these inputs are converted to 2's complement form for the internal calculations. The proposed MAC architecture's final output (MAC output) has 17 bits, including one sign bit.
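The conversion between the two number representations can be illustrated with a minimal sketch. The helper names and the default word width of 17 bits (chosen here only to match the stated output width) are assumptions; the internal word widths of the actual design are not specified in this section.

```python
def sign_magnitude_to_twos(sign: int, magnitude: int, width: int = 17) -> int:
    """Convert (sign bit, magnitude) to a `width`-bit 2's complement word."""
    mask = (1 << width) - 1
    value = -magnitude if sign else magnitude
    return value & mask


def twos_to_sign_magnitude(word: int, width: int = 17) -> tuple[int, int]:
    """Convert a `width`-bit 2's complement word back to (sign bit, magnitude)."""
    mask = (1 << width) - 1
    word &= mask
    if word >> (width - 1):          # MSB set: the value is negative
        return 1, (-word) & mask     # negate to recover the magnitude
    return 0, word


if __name__ == "__main__":
    w = sign_magnitude_to_twos(1, 5, width=8)
    print(format(w, "08b"), twos_to_sign_magnitude(w, width=8))
```

Note that sign-magnitude has two encodings of zero, while 2's complement has one; this sketch maps both to the all-zero word.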

The SFMAC's inputs are two 8-bit binary numbers formatted as shown in

As a result, the exponent term in this architecture varies from ‘−4’ to ‘+3’. The input numbers range from −(0.11111111)_{2} × 2^{+3} to +(0.11111111)_{2} × 2^{+3}, and hence the inputs of the new SFMAC architecture range from −(7.96875)_{10} to +(7.96875)_{10}. Furthermore, the SFMAC architecture's inputs can only be entered as fractions. For instance, the numbers (001)_{2} and (010)_{2} should be entered as (0.00100000)_{2} × 2^{+3} and (0.01000000)_{2} × 2^{+3}, respectively. Similarly, (101)_{2} and (10)_{2} should be represented as (0.10100000)_{2} × 2^{+3} and (0.10000000)_{2} × 2^{+2}, respectively, to be processed by the SFMAC. The 8-bit multiplier, 16-bit register, 16-bit adder, 2:1/4:1 multiplexers of various sizes, and Exponential Adder are the main building blocks of the SFMAC architecture (other than the Exponent Comparator Circuit (ECC) and Exponent Shifter Circuit (ESC) explained earlier). SFMAC's overall architecture is depicted in
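The input examples above can be checked numerically with a small sketch. The field layout is an assumption made for illustration (a sign bit, an 8-bit fractional significand interpreted as significand/256, and an exponent in the range −4 to +3); the function name `decode` is hypothetical.

```python
from fractions import Fraction


def decode(sign: int, significand: int, exponent: int) -> Fraction:
    """Return the value of (0.significand)_2 x 2^exponent with the given sign.

    significand: 8 fractional bits, e.g. 0b10100000 means (0.10100000)_2.
    exponent   : integer in [-4, +3].
    """
    if not 0 <= significand <= 0xFF:
        raise ValueError("significand must fit in 8 bits")
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must be in the range -4..+3")
    magnitude = Fraction(significand, 256) * Fraction(2) ** exponent
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    print(decode(0, 0b00100000, 3))         # (001)_2
    print(decode(0, 0b10100000, 3))         # (101)_2
    print(float(decode(0, 0b11111111, 3)))  # largest positive input
```

Exact rational arithmetic is used so that the range limits are not affected by floating-point rounding.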

The overall SFMAC architecture is developed and implemented in CMOS technology, and a thorough analysis is carried out using Cadence Virtuoso. To limit the power consumption, the architecture employs a “clock gating scheme” and a pipeline mechanism. The pipelined clocking scheme is realized by triggering successive blocks after a predetermined delay.

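The SFMAC architecture is implemented in 90 and 130 nm CMOS technology (GPDK and TSMC, respectively). The average power P_{Average} is measured over the simulation time T_{sim} at the clock frequency f_{clk} and supply voltage V_{DD}; the results are summarized in the following table.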

Architecture | P_{Static} | P_{Average} at V_{DD} = 2 V | Transistor count |
---|---|---|---|
SFMAC at 90 nm (GPDK) | 476.94 μW | 7980 μW | 25783 |
SFMAC at 130 nm (TSMC) | 2398.76 μW | 25990 μW | 25783 |

Although there are architectures that use clock signals just for data accumulation (in the register or accumulator), most of the architectures in the literature do not use any clocking signals. Asynchronous circuits do not have real-time applicability. As a result, the architecture's functional applicability must be further investigated. The architecture shown in [

Serial number | Proposed architecture | Details | Tool/HDL used | Power dissipation | |
---|---|---|---|---|---|
1 | SFMAC | Signed floating-point MAC architecture in 90 nm technology, 2 V at 83.33 MHz & 8 × 8 bit operation | Cadence Virtuoso, 90 nm CMOS | P_{Static} = 0.476 mW | P_{Average} = 7.98 mW |

Serial number | Already reported | Description | Tool/HDL used | Power dissipation | |
---|---|---|---|---|---|
1 | [ | Pipelined multiply accumulate unit (fixed-point) in 180 nm technology, 1.8 V at 83.3 MHz & 8 × 8 bit operation | Cadence Virtuoso | 50.26 mW | |
2 | [ | Multiply accumulate unit (fixed-point) in 180 nm technology, 1.8 V at 217 MHz & 64 × 64 bit operation | Verilog HDL | 177.732 mW | |
3 | [ | Pipelined multiply accumulate unit (fixed-point) in 65 nm technology, 1.1 V at 591 MHz & 16 × 16 bit operation | VHDL | 8.2 mW | |
4 | [ | Multiply accumulate unit (fixed-point) in 90 nm technology, 1 V at 100 MHz & 16 × 16 bit operation | HDL in Cadence's HSPICE simulator | 1.506 mW | |
5 | [ | Pipelined multiply accumulate unit (fixed-point) in 180 nm technology, 1.8 V & 8 × 8 bit operation | HDL in Synopsys Design Compiler | P_{Static} = 2.010 mW | P_{Dynamic} = 3.627 mW |
6 | [ | Multiply accumulate unit (fixed-point) in 180 nm technology, 1.8 V at 5 MHz & 16 × 16 bit operation | Verilog HDL | 493.648 mW | 1765.241 mW |
7 | [ | Multiply accumulate unit (fixed-point) in 32 nm CMOS & CNTFET technology & 1 × 1 bit operation | | 0.9902 mW | 0.6335 mW |
8 | [ | Fixed/floating-point multiply accumulate unit in 90 nm technology for 16-bit half-precision multiplication | VHDL | 14.07 mW | |

A novel approach for performing normalization is presented in this paper. The proposed normalization operation is divided into an Exponent Comparator Circuit (ECC) and an Exponent Shifter Circuit (ESC). The ECC block compares the exponents, while the ESC shifts the smaller number by the difference between the exponents of its inputs. A signed floating-point MAC architecture built on the novel normalization architecture is also proposed. For design and implementation, the Cadence Spectre tool is used with GPDK 90 nm and TSMC 130 nm CMOS technologies. The results indicate that the proposed SFMAC architecture consumes less power than its recent counterpart and is therefore applicable to low-power DSP architectures.
