LoBA: A Leading One Bit Based Imprecise Multiplier for Efficient Image Processing


https://doi.org/10.1007/s10836-020-05883-4


Bharat Garg1 · Sujit Kumar Patel1 · Sunil Dutt2

Received: 27 December 2019 / Accepted: 1 May 2020

Abstract

Several applications, such as signal processing, multimedia and big data analysis, exhibit computational error tolerance. This tolerance can be exploited to achieve efficient designs by sacrificing accuracy. Approximate computing therefore presents a new design paradigm that relaxes the traditional requirement of error-free computation and provides efficient designs with application-specific quality metrics. Because multiplication is a compute-intensive operation, it significantly determines the performance of a processing core. This paper therefore proposes a novel leading one bit based approximate (LoBA) multiplier architecture that selects k bits from each n-bit input (k ≤ n/4) based on the leading one bit (LOB) and then computes an approximate product from these smaller operands. The accuracy is further improved by selecting the next k bits based on the LOB position and including the corresponding partial products when computing the final product. Four imprecise LoBA multipliers are presented that provide a trade-off between accuracy and performance. Finally, the effectiveness of the proposed architectures over existing multipliers is shown both as a standalone arithmetic unit and in an application, by implementing Gaussian smoothing filters. The proposed 16-bit LoBA0 and LoBA1 designs reduce power consumption by 64.2% and 32.9%, respectively, over the existing multiplier architecture.

Keywords Approximate computing · Multiplier architectures · Quality-energy tradeoff · Error resiliency

1 Introduction

Modern devices running compute-intensive multimedia applications require energy-efficient signal processing cores. To achieve high energy efficiency, the computing units within these devices should offer high performance and low power consumption. However, several techniques that reduce power consumption also reduce performance, i.e. they provide a trade-off between power and delay. Approximate computing is an emerging technique that resolves this problem: power and performance can be improved simultaneously with an acceptable loss of quality [18]. Applications whose output is intended for human consumption, such as multimedia applications, are known as error-tolerant applications. The resiliency of these applications arises from redundancy in the input data, iterative computations (where an error introduced in an early stage is healed in later stages) and the absence of a single golden output. Approximate design has therefore emerged as a recent technique for achieving efficient designs for error-tolerant applications [17].

Responsible Editor: S. T. Chakradhar

1 Thapar Institute of Engineering and Technology, Patiala, India

2 Indian Institute of Information Technology Vadodara, Vadodara, India

Significant research has been conducted on high-performance, low-power adders and multipliers using approximate computing [7]. These approximate adders/multipliers can be used in existing accurate signal processing cores (e.g. filters) to achieve energy-efficient signal processing [2]. Most research efforts on high-performance approximate adders reduce the critical path delay by truncating carry propagation. Such adders are obtained either by employing approximate full adders in a few least significant bits (LSBs) or by computing a carry-less approximate sum for a few LSBs. Further, several segmentation-based approximate architectures have been presented that provide a trade-off between performance and quality/accuracy [24]. Along with the approximate adders, several accuracy-reconfigurable adder architectures have been presented that pair an approximate adder with an error detection and correction (EDC) unit to produce an accurate sum when required, with a small performance/power penalty [5,10].

Recently, researchers have paid more attention to developing energy-efficient approximate multiplier architectures, as multipliers also significantly affect the performance of processing cores. Conventional multipliers contain three stages: (a) partial product (PP) generation, (b) compression of these PPs to two rows, and (c) addition of these two rows to obtain the final product. Most approximate multipliers are designed by employing (a) an approximate partial product generator, (b) an approximate compression unit and/or (c) an approximate adder in the last stage. An approximate multiplier that employs approximate adder cells to achieve high performance and has OR-gate based partial recovery logic is presented in [14]. Further, a dynamic range unbiased multiplier (DRUM) that reduces the length of the multiplier and multiplicand to a fixed k bits is presented in [8]. Reducing the number of input operand bits significantly reduces implementation complexity, power and delay. Further, [23] introduced a rounding-based approximate (RoBA) multiplier in which the inputs are first approximated to the closest power-of-two value and the approximate multiplication is then performed using only a few add and shift operations. Furthermore, a scalable approximate multiplier (TOSAM) architecture based on truncation and rounding is introduced in [21], where the input operands are rounded and truncated to obtain an approximate product with reduced implementation complexity.

This paper presents a new leading one bit (LOB) based approximate (LoBA) multiplier architecture that computes an approximate product from the significant k-bit operands. Further, three more LoBA architectures are presented that select two k-bit fields from each operand based on their LOB positions and compute the approximate product from them. These LoBA multipliers reduce implementation complexity while providing a trade-off between performance and accuracy. The main contributions are as follows:

– The paper presents a comprehensive comparative analysis of the state-of-the-art multiplier architectures.

– The paper also presents four high-performance approximate LoBA multiplier architectures.

– Finally, the performance of the proposed multipliers is evaluated against existing multiplier architectures in a real application by implementing Gaussian smoothing filters.

The rest of the paper is organized as follows. Section 2 provides an exhaustive literature review of approximate multiplier architectures. The proposed LoBA multiplier architectures and their comparative analysis are presented in Section 3. Further, the simulation environment and result analysis are provided in Section 4. Finally, Section 5 concludes the paper.

2 State-of-the-art Approximate Multipliers

This section reviews prior work on high-performance approximate multipliers. Khaing et al. presented an error tolerant multiplier (ETM) in which the inputs are segmented into an accurate part containing a few most significant bits (MSBs) and an approximate part containing the remaining LSBs [12]. The accurate parts are multiplied exactly using a conventional array multiplier, while the approximate parts are multiplied using partial-product-less approximate logic. To reduce the amount of error without performance overhead, Garg et al. presented an approximate multiplier that computes the approximate product using AND-OR logic for a few LSBs [3]. Kulkarni et al. presented an under-designed 2-bit multiplier that approximates the product 3 × 3 as 7. The under-designed and accurate multipliers are used to generate the partial products for the LSBs and MSBs, respectively, to multiply operands of large bit-width [11]. Further, Garg et al. presented an accuracy-configurable multiplier architecture that comprises an approximate multiplier and EDC logic to obtain an approximate product with high accuracy without large power/performance overhead [6].
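The under-designed 2-bit block of [11] can be illustrated with a short behavioural sketch. The function names and the way four 2 × 2 blocks are composed into a 4-bit product are assumptions made for illustration only; the text above states just that 3 × 3 is approximated as 7 and that accurate blocks may be used for the MSBs.

```python
def underdesigned_mul_2x2(a: int, b: int) -> int:
    """2-bit multiplier that is exact except for 3 x 3, which yields 7."""
    if not (0 <= a <= 3 and 0 <= b <= 3):
        raise ValueError("operands must be 2-bit values")
    return 7 if a == 3 and b == 3 else a * b


def block_mul_4x4(a: int, b: int, exact_msb: bool = False) -> int:
    """4-bit product assembled from four 2x2 partial products.

    Assumption: exact_msb=True uses an accurate block for the
    high-by-high partial product and under-designed blocks elsewhere.
    """
    if not (0 <= a <= 15 and 0 <= b <= 15):
        raise ValueError("operands must be 4-bit values")
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11
    hi_hi = a_hi * b_hi if exact_msb else underdesigned_mul_2x2(a_hi, b_hi)
    cross = underdesigned_mul_2x2(a_hi, b_lo) + underdesigned_mul_2x2(a_lo, b_hi)
    lo_lo = underdesigned_mul_2x2(a_lo, b_lo)
    # Weight each partial product by the position of its 2-bit fields.
    return (hi_hi << 4) + (cross << 2) + lo_lo
```

In this sketch the only inputs that produce an error are those in which both 2-bit fields of a block equal 3, so the worst case for 4-bit operands is 15 × 15.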

Although the ETM [12] improves performance over the accurate multiplier, it exhibits large error because the design is input independent. A logarithmic multiplier based on the approximate calculation of logarithms and anti-logarithms is reported in [16]. This multiplier can compute the logarithm and anti-logarithm iteratively or non-iteratively. The precision of the non-iterative designs is often fixed, while for the iterative designs the results are refined until a desired level of accuracy is achieved [1,15]. To increase the accuracy of the non-iterative approaches, several methods have been proposed. To reduce the amount of error, Narayanamoorthy et al. presented a dynamic segmentation method (DSM) that computes an approximate product using m-bit operands starting from the LOB of the multiplicand and multiplier. However, the DSM multiplier exhibits a constant non-zero mean error, which leads to large error in applications where errors accumulate [19]. Further, Hashemi et al. introduced a dynamic range unbiased multiplier that provides an imprecise product by multiplying m-bit operands in which (m − 1) bits are selected based on the LOB position and the LSB is set to logic ‘1’ [8]. In the low energy truncation-based approximate multiplier (LETAM) [20], the input operands are truncated and, in the multiplication step, half of the partial products are omitted. Therefore, the delay and power consumption are improved compared with those of the DSM and DRUM structures. To reduce implementation complexity further, the RoBA multiplier first rounds the inputs to a neighbouring power-of-two value and then computes the product with the help of some add and shift logic [23]. RoBA reduces the summation of a large number of partial products to a few additions, which results in improved energy efficiency. Recently, Vahdat et al. presented a scalable TOSAM multiplier that first computes rounded values of the input operands based on the LOB position and then computes the approximate product using add, shift and small-width multiplier operations [21]. In TOSAM, the operands are rounded to the closest odd value to decrease the amount of error, and the width of the rounded operands determines the accuracy-performance trade-off. However, in TOSAM the width of the truncated operand cannot be changed dynamically; therefore, it cannot provide run-time accuracy reconfigurability.
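The DRUM operand reduction described above (keep m bits from the LOB and set the LSB of the kept field to 1) can be sketched behaviourally as follows. This is a minimal software model under stated assumptions, not the hardware of [8]: the name `drum_multiply`, the default m = 8 and the rule that operands already fitting in m bits are used unchanged are choices made for this illustration.

```python
def drum_multiply(a: int, b: int, m: int = 8) -> int:
    """Behavioural sketch of a DRUM-style approximate product."""

    def reduce_operand(x: int):
        lob = x.bit_length() - 1          # leading one bit position (-1 for 0)
        if lob < m:                       # operand already fits in m bits
            return x, 0
        shift = lob - m + 1               # number of discarded low-order bits
        return (x >> shift) | 1, shift    # keep m bits, force the LSB to 1

    ra, sa = reduce_operand(a)
    rb, sb = reduce_operand(b)
    # Small m-bit product, shifted back to the original magnitude.
    return (ra * rb) << (sa + sb)
```

For the 16-bit operands 0100100011111001 and 0011000011101010, this sketch reduces the inputs to 10010001 and 11000011 before the 8-bit multiplication.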

3 Proposed LoBA Approximate Multiplier Architectures

The objective of the proposed multiplier is to perform large bit-width multiplication using a small fixed-width multiplier, reducing delay and power consumption while controlling the amount of error. To achieve this, the significant k bits (where k ≤ n/4), denoted $A_{KH}$ and $B_{KH}$, are selected from the n-bit input operands A and B based on their LOB positions. $A_{KH}$ and $B_{KH}$ are fed to a k-bit multiplier to obtain a partial product, and the final product is obtained by shifting this partial product by the appropriate amount. Since the k-bit multiplier is much smaller than an n-bit multiplier, it significantly reduces area, power and delay relative to the accurate multiplier. Further, this method provides much higher accuracy than direct truncation of LSBs, as it captures more significant (non-zero) bits. In this method, the remaining part of the input that is truncated, $(A - A_{KH})$, represents the truncation error. The accuracy of the proposed multiplication approach can be further improved by capturing significant non-zero bits from the truncated number $(A - A_{KH})$. Therefore, to reduce the amount of error, we select another k bits ($A_{KL}$ and $B_{KL}$) based on the LOB position in the truncated part of each input operand. We then compute more partial products from the captured significant bits ($A_{KH} \times B_{KL}$, $A_{KL} \times B_{KH}$ and $A_{KL} \times B_{KL}$) and add them to the approximate product ($A_{KH} \times B_{KH}$). Considering more partial products brings the resulting product closer to the accurate value at the cost of increased overhead.

In this paper, we present four LOB-based approximate multipliers, namely LoBA0, LoBA1, LoBA2 and LoBA3, that consider one ($A_{KH} \times B_{KH}$), two ($A_{KH} \times B_{KH}$ and $A_{KH} \times B_{KL}$), three ($A_{KH} \times B_{KH}$, $A_{KH} \times B_{KL}$ and $A_{KL} \times B_{KH}$) and four ($A_{KH} \times B_{KH}$, $A_{KH} \times B_{KL}$, $A_{KL} \times B_{KH}$ and $A_{KL} \times B_{KL}$) partial products, respectively, to compute the final approximate product ($A \times B$). The mathematical expressions for the 16-bit multiplier are given below. Let $A_{KH}$ represent the most significant 4 bits of A selected based on the LOB, and let $k_{a1}$ represent the LOB position. The maximum value of $k_{a1}$ is 15 (for a 16-bit input) and the minimum is 3. The values of $A_{KH}$ and $B_{KH}$ in terms of A and B, respectively, are given by

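$$A_{KH} = A \gg (k_{a1} - 3) \quad \text{and} \quad B_{KH} = B \gg (k_{b1} - 3) \tag{1}$$

Further, the values of $A_{KL}$ and $B_{KL}$ are given by

$$\begin{aligned}
A_{KL} &= \left[\left(A \ll (n - k_{a1} + 3)\right) \gg (n - k_{a1} + 3)\right] \gg (k_{a2} - 3) \\
B_{KL} &= \left[\left(B \ll (n - k_{b1} + 3)\right) \gg (n - k_{b1} + 3)\right] \gg (k_{b2} - 3)
\end{aligned} \tag{2}$$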

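where $k_{a2}$ ($k_{b2}$) represents the LOB position in the remainder of the operand after extracting $A_{KH}$ ($B_{KH}$) from A (B). The output product of LoBA0 ($P_0$) is given by $(A_{KH} \times B_{KH}) \cdot 2^{K_1}$, where

$$K_1 = k_{a1} - 3 + k_{b1} - 3 = k_{a1} + k_{b1} - 6 \tag{3}$$

Then, the product $P_0$ is given by

$$P_0 = \left(A \gg (k_{a1} - 3)\right)\left(B \gg (k_{b1} - 3)\right) \cdot 2^{K_1} \tag{4}$$

$$P_0 = \left[\left(A \gg (k_{a1} - 3)\right)\left(B \gg (k_{b1} - 3)\right)\right] \ll (k_{a1} + k_{b1} - 6) \tag{5}$$

Similarly, the values of $P_1$, $P_2$ and $P_3$ are given by Eq. 6.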

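$$\begin{aligned}
P_1 &= (A_{KH} \times B_{KH}) \cdot 2^{K_1} + (A_{KH} \times B_{KL}) \cdot 2^{K_2} \\
P_2 &= (A_{KH} \times B_{KH}) \cdot 2^{K_1} + (A_{KH} \times B_{KL}) \cdot 2^{K_2} + (A_{KL} \times B_{KH}) \cdot 2^{K_3} \\
P_3 &= (A_{KH} \times B_{KH}) \cdot 2^{K_1} + (A_{KH} \times B_{KL}) \cdot 2^{K_2} + (A_{KL} \times B_{KH}) \cdot 2^{K_3} + (A_{KL} \times B_{KL}) \cdot 2^{K_4}
\end{aligned} \tag{6}$$

where the values of $K_2$, $K_3$ and $K_4$ are given by Eq. 7.

$$\begin{aligned}
K_2 &= k_{a1} - 3 + k_{b2} - 3 = k_{a1} + k_{b2} - 6 \\
K_3 &= k_{a2} - 3 + k_{b1} - 3 = k_{a2} + k_{b1} - 6 \\
K_4 &= k_{a2} - 3 + k_{b2} - 3 = k_{a2} + k_{b2} - 6
\end{aligned} \tag{7}$$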

The proposed multiplication approach is explained with the help of the example shown in Fig. 1. For simple analysis, two 16-bit numbers (A = 0100100011111001 and B = 0011000011101010) are considered. Two 4-bit numbers ($A_{KH}$ = 1001 and $A_{KL}$ = 1111) with their corresponding LOB positions ($k_{a1}$ = 14 and $k_{a2}$ = 7) are extracted from operand A. Similarly, $B_{KH}$ = 1100 and $B_{KL}$ = 1110 are extracted from operand B, with corresponding LOB positions $k_{b1}$ = 13 and $k_{b2}$ = 7. While computing the product for LoBA0, only the terms $A_{KH}$ and $B_{KH}$ are considered and their product ($A_{KH} \times B_{KH}$) is evaluated. This partial product is shifted left by $K_1$ ($k_{a1} - 3 + k_{b1} - 3 = 11 + 10 = 21$) to obtain the approximate product.

Fig. 1 An illustration of the proposed multiplication approach. [Figure not reproduced: for the operands above it shows the reduced operands, left shifts of 21, 15, 14 and 8 bits for the four LoBA partial products and 13 bits for DRUM, and the resulting products: accurate (23,39,23,482)₁₀, DRUM (23,16,28,000)₁₀, LoBA0 (22,64,92,416)₁₀, LoBA1 (23,06,21,184)₁₀, LoBA2 (23,35,70,304)₁₀ and LoBA3 (23,36,24,064)₁₀.]

To improve accuracy, LoBA1 also considers the partial product corresponding to the terms $A_{KH}$ and $B_{KL}$; this product term is computed and added to the approximate product obtained in LoBA0. Similarly, in LoBA2, the partial product of the terms $A_{KL}$ and $B_{KH}$ is computed ($(A_{KL} \times B_{KH}) \ll K_3$) and added to the product from LoBA1. Finally, in LoBA3, all four partial products are computed and added to obtain a product with very high accuracy. From the figure, it can be observed that the accuracy increases from LoBA0 to LoBA3.
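The computation of Eqs. 1 to 7 can be checked with a small behavioural model. The sketch below is an illustration in software rather than the authors' Verilog or MATLAB code; the names `extract_k_bits` and `loba` and the `level` parameter (0 to 3, selecting LoBA0 to LoBA3) are assumptions introduced here. The LOB position is clamped to k − 1, following the minimum value of 3 stated for the 16-bit, k = 4 case.

```python
def extract_k_bits(x: int, k: int = 4):
    """Return (k-bit field starting at the LOB, LOB position, remainder)."""
    # LOB position, clamped to k-1 so that small or zero values are kept whole.
    pos = max(x.bit_length() - 1, k - 1)
    shift = pos - (k - 1)
    field = x >> shift                      # Eq. 1 style right shift
    remainder = x & ((1 << shift) - 1)      # bits below the extracted field
    return field, pos, remainder


def loba(a: int, b: int, level: int = 3, k: int = 4) -> int:
    """Approximate product using level+1 partial products (LoBA0..LoBA3)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("level must be 0, 1, 2 or 3")
    a_kh, ka1, a_rem = extract_k_bits(a, k)
    a_kl, ka2, _ = extract_k_bits(a_rem, k)
    b_kh, kb1, b_rem = extract_k_bits(b, k)
    b_kl, kb2, _ = extract_k_bits(b_rem, k)
    sa1, sa2 = ka1 - (k - 1), ka2 - (k - 1)
    sb1, sb2 = kb1 - (k - 1), kb2 - (k - 1)
    terms = [
        (a_kh * b_kh) << (sa1 + sb1),   # shift K1
        (a_kh * b_kl) << (sa1 + sb2),   # shift K2
        (a_kl * b_kh) << (sa2 + sb1),   # shift K3
        (a_kl * b_kl) << (sa2 + sb2),   # shift K4
    ]
    return sum(terms[: level + 1])


A = 0b0100100011111001
B = 0b0011000011101010
for level in range(4):
    print(f"LoBA{level}: {loba(A, B, level)}  (exact: {A * B})")
```

For the operands of the example, the model gives 226492416, 230621184, 233570304 and 233624064 for LoBA0 to LoBA3, against an exact product of 233923482.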

The architecture of the proposed LoBA0 is illustrated in Fig. 2. It consists of leading one detector (LOD) based k-bit extractors, each of which computes a k-bit output ($A_K$ or $B_K$) and the LOB position ($k_a$ or $k_b$). The shift signal generation (SSG) unit computes the shift value (K) from $k_a$ and $k_b$. Finally, the partial product ($A_K \times B_K$) and the final product (P) are obtained by a k-bit multiplier and a barrel shifter, respectively.

[Fig. 2: block diagram of the LoBA0 multiplier, not reproduced. Blocks: two LOD-based k-bit extractors (n-bit inputs A and B), SSG unit, k-bit multiplier ($A_K \times B_K$) and barrel shifter producing P.]

Similar to LoBA0, the architectures corresponding to LoBA1, LoBA2 and LoBA3 are illustrated in Figs. 3, 4 and 5, respectively.

Fig. 3 Architecture of proposed LoBA1 multiplier [Figure not reproduced. Blocks: LOD-based extractors (two k-bit fields), SSG unit generating K1 and K2, k-bit multipliers, barrel shifters and an adder producing P.]

The next section evaluates the effectiveness of proposed LoBA multiplier architectures on the basis of quality and design metrics.

4 Simulation Results and Analysis

This section first presents performance analysis followed by quality analysis of proposed multipliers over the existing approximate multiplier architectures.

4.1 Performance Analysis

The proposed and existing multiplier architectures are first coded in Verilog and then synthesized with Synopsys Design Compiler using a 65nm PDK for performance analysis. The truncation length used in the DRUM design is half of the input operand bit-width. The synthesis results (area, power, delay and energy) are shown in Table 1. From these results, it can be observed that, among the approximate designs, the 16-bit RoBA [23] and the proposed LoBA0 consume the maximum and minimum energy, respectively. Further, the proposed 16-bit LoBA0 and LoBA1 achieve 37.9% and 3.1% lower delay and 64.2% and 32.9% lower power, respectively, than DRUM [8]. Similarly, it can also be observed that all the proposed 32-bit LoBA multipliers provide improved power, delay and energy over the 32-bit RoBA multiplier. Although LoBA3 consumes slightly more power/energy than DRUM, it provides better accuracy/quality metrics, which are discussed in the next subsection.

Fig. 4 Proposed LoBA2 multiplier architecture

Fig. 5 Architecture of proposed LoBA3 multiplier
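The delay and power reductions quoted for the 16-bit LoBA0 and LoBA1 designs follow directly from the DRUM, LoBA0 and LoBA1 columns of Table 1. The short calculation below reproduces them; the helper name `reduction_percent` and the dictionary layout are assumptions made for this illustration, while the numbers are the 16-bit entries of Table 1.

```python
def reduction_percent(reference: float, proposed: float) -> float:
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# 16-bit synthesis results taken from Table 1 (power in uW, delay in ns).
drum = {"power": 53.3, "delay": 0.95}
designs = {
    "LoBA0": {"power": 19.03, "delay": 0.59},
    "LoBA1": {"power": 35.76, "delay": 0.92},
}

for name, metrics in designs.items():
    power_saving = reduction_percent(drum["power"], metrics["power"])
    delay_saving = reduction_percent(drum["delay"], metrics["delay"])
    print(f"{name}: power -{power_saving:.2f}%, delay -{delay_saving:.2f}%")
```

The computed savings agree with the quoted percentages to within rounding of the last digit.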

Further, the performance of the proposed multipliers relative to the existing multipliers is evaluated in a real application by implementing Gaussian smoothing filters (GSFs) embedded with the proposed and existing multipliers [9]. The synthesis results are computed and presented for comparative analysis. As shown in Fig. 6(a), the GSFs embedded with the proposed LoBA multipliers have smaller delay than the GSFs embedded with the DRUM [8] and RoBA [23] multipliers. Finally, the energy consumption of the GSFs embedded with the proposed and existing multipliers is illustrated in Fig. 6(b), where it can be observed that the GSF embedded with RoBA requires the maximum energy while the GSF with the proposed LoBA0 requires the minimum energy.

The next subsection presents accuracy/quality analysis of the proposed multipliers over the existing approximate multiplier architectures.

4.2 Quality/Accuracy Analysis

The accuracy of the proposed multipliers is evaluated against existing approximate multipliers both as a standalone arithmetic unit and in an application. To this end, the multipliers are modelled in MATLAB and simulated with 1 million random input patterns, and quality/error metrics are computed. Further, GSFs embedded with these approximate multipliers are also implemented and simulated with the benchmark Lena image for quality analysis in the application [4]. In Gaussian smoothing, a smoothed pixel is obtained by convolving an input image sub-matrix with a Gaussian kernel. For example, for a 5 × 5 image sub-matrix, the smoothing is given by Eq. (8).

$$P_{xy} = \frac{1}{2^8}
\begin{bmatrix}
1 & 3 & 6 & 3 & 1 \\
3 & 15 & 25 & 15 & 3 \\
6 & 25 & 41 & 25 & 6 \\
3 & 15 & 25 & 15 & 3 \\
1 & 3 & 6 & 3 & 1
\end{bmatrix}
*
\begin{bmatrix}
P_{11} & P_{12} & P_{13} & P_{14} & P_{15} \\
P_{21} & P_{22} & P_{23} & P_{24} & P_{25} \\
P_{31} & P_{32} & P_{33} & P_{34} & P_{35} \\
P_{41} & P_{42} & P_{43} & P_{44} & P_{45} \\
P_{51} & P_{52} & P_{53} & P_{54} & P_{55}
\end{bmatrix} \tag{8}$$

where $P_{xy}$ represents the smoothed pixel value. In the convolution, the multiplications are performed using the proposed and existing approximate multipliers, whereas the other operations remain accurate.

Table 1 Design metrics of approximate multipliers at 65nm PDK

| Bit-width | Metric | Acc. | DRUM | RoBA | LoBA0 | LoBA1 | LoBA2 | LoBA3 |
|---|---|---|---|---|---|---|---|---|
| 16-bit | Area (μm²) | 11578 | 3417 | 3792 | 1283 | 2405 | 3589 | 4516 |
| 16-bit | Power* (μW) | 187.1 | 53.3 | 190.4 | 19.03 | 35.76 | 52.91 | 67.4 |
| 16-bit | Delay (ns) | 1.79 | 0.95 | 1.69 | 0.59 | 0.92 | 1.43 | 1.70 |
| 16-bit | Energy (fJ) | 2247 | 617.5 | 1224 | 206.8 | 381.7 | 589.3 | 932.4 |
| 32-bit | Area (μm²) | 24145 | 7251 | 8931 | 3156 | 6021 | 8902 | 9832 |
| 32-bit | Power* (μW) | 737.1 | 210 | 631 | 71.6 | 126.1 | 193.2 | 251.5 |
| 32-bit | Delay (ns) | 3.16 | 1.67 | 2.99 | 1.05 | 1.57 | 1.82 | 2.21 |
| 32-bit | Energy (fJ) | 7662 | 2105 | 4131 | 791 | 1536 | 2203 | 2913 |

*Normalized power is computed at 100MHz frequency.
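The smoothing of a single pixel can be sketched as a multiply-accumulate over a 5 × 5 window in which only the multiplication is replaceable. This is a minimal illustration under stated assumptions: the function name `smooth_pixel`, the pluggable `multiply` argument and the use of a right shift by 8 for the normalisation are choices made here, and the kernel is symmetric, so the element-wise form below coincides with the convolution.

```python
KERNEL = [
    [1, 3, 6, 3, 1],
    [3, 15, 25, 15, 3],
    [6, 25, 41, 25, 6],
    [3, 15, 25, 15, 3],
    [1, 3, 6, 3, 1],
]


def smooth_pixel(window, multiply=lambda coeff, pixel: coeff * pixel):
    """Smoothed value of the centre pixel of a 5x5 window.

    `multiply` stands in for the multiplier under test; additions and the
    final normalisation by 2**8 stay exact.
    """
    acc = 0
    for kernel_row, window_row in zip(KERNEL, window):
        for coeff, pixel in zip(kernel_row, window_row):
            acc += multiply(coeff, pixel)
    return acc >> 8


flat = [[255] * 5 for _ in range(5)]
print(smooth_pixel(flat))                                    # exact multiplier
print(smooth_pixel(flat, lambda c, p: c * (p & 0xF0)))       # toy truncating multiplier
```

Any approximate multiplier model with the same two-argument interface can be passed as `multiply`; the truncating lambda above is only a stand-in for demonstration.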

This subsection first introduces the quality metrics considered for evaluation, followed by an analysis of these metrics for the proposed and existing multipliers as a standalone arithmetic unit and in the application.

4.2.1 Quality metrics

In recent years, with the emergence of approximate computing, several error/quality metrics have been introduced to evaluate approximate designs effectively. Along with conventional error metrics such as mean error and mean square error (MSE), the following error metrics are considered for comparative analysis.

Mean Error Distance (MED) [13]: MED represents the average of the error distance (ED), where ED is the absolute error ($ED = |P_{acc} - P_{app}|$). The mathematical expression of MED is given by Eq. (9).

$$MED = \frac{1}{N} \sum_{i=1}^{N} ED_i \tag{9}$$

Normalized Error Distance (NED) [13]: NED represents the value of MED normalized to the maximum value of the error. It is a design-size-independent parameter and is given by Eq. (10).

$$NED = \frac{MED}{\max_i(ED_i)} \tag{10}$$

Mean Relative Error (MRE) [13]: It reflects the mean of the relative error and is mathematically expressed by Eq. (11).

$$MRE = \frac{1}{N} \sum_{i=1}^{N} \frac{ED_i}{P_{ACC_i}} \tag{11}$$

where $P_{ACC_i}$ is the accurate value of the product.
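The metrics of Eqs. (9) to (11), together with the MSE, can be computed from paired lists of accurate and approximate products as sketched below. The function name `error_metrics` and the returned dictionary are assumptions for illustration; skipping samples whose accurate product is zero in the MRE sum is also an assumption, since the text does not say how that case is handled.

```python
def error_metrics(exact, approx):
    """Return MED, NED, MRE and MSE for paired exact/approximate products."""
    if len(exact) != len(approx) or not exact:
        raise ValueError("inputs must be non-empty and of equal length")
    eds = [abs(e - a) for e, a in zip(exact, approx)]   # error distances
    n = len(eds)
    med = sum(eds) / n
    max_ed = max(eds)
    ned = med / max_ed if max_ed else 0.0               # NED as in Eq. (10)
    # Assumption: terms with a zero accurate product are left out of the sum.
    mre = sum(ed / e for ed, e in zip(eds, exact) if e != 0) / n
    mse = sum(ed * ed for ed in eds) / n
    return {"MED": med, "NED": ned, "MRE": mre, "MSE": mse}


metrics = error_metrics([100, 200, 400], [90, 200, 440])
print(metrics)
```

In a study such as the one described above, `exact` and `approx` would hold the products for each random input pattern.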

[Fig. 6: bar charts, not reproduced, of (a) delay (ns) and (b) energy (fJ) of the GSFs embedded with the Acc., DRUM, RoBA, LoBA0, LoBA1, LoBA2 and LoBA3 multipliers.]

Table 2 Quality metrics of the proposed and existing approximate multipliers

| Bit-width | Metric | DRUM | RoBA | LoBA0 | LoBA1 | LoBA2 | LoBA3 |
|---|---|---|---|---|---|---|---|
| 16-bit | MED | 3.8×10^7 | 2.9×10^7 | 8.7×10^7 | 6.6×10^7 | 5.4×10^6 | 1.6×10^6 |
| 16-bit | NED | 0.123 | 0.114 | 0.126 | 0.114 | 0.106 | 0.086 |
| 16-bit | MRE | 3.8×10^−4 | 3.2×10^−4 | 8.3×10^−3 | 4.4×10^−4 | 5.1×10^−4 | 2.1×10^−4 |
| 16-bit | MSE | 5.1×10^15 | 3.5×10^15 | 7.8×10^15 | 2.7×10^15 | 3.8×10^13 | 1.8×10^12 |
| 32-bit | MED | 6.4×10^16 | 1.3×10^17 | 2.4×10^17 | 1.2×10^16 | 9.4×10^13 | 6.2×10^12 |
| 32-bit | NED | 0.123 | 0.114 | 0.170 | 0.112 | 0.105 | 0.082 |
| 32-bit | MRE | 1.9×10^−8 | 3.9×10^−7 | 5.1×10^−7 | 1.6×10^−8 | 2.1×10^−9 | 1.1×10^−9 |
| 32-bit | MSE | 1.4×10^28 | 6.4×10^34 | 5.5×10^32 | 1.8×10^27 | 1.1×10^26 | 5.0×10^24 |

Table 3 GSF quality metrics of the proposed approximate multipliers

| Metric | DRUM | RoBA | LoBA0 | LoBA1 | LoBA2 | LoBA3 |
|---|---|---|---|---|---|---|
| PSNR | 29.0 | 30.4 | 28.3 | 30.1 | 33.7 | 36.9 |
| SSIM | 0.737 | 0.795 | 0.712 | 0.778 | 0.838 | 0.893 |

Fig. 7 Lena images (512×512) smoothed by GSFs embedded with: (a) DRUM [8], (b) RoBA [23], (c) LoBA0, (d) LoBA1, (e) LoBA2 and (f) LoBA3 multiplier architectures

Peak Signal to Noise Ratio (PSNR): This is a frequently used quality metric in image processing applications. Its value in dB is given by Eq. (12).

$$PSNR = 20 \log_{10}\left(\frac{I_{max}}{\sqrt{MSE}}\right) \tag{12}$$

where $I_{max}$ is the maximum input signal.
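Eq. (12) can be evaluated for two images of equal size as in the sketch below. The function name `psnr`, the nested-list image representation and the default $I_{max}$ = 255 (8-bit pixels) are assumptions made for this illustration; identical images are reported as infinite PSNR because the MSE is then zero.

```python
import math


def psnr(reference, test, i_max: float = 255.0) -> float:
    """PSNR in dB between two equally sized images given as nested lists."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_pixel, test_pixel in zip(ref_row, test_row):
            squared_error += (ref_pixel - test_pixel) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must contain at least one pixel")
    mse = squared_error / count
    if mse == 0:
        return math.inf               # identical images
    return 20.0 * math.log10(i_max / math.sqrt(mse))


reference = [[10, 20], [30, 40]]
degraded = [[11, 21], [31, 41]]       # every pixel off by one
print(round(psnr(reference, degraded), 2))
```

Here `reference` would be the output of the GSF with an accurate multiplier and `test` the output of the GSF with an approximate one.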

Structural Similarity Index Metric (SSIM) [22]: This metric provides a structure-based analysis and represents the structure of objects in the given image. Its value does not depend on the average luminance and contrast. The mathematical expression is given by Eq. (13).

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{13}$$

where $\mu_x$ and $\mu_y$ are the mean values, $\sigma_x^2$ and $\sigma_y^2$ are the variances and $\sigma_{xy}$ is the covariance, while $C_1$ and $C_2$ are constants that keep the value of the metric finite.

4.2.2 Error Metrics Analysis

The above-mentioned error metrics are computed for the proposed and existing approximate multipliers and summarized in Table 2. The simulation results show that the proposed LoBA3 provides the minimum value of every error metric among all the multipliers considered. Further, the error/quality metrics of the proposed multipliers generally improve from the LoBA0 to the LoBA3 architecture.

Finally, for the quality analysis of the approximate multipliers in the application, GSFs embedded with these approximate multipliers are modelled and simulated with the Lena benchmark image in MATLAB. The extracted quality metrics are summarized in Table 3. From the simulation results, it can be observed that the proposed LoBA2 and LoBA3 multipliers provide higher PSNR and SSIM values (the desirable condition) than the existing DRUM and RoBA multipliers, while LoBA1 exceeds DRUM on both metrics. The images filtered by GSFs embedded with the various multipliers are shown in Fig. 7. The GSFs embedded with the proposed LoBA2 and LoBA3 multipliers provide better image quality than those embedded with the existing multipliers.

5 Conclusion

In this paper, we have presented four high-performance approximate multiplier architectures. The proposed multipliers select a few significant bits (k bits) from each n-bit input operand based on the leading one bit position and compute a partial product from these truncated k-bit numbers. The resulting partial product is shifted left by the appropriate amount to obtain the final approximate product. Further, to obtain approximate multipliers with different accuracy/performance trade-offs, two significant k-bit fields are extracted from each input operand and the corresponding partial products are calculated. Based on the number of partial products considered in the final product computation, different multiplier architectures are presented. Finally, the efficacy of the proposed multipliers over existing multipliers was demonstrated both as an individual arithmetic unit and in an application, on the basis of their quality and design metrics.

References

1. Babič Z, Avramovič A, Bulič P (2008) “An iterative Mitchell’s algorithm based multiplier”. In: 2008 IEEE International Symposium on Signal Processing and Information Technology. IEEE, pp 303–308

2. Garg B, Bharadwaj NK, Sharma G (2014) “Energy scalable approximate DCT architecture trading quality via boundary error-resiliency”. In: 2014 27th IEEE International System-on-Chip Conference (SOCC), IEEE pp 306–311

3. Garg B, Sharma G (2016) “Low power signal processing via approximate multiplier for error-resilient applications”. In: 2016 11th International Conference on Industrial and Information Systems (ICIIS), IEEE, pp 546–551

4. Garg B, Sharma G (2016) A quality-aware energy-scalable Gaussian smoothing filter for image processing applications. Microprocess Microsyst 45:1–9

5. Garg B, Dutt S, Sharma G (2016) Bit-width-aware constant-delay run-time accuracy programmable adder for error-resilient applications. Microelectron J 50:1–7

6. Garg B, Sharma G (2017) ACM: An energy-efficient accuracy configurable multiplier for error-resilient applications. J Electron Test 33(4):479–489

7. Han J, Orshansky M (2013) “Approximate computing: An emerging paradigm for energy-efficient design”. In: Test Symposium (ETS), 2013 18th IEEE European, pp 1–6

8. Hashemi S, Bahar R, Reda S (2015) “DRUM: A dynamic range unbiased multiplier for approximate applications”. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, pp 418–425

9. Jaiswal A, Garg B, Kaushal V, Sharma G (2015) “SPAA-aware 2D Gaussian smoothing filter design using efficient approximation techniques”. In: VLSI Design (VLSID), 2015 28th International Conference on, IEEE, pp 333–338

10. Kahng A, Kang S (2012) “Accuracy-configurable adder for approximate arithmetic designs”. In: Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pp 820–825

11. Kulkarni P, Gupta P, Ercegovac M (2011) “Trading accuracy for power with an underdesigned multiplier architecture”. In: VLSI Design (VLSI Design), 2011 24th International Conference on, pp 346–351

12. Kyaw KY, Goh W-L, Yeo K-S (2010) “Low-power high-speed multiplier for error-tolerant application”. In: Electron Devices and Solid-State Circuits (EDSSC), 2010 IEEE International Conference pp 1–4

13. Liang J, Han J, Lombardi F (2011) New metrics for the reliability of approximate and probabilistic adders. Computers, IEEE Transactions on 99:1–1

14. Liu C, Han J, Lombardi F (2014) “A low-power, high-performance approximate multiplier with configurable partial error recovery”. In: Proceedings of the conference on Design, Automation & Test in Europe, European Design and Automation Association, p 95

15. Mclaren DJ (2003) “Improved Mitchell-based logarithmic multiplier for low-power DSP applications”. In: IEEE International [Systems-on-Chip] SOC Conference, 2003. Proceedings. IEEE, pp 53–56

16. Mitchell JN (1962) Computer multiplication and division using binary logarithms. IRE Trans Electron Comput 4:512–517

17. Mittal S (2016) A survey of techniques for approximate computing. ACM Computing Surveys (CSUR) 48(4):62

18. Moreau T, Sampson A, Ceze L (2015) Approximate computing: Making mobile systems more efficient. IEEE Pervasive Computing 14(2):9–13

19. Narayanamoorthy S, Moghaddam HA, Liu Z, Park T, Kim NS (2015) “Energy-efficient approximate multiplication for digital signal processing and classification applications”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol 23, pp 1180–1184

20. Vahdat S, Kamal M, Afzali-Kusha A, Pedram M (2017) LETAM: A low energy truncation-based approximate multiplier. Computers & Electrical Engineering 63:1–17

21. Vahdat S, Kamal M, Afzali-Kusha A, Pedram M (2019) “TOSAM: An Energy-efficient truncation-and rounding-based scalable approximate multiplier,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems

22. Wang Z, Bovik A, Sheikh H, Simoncelli E (2004) Image quality assessment: from error visibility to structural similarity. Image Processing, IEEE Transactions on 13(4):600–612

23. Zendegani R, Kamal M, Bahadori M, Afzali-Kusha A, Pedram M (2017) ROBA Multiplier: a rounding-based approximate multiplier for high-speed yet energy-efficient digital signal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25(2):393–401

24. Zhu N, Goh WL, Zhang W, Yeo KS, Kong ZH (2010) “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing”. Very Large Scale Integration (VLSI) Systems, IEEE Transactions 18(8):1225–1229


Bharat Garg received the B.E. degree in Electronics from Rajiv Gandhi Prodhyogiki Vishwavidhyalaya, Bhopal, in 2001, and the M.Tech. (VLSI Design) and Ph.D. degrees from ABV-Indian Institute of Information Technology and Management Gwalior, India, in 2007 and 2017, respectively. Currently, he is working as an Assistant Professor in the Electronics and Communication Engineering Department at Thapar Institute of Engineering and Technology, Patiala. His research interests include the design and development of energy-efficient VLSI architectures and hardware security. He has more than three years of experience in the semiconductor industry and more than six years in academic institutions. He has authored/co-authored more than twenty-five research papers in peer-reviewed international journals and conferences.

Sujit Kumar Patel received the B.E. degree in Electronics and Communication Engineering from Jabalpur Engineering College, Jabalpur, India, and the M.Tech. degree from DA-IICT, Gandhinagar, Gujarat, India, in 2006 and 2009, respectively. He completed his Ph.D. at Jaypee University of Engineering and Technology, Guna (M.P.). Currently, he is an Assistant Professor with the ECE department, Thapar Institute of Engineering and Technology, Patiala, Punjab. His research interests include VLSI system design for signal processing algorithms. He has published 3 IEEE Transactions articles and 1 IET journal article.

Sunil Dutt received his Ph.D. from Indian Institute of Technology Guwahati, India, in 2019. He received his M.Tech. degree in Computer Science (VLSI specialization) from Indian Institute of Information Technology and Management Gwalior, India, in 2013 and his B.Tech. degree in Electronics & Communication Engineering from The Northcap University (formerly ITM University), Gurgaon, India, in 2007. Currently, he is working as an assistant professor at Indian Institute of Information Technology Vadodara, India. His research interests include approximate computing, computer architecture, digital circuits and systems, and process-variation-aware digital circuits and systems design.