I. Introduction
Recent years have seen a tremendous interest in artificial neural networks (ANNs), their successful applications in a wide range of problems, including image recognition [1] and face detection [2], their promising development on graphical processing units (GPUs) [3], and their efficient hardware implementations on different design platforms, such as analog, digital, and hybrid very large scale integrated (VLSI) circuits and field programmable gate arrays (FPGAs) [4]. An ANN is a computing system built up from a number of simple and highly interconnected processing elements [5]. As shown in Fig. 1(a), its fundamental unit, called a neuron, sums the multiplications of weights by input variables, adds the bias value to this summation, and propagates this result to the activation function. While the bias value has the effect of increasing or decreasing the input of the activation function, the activation function limits the amplitude of the neuron output [6]. Mathematically, the neuron behavior can be defined as $y = f(u)$ and $u = \sum_{i=1}^{n} w_i x_i + b$, where $n$ denotes the number of input variables and weights. On the other hand, Fig. 1(b) presents an ANN design including hidden and output layers, where each circle denotes a neuron. Observe from Fig. 1 that the hardware complexity of an ANN depends heavily on the weight and bias values and is dominated by a large number of multiplications of constant weights by input variables. Over the years, many algorithms and design architectures have been introduced to reduce the hardware complexity of ANNs [7, 8, 9, 10, 11, 12, 13, 14].

In this article, we explore the hardware complexity of ANNs under the parallel and time-multiplexed architectures. Note that a time-multiplexed design, where computations are realized one at a time, reusing the computing resources, is preferred to a parallel design in applications with a strict area requirement. However, since the time-multiplexed design needs multiple clock cycles to obtain the final result, it has a higher latency and energy consumption with respect to the parallel design [15]. To further explore the tradeoff between area on the one hand and latency and energy consumption on the other, in this article, we consider two time-multiplexed architectures. Moreover, since floating-point multiplication and addition operations occupy larger area and consume more energy than their integer counterparts [16], the floating-point weight and bias values found during training are converted to integers. Since the sizes of the integer weight and bias values have a direct impact on the hardware complexity, we introduce a technique that can find the minimum quantization value, sacrificing a little loss in the hardware accuracy. Also, for each design architecture, we propose an algorithm that can tune the weight and bias values such that the hardware complexity is reduced while avoiding a loss in the hardware accuracy. Furthermore, since the ANN design includes a large number of multiplications of constant weights by input variables and these weights are determined beforehand, these constant multiplications are realized under the shift-adds architecture using the fewest number of addition/subtraction operations found by previously proposed optimization algorithms [17, 18, 19]. Experimental results clearly indicate that the hardware-aware post-training and the multiplierless design lead to a significant reduction in the ANN hardware complexity with a little loss in the hardware accuracy. Moreover, different design architectures present alternative ANN realizations with different hardware complexity so that a designer can choose the one that fits best in an application.
The rest of this article is organized as follows. Section II presents the background concepts and related work. The parallel and time-multiplexed design architectures are described in Section III. The hardware-aware post-training techniques are introduced in Section IV. Section V describes the multiplierless design of ANNs. The CAD tool is introduced in Section VI and the experimental results are presented in Section VII. Finally, Section VIII concludes the article.
II. Background
II-A. ANN Basics
Although the design techniques presented in this article can be applied to different ANN architectures, such as convolutional and recurrent, we consider feedforward ANNs, which do not include any feedback loop. Given the ANN structure, including the number of inputs, outputs, layers, and neurons in each layer and the activation functions in each layer, the weight and bias values of the ANN are determined in a training phase, where the error between the desired and actual values is reduced using an iterative optimization algorithm. State-of-the-art training algorithms [20, 21, 14] consist of efficient techniques for initialization, optimization, and stopping criteria and include a number of activation functions. The training process is generally carried out offline on processors and/or GPUs. In the testing process, the ANN response to the applied inputs is computed using the weight and bias values determined in the training phase. The ANN computation is generally carried out online on a hardware design platform, such as application specific integrated circuits (ASICs) and FPGAs.
II-B. Multiplierless Constant Multiplications
Multiplication of constants by variable(s) is a ubiquitous and crucial operation in many applications, such as digital signal processing, cryptography, and compilers [22]. As illustrated in Fig. 2, constant multiplications can be categorized into four main classes:

- The single constant multiplication (SCM) operation realizes the multiplication of a single constant $c$ by a single variable $x$, i.e., $y = cx$.

- The multiple constant multiplication (MCM) operation computes the multiplication of a set of constants $c_1, \ldots, c_m$ by a single variable $x$, i.e., $y_i = c_i x$ with $1 \le i \le m$.

- The constant array-vector multiplication (CAVM) operation implements the multiplication of a $1 \times n$ constant array $C$ by an $n \times 1$ input vector $X$, i.e., $y = \sum_{i=1}^{n} c_i x_i$.

- The constant matrix-vector multiplication (CMVM) operation realizes the multiplication of an $m \times n$ constant matrix $C$ by an $n \times 1$ input vector $X$, i.e., $Y = CX$ with $y_j = \sum_{i=1}^{n} c_{ji} x_i$ and $1 \le j \le m$.

Observe that the CMVM operation is the most general case and corresponds to an SCM operation when both $m$ and $n$ are 1, to an MCM operation when $m > 1$ and $n$ is 1, and to a CAVM operation when $m$ is 1 and $n > 1$.
Since the constants are determined beforehand, these constant multiplications can be realized using addition, subtraction, and shift operations under the shift-adds architecture. Note that parallel shifts can be implemented using only wires in hardware without any area cost. A straightforward shift-adds design technique, called digit-based recoding (DBR) [23], can realize constant multiplications in two steps: i) define the constants under a particular number representation, such as binary or canonical signed digit (CSD)¹; ii) for the nonzero digits in the representation of the constants, shift the input variables according to the digit positions and add/subtract the shifted variables with respect to the digit values. As a simple example, consider the CMVM operation in Fig. 3(a). Its direct realization needs 4 multiplication and 2 addition operations. The DBR method finds a solution with a total of 8 adders and subtractors when the constants are defined under the CSD representation, as shown in Fig. 3(b).

¹An integer $c$ can be written in CSD using digits $d_j \in \{-1, 0, 1\}$ as $c = \sum_{j} d_j 2^j$. The nonzero digits are not adjacent and a constant is represented with a minimum number of nonzero digits under CSD.
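To make the recoding concrete, the following Python sketch (our illustration, not part of any referenced tool; the function names are hypothetical) computes the CSD digits of an integer and the shift-add terms that the DBR method would use for a single constant multiplication:

```python
def csd(c):
    """Return the CSD digits of integer c, least significant first.

    Digits are in {-1, 0, 1}, no two adjacent digits are nonzero, and the
    number of nonzero digits is minimal.
    """
    digits = []
    while c != 0:
        if c % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (c % 4)  # 1 if c = 1 (mod 4), -1 if c = 3 (mod 4)
            digits.append(d)
            c -= d
        c //= 2
    return digits

def dbr_terms(c):
    """Digit-based recoding: express c*x as signed, shifted copies of x."""
    return [(d, shift) for shift, d in enumerate(csd(c)) if d != 0]

# 23 = 32 - 8 - 1, so 23*x = (x << 5) - (x << 3) - x: two add/subtract operations
print(dbr_terms(23))  # [(-1, 0), (-1, 3), (1, 5)]
```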
The number of adders/subtractors can be further reduced by maximizing the sharing of common partial products among the constant multiplications [24, 25, 26, 17, 18, 19]. Returning to our example, the algorithm of [18] finds a solution with 4 operations by sharing a common subexpression, as shown in Fig. 3(c). Moreover, prominent algorithms that can find multiplierless designs of constant multiplications taking into account the gate-level area, delay, power dissipation, and throughput of the design are introduced in [27, 28, 29, 30]. Furthermore, efficient algorithms are proposed for the multiplierless realization of time-multiplexed constant multiplications in [31, 32, 33].
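As a hedged illustration of this sharing (the constants 23 and 27 are our own example, not those of Fig. 3), two products can reuse the common subexpression $31x$:

```python
# 23 = 32 - 8 - 1 and 27 = 32 - 4 - 1 in CSD both contain 31x = (x << 5) - x,
# so the pair costs three adders/subtractors instead of the four needed by
# two independent CSD recodings.
def mcm_23_27(x):
    t = (x << 5) - x                     # shared subexpression 31x
    return t - (x << 3), t - (x << 2)    # 23x = 31x - 8x, 27x = 31x - 4x

print(mcm_23_27(3))  # (69, 81)
```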
II-C. Related Work
For the multiplierless realization of neural networks, binary neural networks (BNNs), where weight values and activation functions are constrained to be either 1 or −1, were introduced in [8]. It is shown that BNNs drastically reduce the memory size and the number of memory accesses during training and replace multipliers with XOR operators in hardware. However, they lead to a worse accuracy when compared to conventional neural networks [9]. In [12, 9], the weights of ANNs are determined in training so as to include a small number of nonzero digits, and hence their multiplications by input variables can be realized using a small number of adders and subtractors. In [10], floating-point weights in each layer are quantized dynamically, fixed-point weights are expressed in binary representation, and the ANN is implemented in a hardware accelerator. The multiplierless hardware realization of ANNs is considered in [13], where the multiplication of weights by input variables is realized in a bit-serial fashion, defining weights under the CSD representation. In [14], for the time-multiplexed realization of the ANN design, a post-training algorithm that tunes weights to reduce the hardware complexity is introduced, and the multiplication of constant weights by input variables in each neuron at each layer is realized under the shift-adds architecture.
Under the time-multiplexed design architecture, the multiply-accumulate (MAC) block is a central operation. To reduce its high latency, a delay-efficient MAC structure, which uses accumulators and carry-save adders, was introduced in [34]. Efficient implementation of ANN designs using MAC blocks on FPGAs was introduced in [35]. Recently, MAC blocks have been used in the realization of neuromorphic cores using two models, namely axonal-based and dendritic-based [11].
III. Design Architectures
In this section, we present the parallel and time-multiplexed design architectures used to realize ANNs in hardware.
III-A. Parallel Design
Fig. 4 presents the realization of the neuron computations at the $j$th layer, where $m_j$ and $n_j$ are the number of outputs (or neurons) and inputs at this layer, respectively. Under the parallel architecture, after the ANN inputs are applied, the neuron computations at each layer are obtained concurrently, reaching the ANN outputs.
III-B. Time-Multiplexed Design
The MAC block is a fundamental operation in an ANN design under the time-multiplexed architecture. As shown in Fig. 5, it can be used to realize the neuron computation given in Fig. 1(a), reusing the multiplication and addition operations. Observe that the multiplication of a weight by an input variable is realized one at a time, synchronized by the control block, which is actually a counter, and is added to the accumulated value stored in the register. In this figure, clock and reset signals are omitted for the sake of clarity. Under this architecture, the computation of a neuron with $n$ inputs is obtained after $n$ clock cycles. The design complexity of the MAC block depends on the size of the counter and multiplexers, determined by the number of weights and input variables, on the size of the multiplier, determined by the maximum bitwidths of the input variables and weights, and on the size of the adder and register, determined by the bitwidth of the inner product of inputs and weights, i.e., $\sum_{i=1}^{n} w_i x_i + b$.
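A minimal behavioral sketch of this MAC-based neuron (illustrative Python, not the generated hardware description) is given below; one multiply-add is performed per clock cycle, so a neuron with $n$ inputs needs $n$ cycles before the activation function is applied:

```python
def mac_neuron(weights, inputs, bias, activation):
    acc = bias                # register initialized with the bias value
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC operation per clock cycle
    return activation(acc)    # applied after the last cycle

relu = lambda v: max(v, 0)
print(mac_neuron([3, -5, 2], [1, 2, 4], 7, relu))  # 3 - 10 + 8 + 7 = 8
```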
In this subsection, we present two time-multiplexed architectures to design the whole ANN using MAC blocks. Under the first architecture, called smac_neuron, each neuron at each layer is realized using a single MAC block, and under the second architecture, called smac_ann, the whole ANN is realized using a single MAC block. In the following, these architectures are described in detail.
III-B1. smac_neuron Architecture
Fig. 6 presents the neuron computations at the $j$th layer of an ANN using MAC blocks and a common control block. The control block synchronizes the multiplication of the associated weights by the input variables. Assuming that an ANN includes $m_j$ neurons at the $j$th layer, where $1 \le j \le \lambda$ and $\lambda$ denotes the number of layers, the required number of MAC blocks is $\sum_{j=1}^{\lambda} m_j$, i.e., the total number of neurons. Note that the complexity of the operations and registers in the MAC blocks is determined by the number of inputs and outputs at each layer and the weight values related to each neuron of each layer. The complexity of the control block is determined by the number of inputs at each layer. Since the neuron computations are obtained layer by layer, the neuron computations in a latter layer are started after the neuron computations in a former layer are finished. This is simply done by generating an output signal at each layer indicating that all neuron computations are obtained, which also disables the hardware to avoid unnecessary computations and enables us to reduce the power dissipation. The computation of the whole ANN with $\lambda$ layers and $n_j$ inputs at each layer, where $1 \le j \le \lambda$, is obtained after $\sum_{j=1}^{\lambda} n_j$ clock cycles.
III-B2. smac_ann Architecture
Fig. 7 shows the ANN design using a single MAC block, where clock and reset signals are omitted for the sake of clarity. In this figure, the control block includes three counters to synchronize the multiplication of a weight by an input variable, the addition of a bias value to each inner product, and the application of the activation function. These counters are associated with the number of layers, the number of inputs at each layer, and the number of outputs (or neurons) at each layer. Note that the variables $x_1, \ldots, x_{n_1}$ denote the primary inputs of the ANN and these variables are multiplied by the related weights during the computations at the first layer. While the size of the multiplexers for the input variables is determined by the maximum number of inputs over all layers, the size of the multiplexers for the weight and bias values is defined by the total number of weight and bias values, respectively. In the MAC block, the size of the multiplier is determined by the maximum bitwidth of all input variables and weights, and the sizes of the adder and register are defined by the maximum bitwidth of the multiplication of weights by input variables in the whole ANN. Moreover, the number of registers used to store the outputs at each layer is determined by the maximum number of outputs at each layer. We note that the computation of the whole ANN with $\lambda$ layers, $n_j$ inputs, and $m_j$ neurons at each layer, where $1 \le j \le \lambda$, is obtained after $\sum_{j=1}^{\lambda} n_j m_j$ clock cycles.
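The latency difference between the two time-multiplexed architectures can be summarized with the cycle counts given above; the following sketch (our illustration, using a structure from Section VII) assumes each layer is described by its number of inputs and neurons:

```python
def cycles_smac_neuron(layers):
    """smac_neuron: all neurons of a layer run in parallel, n_j cycles per layer."""
    return sum(n for n, m in layers)

def cycles_smac_ann(layers):
    """smac_ann: every multiplication is serialized, n_j * m_j cycles per layer."""
    return sum(n * m for n, m in layers)

ann = [(16, 16), (16, 10), (10, 10)]  # the 16-16-10-10 structure of Section VII
print(cycles_smac_neuron(ann))  # 42 cycles
print(cycles_smac_ann(ann))     # 516 cycles
```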
IV. Hardware-Aware Post-Training
In this section, we present a technique proposed for finding the minimum quantization value used to convert the floating-point weight and bias values to integers, and methods introduced for tuning the weight and bias values to reduce the ANN design complexity under the parallel and time-multiplexed architectures.
IV-A. Finding the Minimum Quantization Value
While converting the floating-point weight and bias values found during training to integers, we aim to reduce the bitwidths of the weight and bias values. To do so, initially, we generate a validation data set, which is used to compute the hardware accuracy, by randomly moving 30% of the training data set to this set. We note that this validation data set is also used while computing the hardware accuracy during the post-training phase described in the following subsections. The proposed technique is described as follows:

1. Set the quantization value, $q$, and the related ANN accuracy in hardware, $acc_q$, to 0.

2. Increase the value of $q$ by 1.

3. Convert each floating-point weight and bias value to an integer by multiplying it by $2^q$ and finding the least integer greater than or equal to this multiplication result.

4. Compute the $acc_q$ value on the validation data set using the integer weight and bias values.

5. If the loss of $acc_q$ with respect to the ANN accuracy in software is greater than 0.1%, go to Step 2. Otherwise, return $q$ as the minimum quantization value.
Observe that we sacrifice at most a 0.1% loss in the ANN hardware accuracy computed on the validation data set in order to use small-size weight and bias values.
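A minimal sketch of this search is given below, assuming a caller-supplied accuracy function on the validation data set; the names and the loop condition follow our reading of the steps above and are illustrative:

```python
import math

def find_min_quantization(weights, biases, accuracy_fn, software_acc):
    q = 0
    while True:
        q += 1
        scale = 2 ** q
        int_w = [math.ceil(w * scale) for w in weights]  # least integer >= w * 2^q
        int_b = [math.ceil(b * scale) for b in biases]
        acc_q = accuracy_fn(int_w, int_b)                # hardware accuracy on validation set
        if software_acc - acc_q <= 0.1:                  # tolerate at most 0.1% loss
            return q, int_w, int_b
```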
IV-B. Post-Training under the Parallel Architecture
The main idea of the post-training under the parallel architecture comes from the fact that weight values whose CSD representations have a small number of nonzero digits lead to constant multiplications with a small hardware complexity, as shown in Fig. 3. After the weight values are converted to integers using the minimum quantization value $q$, we check if the least significant nonzero digit in the CSD representation of a weight can be removed while avoiding a loss in accuracy. The tuning procedure is described as follows:

1. Set $acc_q$, computed while finding the minimum quantization value, to the best ANN accuracy in hardware, $acc_b$.

2. For each weight $w$ other than 0, find its CSD representation.

   a. Find an alternative weight $w'$ for $w$ by removing the least significant nonzero digit in the CSD representation of $w$ and compute the ANN accuracy in hardware, $acc_h$, when $w$ is replaced by $w'$.

   b. If $acc_h \ge acc_b$, replace $w$ by $w'$ and update $acc_b$.

3. If at least one weight is replaced by an alternative one, go to Step 2. Otherwise, return the weight values.
Note that in Step 2 of the tuning procedure, the number of nonzero digits in the CSD representation of an alternative weight, which replaces the original weight, is always less than that of the original weight.
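The digit-removal move of Step 2 can be sketched as follows (illustrative Python; csd() is the recoding routine sketched in Section II-B, repeated here so the example is self-contained):

```python
def csd(c):
    digits = []
    while c != 0:
        d = 0 if c % 2 == 0 else 2 - (c % 4)
        digits.append(d)
        c = (c - d) // 2
    return digits

def drop_least_significant_nonzero(w):
    """Return w with its least significant nonzero CSD digit removed."""
    for shift, d in enumerate(csd(w)):
        if d != 0:
            return w - d * (1 << shift)
    return w  # w == 0: nothing to remove

print(drop_least_significant_nonzero(23))  # 23 = 32-8-1 -> 24 = 32-8, one fewer digit
```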
IV-C. Post-Training under the Time-Multiplexed Architectures
The main idea of the post-training under the time-multiplexed architectures comes from the fact that, in the MAC block shown in Fig. 5, if all weights are multiples of $2^k$, where $k \ge 1$, the inner product can be realized as $(\sum_{i=1}^{n} (w_i/2^k) x_i) \ll k$, where $\ll$ stands for the left shift operation. Thus, the bitwidths of the constant weights to be multiplied by input variables in the MAC block, and consequently the sizes of the multiplier, adder, and register, can be reduced. After the weight values are converted to integers using the minimum quantization value $q$, under the smac_neuron architecture, for each neuron, we aim to maximize the smallest left shift value among all its weights, denoted as $mls$, while avoiding a decrease in the ANN accuracy. As a simple example, for the integer weights 6, 12, and 24, whose largest left shift values are 1, 2, and 3, respectively, the $mls$ value is computed as 1. The tuning procedure presented for the smac_neuron architecture [14] is described as follows:

1. Set $acc_q$, computed while finding the minimum quantization value, to the best ANN accuracy in hardware, $acc_b$.

2. For each neuron in the ANN design, compute the $mls$ value for its weights.

   a. For each weight $w_{ij}$ associated with the $i$th neuron and the $j$th input at that layer, find its largest left shift value, $ls_{ij}$.

   b. If $ls_{ij}$ is equal to $mls$, determine the first possible weight as $w_{ij} + 2^{ls_{ij}}$. If the bitwidth of this possible weight is less than or equal to the maximum bitwidth of the weights associated with the neuron, compute the ANN accuracy in hardware, $acc_h^+$, when $w_{ij}$ is replaced by it. Similarly, determine the second possible weight as $w_{ij} - 2^{ls_{ij}}$ and compute the related ANN hardware accuracy, $acc_h^-$.

   c. If $\max(acc_h^+, acc_h^-) \ge acc_b$, replace $w_{ij}$ by the possible weight that leads to the maximum ANN accuracy in hardware and update $acc_b$ accordingly.

   d. Otherwise, assuming that $w_{ij}$ is replaced by the possible weight that leads to the maximum ANN accuracy in hardware, change the bias value of the neuron, $b_i$, within a predetermined range and compute the ANN accuracy in hardware. If the ANN accuracy in hardware in one of these cases is greater than or equal to $acc_b$, update the values of $w_{ij}$, $b_i$, and $acc_b$ accordingly.

3. If the $mls$ value of any neuron is improved, go to Step 2. Otherwise, return the weight and bias values.
Note that in Step 2 of the tuning procedure, a possible weight, which replaces the original weight, always has a largest left shift value greater than that of the original weight.
For the smac_ann architecture, we apply a similar procedure which aims to increase the smallest left shift value among all ANN weights in the MAC block shown in Fig. 7.
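The left-shift bookkeeping used by both procedures can be sketched as follows (illustrative Python; $ls$ and $mls$ follow the definitions above):

```python
def ls(w):
    """Largest left shift of w, i.e., the largest k such that 2^k divides w."""
    return (w & -w).bit_length() - 1

def mls(weights):
    """Smallest left shift value among the nonzero weights of a neuron."""
    return min(ls(w) for w in weights if w != 0)

print(mls([6, 12, 24]))                # 1, limited by the weight 6
print(ls(12), ls(12 - 4), ls(12 + 4))  # 2 3 4: both candidate weights raise ls
```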
V. ANNs under the Shift-Adds Architecture
This section presents the multiplierless realization of ANN designs under the parallel and time-multiplexed architectures.
V-A. Multiplierless ANN Design under the Parallel Architecture
A straightforward way for the multiplierless realization of an ANN under the parallel architecture is to describe each inner product at each layer shown in Fig. 4, i.e., $\sum_{i=1}^{n_j} w_{ki} x_i$, as a CAVM operation and to implement each CAVM block independently under the shift-adds architecture. We use the algorithm of [19] to optimize the number of adders/subtractors in the multiplierless designs of these CAVM blocks.
Alternatively, as shown in Fig. 8, all inner products at a layer can be described as a single CMVM operation, and the number of adders/subtractors in the multiplierless realization of the CMVM block can be reduced using the algorithm of [18]. Thus, the possible sharing of subexpressions is increased, reducing the number of adders and subtractors in the multiplierless ANN design.
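A sketch of one such CAVM block is given below (our illustration; csd_terms() is a compact form of the CSD recoding of Section II-B, and the constants and inputs are illustrative); it evaluates the inner product using only shifts and additions/subtractions:

```python
def csd_terms(c):
    """(digit, shift) pairs of the CSD recoding of c, digits in {-1, 1}."""
    terms, shift = [], 0
    while c != 0:
        if c % 2:
            d = 2 - (c % 4)
            terms.append((d, shift))
            c -= d
        c //= 2
        shift += 1
    return terms

def cavm(constants, xs):
    """Inner product sum(c_i * x_i) using only shifts and adds/subtracts."""
    acc = 0
    for c, x in zip(constants, xs):
        for d, s in csd_terms(c):
            acc = acc + (x << s) if d > 0 else acc - (x << s)
    return acc

print(cavm([23, -5], [3, 7]))  # 23*3 - 5*7 = 34
```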
V-B. Multiplierless ANN Design under the Time-Multiplexed Architectures
Under the smac_neuron architecture, the multiplications of the associated weights by an input variable at each layer shown in Fig. 6 can be computed in an MCM block and diverted to the corresponding adders using multiplexers, as shown in Fig. 9. Rather than using an MCM block for each neuron, a single MCM block, which realizes the multiplication of all weights in a layer by an input variable, is used to increase the sharing of partial products, and thus to reduce the required number of adders/subtractors. The exact algorithm of [17] is used to find the shift-adds realization of the MCM block using a minimum number of adders/subtractors.
Similarly, the multiplierless realization of the ANN under the smac_ann architecture presented in Fig. 7 can be obtained when the multiplication of all weight values by the selected input variable is implemented using an MCM block. However, since one multiplier is replaced by a large number of adders/subtractors, such a multiplierless realization increases the hardware complexity significantly.
VI. SIMURG: The CAD Tool
In this section, we present our CAD tool, called simurg, developed to automatically generate the hardware description of an ANN under the design architectures given in Section III, using the post-training methods mentioned in Section IV and the multiplierless design techniques described in Section V.
Given the ANN structure, including the number of inputs, outputs, hidden layers, and neurons in the hidden layers and the type of activation function of the neurons in each layer, initially, the ANN is trained using a state-of-the-art technique and the weight and bias values are determined. In this study, we used three techniques to train the ANN, namely, our training algorithm [14], called zaal, PyTorch [20], and the MATLAB neural network toolbox [21]. zaal includes the conventional and stochastic gradient descent methods and the Adam optimizer [36] as iterative optimization algorithms. It has different weight initialization techniques, such as Xavier [37], He [38], and a fully random method. It includes several stopping criteria, e.g., the number of iterations, early stopping using the validation data set, and saturation of the loss function. It can define a number of activation functions for the neurons in each layer, namely sigmoid, hard sigmoid (hsig), hyperbolic tangent, hard hyperbolic tangent (htanh), linear (lin), rectified linear unit (ReLU), saturating linear (satlin), and softmax [39].

After the floating-point weight and bias values determined during the training phase are converted to integers with a given quantization value and/or the design techniques introduced in Section IV are applied, the ANN design is described in hardware automatically based on the ANN structure given to a training algorithm, the integer weight and bias values, and the design architecture, i.e., parallel, smac_neuron, or smac_ann. The activation functions used in simurg are hsig, htanh, lin, ReLU, and satlin due to their simplicity in hardware. The tool can define the multiplication of constant weights by input variables in a behavioral fashion. Also, it can find the multiplierless realizations of these constant multiplications as described in Section V. The tool also automatically generates a testbench and the necessary files to verify the ANN design, as well as the synthesis scripts. The simurg tool with a limited number of its functions is available at https://github.com/leventaksoy/ANNs.
VII. Experimental Results
In this work, we used the pen-based handwritten digit recognition problem [40] as an application. For this application, we implemented 5 feedforward ANN structures with different numbers of layers and numbers of neurons in the layers, denoted as $16\text{-}m_1\text{-}\cdots\text{-}m_\lambda$, where 16 stands for the number of ANN primary inputs and $m_j$, with $1 \le j \le \lambda$, indicates the number of neurons in the $j$th layer. Note that the activation function of each neuron in the hidden and output layers in training (hardware) was respectively htanh (htanh) and sigmoid (hsig) in our training algorithm zaal and in pytorch. In matlab, it was respectively tanh (htanh) and satlin (satlin). The activation functions were determined based on the software test accuracy found in training. The ANNs were trained using 7494 samples and tested using 3498 samples. We note that each training algorithm was run 30 times and the weight and bias values that yielded the best accuracy value were chosen. Table I presents the training and hardware design details of the different ANN structures. In this table, sta, hta, and tnzd denote the software test accuracy, the hardware test accuracy, and the total number of nonzero digits in the CSD representations of the integer weight and bias values, respectively. The floating-point weight and bias values were converted to integers using the minimum quantization value determined as described in Section IV. In the ANN hardware design, the bitwidths of the ANN inputs and outputs at each layer were determined as 8.

Table I
Structure     |  zaal              |  pytorch [20]      |  matlab [21]
              |  sta   hta   tnzd  |  sta   hta   tnzd  |  sta   hta   tnzd
16-10         |  84.6  86.0   431  |  85.5  85.1   374  |  89.1  89.3   374
16-10-10      |  94.1  93.6   855  |  95.9  95.2   950  |  95.9  95.9   857
16-16-10      |  96.0  95.9  1245  |  95.6  95.6  1338  |  96.9  95.0  1291
16-10-10-10   |  94.7  94.0  1121  |  95.8  95.6  1190  |  96.4  94.7  1121
16-16-10-10   |  96.6  96.6  1432  |  96.7  96.7  1608  |  96.6  95.2  1560
Average       |  93.2  93.2  1017  |  93.9  93.6  1092  |  95.0  94.0  1041
Observe from Table I that different training algorithms lead to ANN designs with different hardware accuracy and integer weight and bias values. However, they all yield software test accuracy values close to the hardware test accuracy values. Note that the difference between the software and hardware test accuracy is due to the quantization value, the bitwidths of the ANN inputs and outputs at each layer, and the different activation functions used in training and hardware design.
In this work, we present the gate-level results of ANN designs implemented in three different architectures, namely parallel, smac_neuron, and smac_ann, as described in Section III. In the parallel designs, to make a fair comparison with the time-multiplexed designs, flip-flops were added to the outputs of the ANN design. The ANN designs were described in Verilog and synthesized using the Cadence RTL Compiler with the TSMC 40 nm design library.
In order to explore the impact of a design architecture on the ANN hardware complexity, Figs. 10–12 respectively present the area, latency, and energy consumption results of ANN designs under the parallel, smac_neuron, and smac_ann architectures when no post-training technique is applied and the constant multiplications are described in a behavioral fashion. Note that the latency is computed as the multiplication of the clock period by the number of clock cycles needed to obtain the ANN output. The clock period was reduced iteratively using the retiming technique in the synthesis tool. The switching activity data required for the computation of power dissipation was generated using the test data in simulation. This test data set was also used to verify the ANN design. Energy consumption is computed as the multiplication of latency and power dissipation.
Observe that the weight and bias values found by different training algorithms lead to ANN designs with different hardware complexity; their impact is most clearly observed on the ANN designs under the parallel architecture, since these include a large number of constant multiplications. On the other hand, while the ANN designs under the smac_ann architecture have the smallest area, those under the parallel architecture occupy the largest area. However, the latency of the ANN designs under the parallel architecture is significantly smaller than that of the ANN designs under the time-multiplexed architectures. Moreover, the ANN designs under the smac_ann architecture consume the most energy. Note that the area, latency, and energy consumption values of the ANN designs under the smac_neuron architecture lie between those of the ANN designs under the parallel and smac_ann architectures.
Table II (parallel architecture)
Structure     |  zaal             |  pytorch [20]     |  matlab [21]
              |  hta   tnzd  CPU  |  hta   tnzd  CPU  |  hta   tnzd  CPU
16-10         |  86.2   224   111 |  86.0   184   136 |  89.0   264   113
16-10-10      |  92.9   426   338 |  93.9   421   334 |  95.3   416   342
16-16-10      |  95.1   425   851 |  94.7   469   996 |  94.9   609   590
16-10-10-10   |  93.4   456   912 |  95.0   498   931 |  94.9   550   488
16-16-10-10   |  95.2   544  1127 |  94.4   615  1254 |  95.1   693  1207
Average       |  92.6   415   668 |  92.8   437   730 |  93.8   506   548
Table III (smac_neuron architecture)
Structure     |  zaal             |  pytorch [20]     |  matlab [21]
              |  hta   tnzd  CPU  |  hta   tnzd  CPU  |  hta   tnzd  CPU
16-10         |  86.6   279   108 |  84.9   272    78 |  88.8   301    87
16-10-10      |  93.5   550   515 |  94.4   563   552 |  95.3   518   651
16-16-10      |  95.9   694   644 |  95.0   753   765 |  94.9   813   670
16-10-10-10   |  93.5   755   544 |  95.7   699  1259 |  95.0   726   813
16-16-10-10   |  95.6   816   789 |  95.9   918  1489 |  95.3   991   981
Average       |  93.0   618   520 |  93.2   641   829 |  93.9   669   641
Table IV (smac_ann architecture)
Structure     |  zaal             |  pytorch [20]     |  matlab [21]
              |  hta   tnzd  CPU  |  hta   tnzd  CPU  |  hta   tnzd  CPU
16-10         |  86.1   362    32 |  85.7   318    24 |  89.2   339    37
16-10-10      |  93.5   611   192 |  94.8   615   387 |  95.7   579   170
16-16-10      |  95.9   829   253 |  95.4   781   457 |  94.9   878   388
16-10-10-10   |  93.6   770   381 |  95.8  1057    92 |  95.1   899   168
16-16-10-10   |  96.4   960   360 |  96.5  1426   156 |  95.7  1041   618
Average       |  93.1   706   244 |  93.6   839   223 |  94.1   747   276
Tables II–IV present details of the ANN designs under the parallel, smac_neuron, and smac_ann architectures after the post-training phase, respectively. In these tables, CPU denotes the runtime of the post-training algorithms given in seconds. Observe from Table I and these tables that the high-level hardware cost value, i.e., tnzd, is reduced significantly with a little loss in the hardware accuracy. However, there are cases where the hardware test accuracy is increased after the post-training phase, such as all 16-10 ANN designs trained using zaal and the 16-10-10-10 ANN designs under the smac_neuron and smac_ann architectures trained using pytorch.
In order to explore the impact of the tuning algorithms on the ANN hardware complexity, Figs. 13–15 respectively present the gate-level results of ANN designs obtained after the weight and bias values are tuned to reduce the hardware complexity under the parallel, smac_neuron, and smac_ann architectures, where the constant multiplications are described in a behavioral fashion.
Observe from Figs. 10 and 13 that the use of a tuning algorithm can significantly reduce the hardware complexity of ANN designs under the parallel architecture. When compared to the ANN designs realized without the post-training phase, the maximum reductions in area, latency, and energy consumption are found as 65%, 44%, and 84% on the 16-16-10 ANN design trained using zaal and the 16-10-10-10 and 16-16-10 ANN designs trained using pytorch, respectively. Similar results are observed on the time-multiplexed designs. On the ANN designs under the smac_neuron architecture, the maximum reductions in area, latency, and energy consumption are found as 35%, 15%, and 34% on the 16-16-10 ANN design trained using zaal, the 16-10-10-10 ANN design trained using matlab, and the 16-10-10-10 ANN design trained using zaal, respectively. Finally, on the ANN designs under the smac_ann architecture, the maximum reductions in area, latency, and energy consumption are found as 12%, 19%, and 37% on the 16-10-10-10 ANN design trained using zaal and the 16-10-10-10 and 16-16-10-10 ANN designs trained using pytorch, respectively. Observe that the reduction in the hardware complexity of ANN designs under the parallel architecture is greater than that of the ANN designs under the time-multiplexed architectures due to the large number of multipliers and adders under the parallel architecture.
In order to compare the impact of the multiplierless design on the ANN hardware complexity, Figs. 16–18 respectively present the gate-level results of multiplierless ANN designs under the parallel architecture using CAVM blocks, of those using CMVM blocks, and of multiplierless ANN designs under the smac_neuron architecture using MCM blocks, all after the post-training phase.
Observe from Figs. 13, 16, and 17 that the multiplierless realization of CAVM and CMVM blocks significantly reduces the area of ANN designs under the parallel architecture. When compared to the ANN designs where the constant multiplications are described in a behavioral fashion after the post-training phase, the multiplierless design of ANNs using CAVM blocks leads to a maximum area reduction of 11%, obtained on the 16-10-10 ANN design trained using pytorch. Note that the multiplierless realization of ANNs using CMVM blocks leads to a larger area reduction than that using CAVM blocks. This is due to the fact that the sharing of subexpressions is increased in the CMVM block, reducing the total number of adders and subtractors. When compared to the ANN designs where the constant multiplications are described in a behavioral fashion after the post-training phase, the multiplierless design of ANNs using CMVM blocks leads to a maximum area reduction of 28%, obtained on the 16-16-10 ANN design trained using matlab. Moreover, the multiplierless design of ANNs using CMVM blocks reduces the energy consumption on average. However, in the multiplierless ANN designs, the latency is increased due to the large number of adders and subtractors in series in the CAVM and CMVM blocks. Similar results are also obtained on the multiplierless ANN designs under the smac_neuron architecture, as can be observed from Figs. 14 and 18. When compared to the ANN designs where the constant multiplications are described in a behavioral fashion after the post-training phase, the multiplierless design of ANNs under the smac_neuron architecture achieves a maximum area reduction of 20%, obtained on the 16-10-10-10 ANN design trained using matlab.
We note that the proposed post-training and multiplierless design techniques can reduce the ANN hardware complexity independently of the training algorithm, although different training methods lead to ANN designs with different hardware complexity and accuracy.
VIII. Conclusions
This article presented alternative implementations of feedforward ANNs under the parallel and time-multiplexed architectures. For each architecture, it also introduced a hardware-aware post-training phase where the weight and bias values of ANNs are tuned in order to reduce the hardware complexity, taking into account the hardware accuracy. Moreover, it proposed design techniques to implement ANNs under the shift-adds architecture without using multipliers. Furthermore, it introduced a CAD tool that can automatically generate the hardware description of the ANN design under the given ANN specifications. Experimental results indicate that the proposed post-training methods can reduce the ANN hardware complexity significantly and that the multiplierless design of ANNs can further reduce the area and energy consumption while increasing the latency slightly.
Acknowledgment
This work is supported by the TUBITAK Career project #119E507.
References
[1] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, "Medical image classification with convolutional neural network," in International Conference on Control Automation Robotics Vision, 2014, pp. 844–848.
[2] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in International Conference on Neural Information Processing Systems, 2012, pp. 1106–1114.
[4] J. Misra and I. Saha, "Artificial neural networks in hardware: A survey of two decades of progress," Neurocomputing, vol. 74, no. 1–3, pp. 239–255, 2010.
[5] R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, 1990.
[6] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1999.
[7] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in International Conference on Neural Information Processing Systems, 2015, pp. 3123–3131.
[8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv e-prints, 2016, arXiv:1602.02830.
[9] R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu, "Quantized deep neural networks for energy efficient hardware-based inference," in Asia and South Pacific Design Automation Conference, 2018, pp. 1–8.
[10] H. Tann, S. Hashemi, R. I. Bahar, and S. Reda, "Hardware-software codesign of accurate, multiplier-free deep neural networks," in Design Automation Conference, 2017, pp. 28:1–28:6.
[11] H. Park and T. Kim, "Structure optimizations of neuromorphic computing architectures for deep neural network," in Design, Automation and Test in Europe Conference and Exhibition, 2018, pp. 183–188.
[12] S. S. Sarwar, S. Venkataramani, A. Raghunathan, and K. Roy, "Multiplier-less artificial neurons exploiting error resiliency for energy-efficient neural computing," in Design, Automation and Test in Europe Conference and Exhibition, 2016, pp. 145–150.
[13] T. Szabo, L. Antoni, G. Horvath, and B. Feher, "A full-parallel digital implementation for pre-trained NNs," in International Joint Conference on Neural Networks, 2000, pp. 49–54.
[14] L. Aksoy, S. Parvin, M. E. Nojehdeh, and M. Altun, "Efficient time-multiplexed realization of feedforward artificial neural networks," in International Symposium on Circuits and Systems, 2020, accepted for publication.
[15] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons, 1999.
[16] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE International Solid-State Circuits Conference (ISSCC), 2014.
[17] L. Aksoy, E. O. Gunes, and P. Flores, "Search algorithms for the multiple constant multiplications problem: Exact and approximate," Microprocessors and Microsystems, Embedded Hardware Design, vol. 34, no. 5, pp. 151–162, 2010.
[18] L. Aksoy, E. Costa, P. Flores, and J. Monteiro, "Multiplierless design of linear DSP transforms," in VLSI-SoC: Advanced Research for Systems on Chip, S. Mir, C.-Y. Tsui, R. Reis, and O. C. S. Choy, Eds. Springer Berlin Heidelberg, 2012, pp. 73–93.
[19] L. Aksoy, P. Flores, and J. Monteiro, "ECHO: A novel method for the multiplierless design of constant array vector multiplication," in International Symposium on Circuits and Systems, 2014, pp. 1456–1459.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Conference on Neural Information Processing Systems, Autodiff Workshop, 2017.
[21] The MathWorks Inc., Deep Learning Toolbox, Natick, Massachusetts, United States, 2020. [Online]. Available: https://www.mathworks.com/help/deeplearning/
[22] O. Gustafsson, "Lower bounds for constant multiplication problems," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 11, pp. 974–978, 2007.
[23] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2003.
[24] N. Boullis and A. Tisserand, "Some optimizations of hardware multiplication by constant matrices," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1271–1282, 2005.
[25] O. Gustafsson, "A difference based adder graph heuristic for multiple constant multiplication problems," in International Symposium on Circuits and Systems, 2007, pp. 1097–1100.
[26] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, 2007.
[27] H. J. Kang and I. C. Park, "FIR filter synthesis algorithms for minimizing the delay and the number of adders," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 8, pp. 770–777, 2001.
[28] S. S. Demirsoy, A. G. Dempster, and I. Kale, "Power analysis of multiplier blocks," in International Symposium on Circuits and Systems, 2002, pp. 297–300.
[29] L. Aksoy, E. Costa, P. Flores, and J. Monteiro, "Optimization of area and delay at gate-level in multiple constant multiplications," in Euromicro Conference on Digital System Design, Architectures, Methods and Tools, 2010, pp. 3–10.
[30] M. Kumm, M. Hardieck, and P. Zipf, "Optimization of constant matrix multiplication with low power and high throughput," IEEE Transactions on Computers, vol. 66, no. 12, pp. 2072–2080, 2017.
[31] S. Demirsoy, I. Kale, and A. Dempster, "Reconfigurable multiplier constant blocks: Structures, algorithm and applications," Springer Circuits, Systems and Signal Processing, vol. 26, no. 6, pp. 793–827, 2007.
[32] L. Aksoy, P. Flores, and J. Monteiro, "Multiplierless design of folded DSP blocks," ACM Transactions on Design Automation of Electronic Systems, vol. 20, no. 1, pp. 14:1–14:24, 2014.
[33] K. Möller, M. Kumm, M. Kleinlein, and P. Zipf, "Reconfigurable constant multiplication for FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 6, pp. 927–937, 2016.
[34] Y. Seo and D. Kim, "A new VLSI architecture of parallel multiplier-accumulator based on radix-2 modified Booth algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 2, pp. 201–208, 2010.
[35] N. Nedjah, R. M. da Silva, L. M. Mourelle, and M. V. C. da Silva, "Dynamic MAC-based architecture of artificial neural networks suitable for hardware implementation on FPGAs," Neurocomputing, vol. 72, no. 10, pp. 2171–2179, 2009.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv e-prints, 2014, arXiv:1412.6980.
[37] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," arXiv e-prints, 2015, arXiv:1502.01852.
[39] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, "Activation functions: Comparison of trends in practice and research for deep learning," arXiv e-prints, 2018, arXiv:1811.03378.
[40] F. Alimoglu and E. Alpaydin, "Combining multiple representations and classifiers for pen-based handwritten digit recognition," in International Conference on Document Analysis and Recognition, 1997, pp. 637–640.