
An FFT/IFFT design versus Altera and Xilinx cores

C. Gonzalez-Concejero, V. Rodellar, A. Alvarez-Marquina, E. Martinez de Icaya and P. Gomez-Vilda

Departamento de Arquitectura y Tecnología de Sistemas Informáticos. Grupo de investigación en informática aplicada al procesamiento de señal e imagen.

Facultad de Informática – Universidad Politécnica de Madrid. Campus de Montegancedo s/n – Boadilla del Monte.

28630 (Madrid), Spain. [email protected]; [email protected]

Abstract

In this paper, a portable hardware design implementing a Fast Fourier Transform, oriented to its reusability as a core, is presented. The module has been developed using the radix-2 Decimation-In-Time algorithm. The design is described, simulated and synthesized using structural VHDL modeling. The module is portable among different EDA tools and technology independent. It has been synthesized with Quartus II from Altera and ISE from Xilinx. Detailed performance results are presented, as well as a comparison with the results provided by the Altera and Xilinx FFT IP cores. The comparison shows that the proposed design uses fewer physical resources but achieves lower throughput than the commercial cores. In addition, the Xilinx IP core shows better throughput than Altera's, but at a higher implementation cost.

1. Introduction

IP cores are part of the growing Electronic Design Automation (EDA) industry trend towards the reuse of previously designed components. IP cores offered by vendors are rigorously tested and optimized for the highest performance and lowest cost in programmable logic devices. These parameterized IP blocks can be implemented easily, reducing design and test time, and also time-to-market, because they avoid designing standardized functions from scratch. Ideally, these blocks should be entirely portable among different EDA tools and fully parameterizable. However, most vendors offer only their own non-portable IP cores, with many features and functionalities that are sometimes useless for a specific application.

The Fast Fourier Transform (FFT) and its inverse (IFFT) are fundamental blocks used in many applications in science and engineering, such as communications, spectrum analysis and digital signal processing. The main FPGA companies, such as Altera and Xilinx, offer FFT/IFFT cores that can be easily embedded in more complex designs with their design tools and are supported and optimized for a wide range of their device families. The FFT/IFFT v5.0 core from Xilinx allows transform sizes from 8 to 65536 samples, data precision from 8 to 24 bits, floating-point and unscaled or scaled fixed-point arithmetic, four different architectures to choose from, block or distributed RAM, run-time programmability, etc. [2]. The FFT/IFFT v8.0 core from Altera allows transform sizes from 64 to 65536 samples depending on the chosen architecture, data precision from 8 to 32 bits, floating-point and fixed-point arithmetic, embedded memory, multiple I/O data flow modes, etc. [3].

In this paper, we present a radix-2 FFT/IFFT design that supports a parameterizable transform size, fixed-point arithmetic, a pipelined structure and a parameterized data format. The synthesis results of the proposed model are compared with the FFT/IFFT cores from the vendors mentioned above, and the advantages and disadvantages of each realization are discussed.

The next section describes the principles of the FFT structure and its mathematical formulation. The architectural design is presented in Section 3. Section 4 shows the implementation results. Finally, conclusions are presented in Section 5.

2. The FFT algorithm

Audio and communications signal processing are well-established fields, massively used nowadays in many applications and products.


Since digital communications is a very active field, the arithmetic complexity of the Discrete Fourier Transform (DFT) algorithm becomes a significant factor with impact on global computational costs. Cooley and Tukey [1] developed the well-known radix-2 Fast Fourier Transform (FFT) algorithm to reduce the computational load of the DFT. It lowers the arithmetic complexity from O(N^2) to O(N log N), and the regularity of the algorithm makes it suitable for VLSI implementation. Among the different FFT approaches ([4], [5] and [6]), the fixed-radix and split-radix methods are the two most widely used. A split-radix FFT is theoretically more efficient than a fixed-radix algorithm [7], since it shows the least computational complexity among traditional FFT algorithms. However, its supporting structure renders it less suitable for implementation on digital signal processors. Unlike the irregular butterfly structure of the split-radix FFT, the fixed-radix FFT is simple to analyze and implement in hardware due to its structural regularity. Therefore, the fixed-radix FFT is by far the more widely used, although it involves more computations from the algorithmic point of view.

The N-point DFT of a sequence x(k) is defined as [8]:

$$X(n) \;=\; \sum_{k=0}^{N-1} x(k)\, W_N^{nk}, \qquad n = 0, 1, \ldots, N-1 \qquad (1)$$

where

$$W_N^{k} \;=\; e^{-j 2\pi k/N} \;=\; \cos\!\left(\frac{2\pi k}{N}\right) - j\,\sin\!\left(\frac{2\pi k}{N}\right) \qquad (2)$$

is referred to as the twiddle factor, N is the transform size and $j = \sqrt{-1}$. In turn, k depends on the number of stages and the number of samples.

Similarly the Inverse Discrete Fourier Transform (IDFT) is expressed as:

$$x(k) \;=\; \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (3)$$
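For illustration only (a direct software evaluation, unrelated to the hardware implementation), (1) and (3) can be computed by straightforward summation and cross-checked:

```python
# Illustration only: evaluate the DFT (1) and IDFT (3) by direct summation
# and check that (3) inverts (1).  Software sketch, unrelated to the hardware.
import cmath
import numpy as np

def dft(x):
    N = len(x)
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)   # W_N^e, as in (2)
    return [sum(x[k] * W(n * k) for k in range(N)) for n in range(N)]

def idft(X):
    N = len(X)
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)
    return [sum(X[n] * W(-n * k) for n in range(N)) / N for k in range(N)]

x = np.random.rand(16)
assert np.allclose(dft(x), np.fft.fft(x))   # (1) agrees with a reference FFT
assert np.allclose(idft(dft(x)), x)         # (3) recovers the input sequence
```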

The algorithm used in the present processor implementation is the Cooley-Tukey Decimation-In-Time (DIT) FFT algorithm.

The DIT algorithm first rearranges the input elements in bit-reversed order and then builds up the output transform. Figure 1 shows the form of this scrambling for an 8-point FFT; on the left, the N input data samples are arranged in bit-reversed order. As can be seen, the N-point DIT-FFT algorithm consists of log2 N stages, each stage consisting of N/2 butterfly operations [9].

The input data are multiplied by the twiddle factor. The solid dots represent addition/subtraction operations. The outputs are arranged in their natural order.

Figure 1. Signal flow graph for 8-point DIT-FFT with input scrambling.
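As an illustrative software reference model only (plain Python, not the authors' VHDL), the structure of Figure 1, a bit-reversed input permutation followed by log2 N stages of N/2 butterflies, can be sketched and checked against a library FFT:

```python
# Software sketch of the radix-2 DIT FFT of Figure 1 (illustration only).
import cmath
import numpy as np

def bit_reverse_permute(x):
    """Copy x with every element moved to the bit-reversed index (scrambling)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(x):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = v
    return out

def dit_fft(x):
    """Iterative radix-2 DIT FFT: log2(N) stages, N/2 butterflies per stage."""
    a = bit_reverse_permute([complex(v) for v in x])
    n = len(a)
    for s in range(n.bit_length() - 1):
        half = 1 << s                      # distance between inputs A and B
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))  # twiddle W^k
                a_val, b_val = a[start + k], a[start + k + half]
                a[start + k] = a_val + w * b_val         # A' = A + W*B
                a[start + k + half] = a_val - w * b_val  # B' = A - W*B
    return a

x = np.random.rand(8)
assert np.allclose(dit_fft(x), np.fft.fft(x))
```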

The DIT-FFT radix-2 butterfly is shown in Figure 2 [9]. It takes a pair of complex input data values A and B and produces a pair of complex outputs A’ and B’:

$$A = x + jX \qquad (4)$$
$$B = y + jY \qquad (5)$$

where x, y and X, Y are respectively the real and imaginary parts of the input data and:

$$A' = A + W_N^{k} B \qquad (6)$$
$$B' = A - W_N^{k} B \qquad (7)$$

Figure 2. Radix-2 butterfly structure

Taking into consideration (2), (4) and (5), equations (6) and (7) may be written as:

$$A' = \left[x + y\cos\!\left(\tfrac{2\pi k}{N}\right) + Y\sin\!\left(\tfrac{2\pi k}{N}\right)\right] + j\left[X + Y\cos\!\left(\tfrac{2\pi k}{N}\right) - y\sin\!\left(\tfrac{2\pi k}{N}\right)\right] \qquad (8)$$

$$B' = \left[x - y\cos\!\left(\tfrac{2\pi k}{N}\right) - Y\sin\!\left(\tfrac{2\pi k}{N}\right)\right] + j\left[X - Y\cos\!\left(\tfrac{2\pi k}{N}\right) + y\sin\!\left(\tfrac{2\pi k}{N}\right)\right] \qquad (9)$$
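As a small software illustration (the intermediate names are chosen to mirror the multiplier products and adder/subtractor outputs that appear later in the scheduling of Table 1), the butterfly of (8) and (9) can be computed with four real multiplications and checked against the complex form (6)-(7):

```python
# Illustration only: the butterfly of (8)-(9) with real arithmetic (four
# multiplications), cross-checked against the complex form (6)-(7).
import cmath
import math

def butterfly_real(x, X, y, Y, k, N):
    """Radix-2 DIT butterfly on (real, imag) pairs, per (8) and (9)."""
    c = math.cos(2 * math.pi * k / N)
    s = math.sin(2 * math.pi * k / N)
    m1, m2, m3, m4 = Y * c, Y * s, y * c, y * s   # the four multiplications
    s1, s2 = m3 + m2, m1 - m4                     # Re(W*B) and Im(W*B)
    return (x + s1, X + s2), (x - s1, X - s2)     # A' = A + W*B, B' = A - W*B

A, B, k, N = 1.0 + 2.0j, 0.5 - 1.5j, 3, 16
W = cmath.exp(-2j * math.pi * k / N)
(ar, ai), (br, bi) = butterfly_real(A.real, A.imag, B.real, B.imag, k, N)
assert abs(complex(ar, ai) - (A + W * B)) < 1e-12   # matches (6)
assert abs(complex(br, bi) - (A - W * B)) < 1e-12   # matches (7)
```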

3. Architectural design

The objective of this paper is to implement expressions (8) and (9) in an efficient way, keeping in mind the reusability of the resulting design as an embedded core in a potentially wide range of applications.


The design has been modeled in VHDL according to the restrictions and recommendations for high-level synthesis [10]. The design is portable among different EDA tools and technology independent. This module is designed to be integrated into a Speech Recognition System.

The FFT architecture consists of a single DIT-FFT radix-2 butterfly, a dual-port RAM memory to hold the input samples, intermediate values and results, a control unit, an address generation unit and two ROM memories to store the twiddle factors. The block diagram of the FFT is depicted in Figure 3, and the scheduling details are given in Table 1. The architecture of the FFT processor can best be understood by inspecting its operation details. The operation is partitioned into three main processes: DATA load, COMPUTE and RESULT unload. The operation cycle starts with the DATA load process, which consists of reading the input samples and loading them into the RAM memory. During the COMPUTE process, the kernel butterfly operation is calculated. Finally, in the RESULT unload process the FFT results are made available at the output, ready to be used by another application. A brief description of the main blocks is given next.

Figure 3. Block diagram of the FFT/IFFT

A. ROMs and RAM

The ROM memories store the $W_N^k$ coefficients.

The sizes of these memories are N/4, due to the symmetry properties of the trigonometric functions: the amplitudes of the sine and cosine are the same in the four quadrants and differ only in sign. According to the system workflow, two data values must be read from the RAM with a one-cycle delay between them (Table 1, cycles 0 and 1) and loaded into the butterfly unit. Meanwhile, the two outputs of the butterfly block have to be written to the RAM with a one-cycle delay between them (Table 1 (cont.), cycles 6 and 7).
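A minimal sketch of how two N/4-entry tables can cover all the twiddle angles follows; the addressing details are an assumption for illustration and are not taken from the paper:

```python
# Illustration only: serving all twiddle angles 2*pi*k/N (0 <= k < N/2) from
# two N/4-entry tables, using quadrant symmetry.  Addressing details assumed.
import math

N = 64
COS_ROM = [math.cos(2 * math.pi * i / N) for i in range(N // 4)]
SIN_ROM = [math.sin(2 * math.pi * i / N) for i in range(N // 4)]

def twiddle(k):
    """Return (cos, sin) of 2*pi*k/N from the quarter-size ROMs."""
    if k < N // 4:                       # first quadrant: direct read
        return COS_ROM[k], SIN_ROM[k]
    # second quadrant: cos(t) = -sin(t - pi/2), sin(t) = cos(t - pi/2)
    return -SIN_ROM[k - N // 4], COS_ROM[k - N // 4]

for k in range(N // 2):
    c, s = twiddle(k)
    assert abs(c - math.cos(2 * math.pi * k / N)) < 1e-12
    assert abs(s - math.sin(2 * math.pi * k / N)) < 1e-12
```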

B. Butterfly element

The butterfly is the core calculation.

It takes two data values from memory and computes two new values from them. Results are written back to the same memory locations as the inputs, since an in-place algorithm is used. This makes efficient use of the available memory, as the transformed data overwrite the input data.

The structure of the butterfly, employing a straightforward implementation of (8) and (9), requires four multipliers, three adders, three subtractors and two modules to link the real and imaginary parts of the data (Figure 4).

Figure 4. Butterfly processing architecture

The arithmetic operations involved in this block are performed according to a pipelined data-flow structure. The operations needed to calculate a butterfly take four time instants (cycles 2 to 5), as can be seen in the butterfly scheduling shown in Table 1.

C. Address generation and control units

The purpose of the Address Generation Unit (AGU) is to produce valid addresses for the RAM and ROM blocks. It also keeps track of which butterfly is being computed in which stage. The block-level description of the AGU basically consists of a log2 N-bit up counter, a ram_index generator and a rom_index generator. The counter output is used to address the RAM during the DATA load and RESULT unload processes. During the DATA load process, the data must be bit-reversed while being written, but no extra hardware is required to implement the bit reversal; it can simply be carried out by wire reversal. Moreover, the counter keeps track of the current stage in the FFT computation, and supplies the ram_index generator with the number of the stage that is currently being computed. The ram_index generator is responsible for generating addresses for the RAM during the COMPUTE process. Its input is the address provided by the counter. The addresses used to read and write the data inputs, A and B, can be calculated as follows:


cycle      |  0      |  1      |  2           |  3           |  4
-----------+---------+---------+--------------+--------------+-------------
RAM read   | A(x,X)  | B(y,Y)  | nextA1       | nextB1       | nextA2
ROM read   |         | cos φ   |              | nextcosφ1    |
ROM read   |         | sin φ   |              | nextsinφ1    |
Mult       |         |         | M1 = Y·cosφ  |              | nextM11
Mult       |         |         | M2 = Y·sinφ  |              | nextM21
Mult       |         |         | M3 = y·cosφ  |              | nextM31
Mult       |         |         | M4 = y·sinφ  |              | nextM41
+/-        |         |         |              | S1 = M3+M2   |
+/-        |         |         |              | S2 = M1-M4   |
+/-        |         |         |              |              | S3 = x+S1
+/-        |         |         |              |              | S4 = x-S1
+/-        |         |         |              |              | S5 = X+S2
+/-        |         |         |              |              | S6 = X-S2
Link       |         |         |              |              |
Link       |         |         |              |              |
RAM write  |         |         |              |              |
RAM write  |         |         |              |              |

Table 1. Butterfly scheduling (cycles 0-4)

cycle      |  5           |  6        |  7           |  8        |  9
-----------+--------------+-----------+--------------+-----------+------------
RAM read   | nextB2       | nextA3    | nextB3       | nextA4    | nextB4
ROM read   | nextcosφ2    |           | nextcosφ3    |           | nextcosφ4
ROM read   | nextsinφ2    |           | nextsinφ3    |           | nextsinφ4
Mult       |              | nextM12   |              | nextM13   |
Mult       |              | nextM22   |              | nextM23   |
Mult       |              | nextM32   |              | nextM33   |
Mult       |              | nextM42   |              | nextM43   |
+/-        | nextS11      |           | nextS12      |           | nextS13
+/-        | nextS21      |           | nextS22      |           | nextS23
+/-        |              | nextS31   |              | nextS32   |
+/-        |              | nextS41   |              | nextS42   |
+/-        |              | nextS51   |              | nextS52   |
+/-        |              | nextS61   |              | nextS62   |
Link       | A' = S3+jS5  |           | nextA'1      |           | nextA'2
Link       | B' = S4+jS6  |           | nextB'1      |           | nextB'2
RAM write  |              | A'        |              | nextA'1   |
RAM write  |              |           | B'           |           | nextB'1

Table 1 (cont.). Butterfly scheduling (cycles 5-9)

The address for B is calculated by simply changing the bit '1' to '0' in the fragment of the algorithm shown before.

The rom_index generator is responsible for producing addresses for the ROM during the COMPUTE process. It only requires knowing the current stage to generate the address.
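As an illustration only (the helper name, arguments and exact bit manipulations below are assumptions for the sketch, not the listing from the paper), a common way to derive the two RAM addresses of a radix-2 in-place FFT from the butterfly counter and the current stage, together with the corresponding twiddle index, is:

```python
# Illustrative sketch of radix-2 in-place address generation; the helper name,
# arguments and exact bit operations are assumptions, not the paper's listing.
def butterfly_addresses(counter, stage, n):
    """Map a butterfly counter (0 .. N/2-1) and stage to RAM/ROM addresses."""
    low = counter & ((1 << stage) - 1)        # bits below the stage bit
    high = (counter >> stage) << (stage + 1)  # remaining bits, shifted up one
    addr_a = high | low                       # stage bit cleared -> input A
    addr_b = addr_a | (1 << stage)            # stage bit set     -> input B
    twiddle_k = low * (n >> (stage + 1))      # exponent k of W_N^k
    return addr_a, addr_b, twiddle_k

# Stage 1 of an 8-point FFT pairs (0,2), (1,3), (4,6), (5,7):
assert [butterfly_addresses(i, 1, 8)[:2] for i in range(4)] == \
       [(0, 2), (1, 3), (4, 6), (5, 7)]
```

In this scheme the two addresses differ only in a single bit, consistent with the remark that one address is obtained from the other by flipping one bit.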

The control unit is implemented as a finite state machine with twelve states. It determines the sequence of events depending on the signals it receives from the other units, and it also generates control signals to take care of housekeeping duties, e.g., incrementing and clearing counters.

4. Implementation results

Generally speaking, it is very difficult to make a fair comparison of design performance because there is no standard benchmarking methodology for FPGAs. Current CAD tools provide a settings menu that allows exploring different trade-offs among design performance, demanded logic resources, power consumption, memory usage and compilation time. Additionally, user constraints can be included to guide the CAD tool towards better performance results, but the settings producing the best results for one design may not be appropriate for another. The compilation results presented next were obtained with default settings and no constraints. Our model has been synthesized with Quartus II v8.0 and ISE v10.1, and the results have been compared against the FFT/IFFT cores available in the DSP libraries of these CAD tools, which are v8.0 for Altera and v5.0 for Xilinx. These cores have been included using the MegaWizard Plug-In Manager tool for Altera and the CoreGen tool for Xilinx. Their structures and detailed pin counts can be found in [2], [3]. The device selection is also critical due to the differences in the FPGA inner architectures, some designs being easily implemented in a specific architecture while others are not. To take this aspect into consideration, we have chosen families from the two FPGA vendors that may be considered technologically comparable. Concerning the selection of the specific devices to implement the designs, the criterion has been to choose a device large enough to support a real-time speech recognition system (our goal application), in which the FFT IP will be an embedded block. The target devices have been the Stratix II EP2S15F484C3 from Altera and the Virtex-4 xc4vlx15-12sf363 from Xilinx. Commercial IP cores offer different configuration possibilities (arithmetic, radix, architectures, number of butterfly engines, I/O modes, etc.) that must be carefully selected to match the characteristics of our design as closely as possible in order to obtain comparable performance results. The summary of the characteristics of our design is: Decimation-In-Time (DIT) FFT algorithm, radix-2, two's-complement fixed-point arithmetic, single butterfly engine, pipelined structure, parameterized number of samples (N), data size and twiddle factors, and a structure implemented with 4 multipliers and 6 adders.


The options chosen for the commercial IP core generation were: Xilinx: unscaled (full-precision) fixed-point arithmetic, burst I/O architecture (because it uses the DIT method and radix-2), output in natural order, and data and twiddle factors in RAM; Altera: block floating-point arithmetic, burst I/O data flow (since it is the only option that generates a single-output FFT engine), number of parallel engines = 1, and a 4-multiplier/2-adder implementation. The synthesis results of our design versus the FFT cores from Altera and Xilinx are shown in Table 2 and Table 3. The results presented in both tables are for N = 64, 128, 256, 512, 1024, 2048 and 4096 samples. The data and twiddle sizes are 16 bits in all cases. For each value of N, the upper row contains the results for our design (OC) and the lower row contains the results for the vendor IP core (VC). The comparisons have been carried out in terms of physical resources, number of pins, memory occupation, DSP blocks and Fmax. The number of resources available in each device and the amount required by each particular implementation are also indicated. Note that the percentages shown in the tables are those reported by the tools, which round up or down depending on the case.

N     Core  ALUTs (12480)  Registers (12480)  Pins (343)  Mem. bits (419328)  DSP (96)  Fmax (MHz)
64    OC    228 (2%)       478 (4%)           67 (20%)    3190 (<1%)          8 (8%)    262.33
      VC    669 (5%)       1365 (11%)         85 (25%)    2560 (<1%)          8 (8%)    265.75
128   OC    234 (2%)       483 (4%)           67 (20%)    6271 (1%)           8 (8%)    260.76
      VC    709 (6%)       1486 (12%)         85 (25%)    4864 (2%)           8 (8%)    258.42
256   OC    250 (2%)       488 (4%)           67 (20%)    12424 (3%)          8 (8%)    258.4
      VC    712 (6%)       1430 (11%)         85 (25%)    9472 (3%)           8 (8%)    286.20
512   OC    253 (2%)       491 (4%)           67 (20%)    24721 (6%)          8 (8%)    251.57
      VC    743 (6%)       1532 (12%)         85 (25%)    18688 (4%)          8 (8%)    247.65
1024  OC    256 (2%)       494 (4%)           67 (20%)    49306 (12%)         8 (8%)    250.69
      VC    750 (6%)       1477 (12%)         85 (25%)    37120 (9%)          8 (8%)    254.07
2048  OC    258 (2%)       497 (4%)           67 (20%)    98467 (23%)         8 (8%)    244.62
      VC    803 (6%)       1578 (13%)         85 (25%)    73184 (18%)         8 (8%)    225
4096  OC    260 (2%)       500 (4%)           67 (20%)    164012 (49%)        8 (8%)    236.67
      VC    816 (7%)       1522 (12%)         85 (25%)    147712 (35%)        8 (8%)    246.84

Table 2. Altera results for EP2S15F484C3

By comparing the results obtained from Altera's IP with the results from this design, the following similarities and differences may be observed. In both cases, the number of demanded ALUTs remains at a similar percentage for all values of N, but the vendor IP implementation demands around three times more resources than ours. The same behavior is observed for the number of registers. The number of pins is constant for all values of N, but our design needs 18 pins fewer than the IP vendor's. Obviously, the memory demand increases proportionally with N. This parameter is better in Altera's core than in this design, and in our case it gets worse as N increases: according to the percentages given by the CAD tool, a difference of 1% for N = 256, 3% for N = 1024 and 14% for N = 4096 can be observed. The number of DSP blocks is the same in both cases.

As expected, the frequency decreases as N increases, but in the case of Altera the behavior seems erratic: it can be noticed from Table 2 that the frequency alternately increases and decreases as N doubles. At first sight the results seem to lack consistency; for N = 128 the value is F = 258.42 MHz, for N = 256 it increases to 286.20 MHz, for N = 512 it decreases to 247.65 MHz, and for N = 1024 it increases to 254.07 MHz; the same tendency can be observed for the rest of the values shown in Table 2. Analyzing these results by grouping the odd and even powers of two of N leads to a different interpretation: in both groups the frequency decreases as N increases. For odd powers of two (128, 512 and 2048) the frequency decreases through 258.42 MHz, 247.65 MHz and 225 MHz, respectively, and for even powers of two (256, 1024 and 4096) it decreases through 286.20 MHz, 254.07 MHz and 246.84 MHz, respectively. Comparing our results with these groups, it may be concluded that our results are better for the odd powers but worse for the even powers of N. The same erratic behavior may be noticed for the number of registers. The reason given by the vendor is that, for burst architectures, radix-4 decomposition is normally applied unless N is an odd power of two, in which case the FFT MegaCore automatically implements a radix-2 pass at the end to complete the transform.

N     Core  FF (12288)   LUTs (12288)  Pins (240)  Slices (6144)  RAM blocks (48)  DSP (32)  Fmax (MHz)
64    OC    465 (3%)     324 (2%)      67 (28%)    283 (4%)       3 (6%)           4 (12%)   204.03
      VC    1068 (8%)    839 (6%)      100 (41%)   749 (12%)      5 (10%)          6 (18%)   351.11
128   OC    471 (3%)     340 (2%)      67 (28%)    291 (4%)       3 (6%)           4 (12%)   203.65
      VC    1123 (9%)    888 (7%)      104 (43%)   779 (12%)      5 (10%)          6 (18%)   352.66
256   OC    477 (3%)     346 (2%)      67 (28%)    295 (4%)       3 (6%)           4 (12%)   201.46
      VC    1191 (9%)    944 (7%)      108 (45%)   836 (13%)      5 (10%)          6 (18%)   351.22
512   OC    481 (3%)     353 (2%)      67 (28%)    301 (4%)       3 (6%)           4 (12%)   201.33
      VC    1275 (10%)   1035 (8%)     112 (46%)   903 (14%)      5 (10%)          6 (18%)   351.15
1024  OC    485 (3%)     361 (2%)      67 (28%)    305 (4%)       4 (8%)           4 (12%)   200.25
      VC    1331 (10%)   1098 (8%)     116 (48%)   941 (15%)      6 (10%)          6 (18%)   350.04
2048  OC    489 (3%)     368 (2%)      67 (28%)    310 (5%)       6 (12%)          4 (12%)   200.20
      VC    1408 (11%)   1172 (9%)     120 (50%)   1003 (16%)     9 (18%)          6 (18%)   350.02
4096  OC    493 (4%)     375 (3%)      67 (28%)    314 (5%)       12 (25%)         4 (12%)   200.15
      VC    1476 (12%)   1229 (10%)    124 (51%)   1043 (16%)     15 (31%)         6 (18%)   349.95

Table 3. Xilinx results for xc4vlx15-12sf363

Concerning the results for Xilinx shown in Table 3, in our solution the demand for slice flip-flops shows an almost constant percentage for all values of N, whereas in the commercial IP core these resources increase with N, and it can be noticed that the difference between the two designs grows as N increases. A similar behavior is observed for the number of LUTs and occupied slices. In our case, the number of pins remains the same (67) for all values of N, while the Xilinx IP requires 4 additional pins each time N doubles. In both implementations the RAM blocks remain at around the same percentage (6% and 10%, respectively) up to 512 samples and increase for the remaining values in the table. The number of DSP blocks is the same for all values of N, but our implementation uses two DSP blocks fewer than Xilinx's.


For all the physical resources discussed above, our implementation gives better results than the Xilinx IP, but the latter achieves much better Fmax, surpassing our solution by around 150 MHz. Concerning latency, our design presents poor results compared with the commercial ones because it needs 4 cycles to compute one butterfly, whereas the commercial cores need only one. The total estimated number of cycles and the throughput for computing a complete FFT of 256 and 1024 samples are given in Table 4.

                             Altera OC   Xilinx OC   Altera IP   Xilinx IP
N = 256    Cycles                 4352        4352        1626        1670
           Throughput (µs)       16.81       21.60        5.68        4.75
N = 1024   Cycles                21504       21504        7277        7354
           Throughput (µs)       85.77      107.38       28.64       21.00

Table 4. Throughput for 256 and 1024 samples

The throughput of our core is better when implemented on the Altera device than on the Xilinx one; the vendor IP cores are around 3 times faster than ours on Altera and between 4.5 and 5 times faster on Xilinx. Comparing the Xilinx and Altera IPs, their latencies are similar, but Xilinx achieves higher frequencies and better throughput results.

5. Summary and conclusions

This paper presents an N-point FFT/IFFT architecture which is portable among different EDA tools and technology independent. The design is oriented to its reusability as a core. The performance of the design has been compared with the commercial cores provided by the Altera and Xilinx vendors; those cores were configured with the closest possible characteristics to our design in order to make the results comparable. Our design gives better results in terms of the physical resources demanded, but its throughput is poorer than that of the commercial IP implementations. Concerning the commercial IP cores, Xilinx gives better throughput than Altera. The implementation cost between them is difficult to evaluate fairly because the FPGAs' inner structures are different, but as a first approach, taking the results of our design as a reference, the Xilinx implementation seems to be more costly. Along with these performance results come other considerations which need to be evaluated to select the best approach depending on the system requirements, such as ease of implementation, cost and performance. Generating a design from a commercial IP core is as easy as pressing a button, but there is no control over the design, because it is provided as a black box.

Commercial cores offer a variety of features and functionalities to be configured, and their implementations are supposedly optimized for a subset of the vendor's devices, giving the best performance on them, but they lack portability. Besides the economic cost, the system requirements may demand less performance than that offered by commercial IP cores, which is the case for the present application. Our FFT design has been integrated as part of a Speech Recognition System for isolated commands and implemented in an FPGA together with the other parts of the system, such as end-point detection, MFCC feature extraction and HMM modeling. In this case, low physical-resource usage, which allows the full system to fit in the same FPGA, is more important than the other criteria, as long as real-time processing is achieved; this condition is fulfilled by the design described in this paper.

6. Acknowledgments

This work was supported by grants CCG06-UPM/INF28 and TEC 2006-12887-C02-00 from the Plan Nacional de I+D, Ministry of Education and Science, and by Project HESPERIA (http://www.proyecto-hesperia.org) from the CENIT Programme, Ministry of Industry, Spain.

7. References

[1] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 1965.
[2] www.xilinx.com/support/documentation/ip_documentation/xfft.pdf
[3] www.altera.com/literature/ug/ug_fft.pdf
[4] J.-Y. Oh and M.-S. Lim, "A radix-2^4 SDF pipeline FFT processor for OFDM modulation," in The First IEEE VTS APWCS (Asia Pacific Wireless Communications Symposium), January 2004.
[5] L. Jia, Y. Gao, J. Isoaho and H. Tenhunen, "A new VLSI-oriented FFT algorithm and implementation," in Proc. IEEE ASIC Conf., 1998, pp. 337-341.
[6] S. Bouguezel, M. O. Ahmad and M. N. S. Swamy, "An efficient split-radix FFT algorithm," in Proc. Int. Symp. Circuits and Systems, 2003, pp. 65-68.
[7] S. G. Johnson and M. Frigo, "A modified split-radix FFT with fewer arithmetic operations," IEEE Transactions on Signal Processing, 2007, pp. 111-119.
[8] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, 2nd ed., Macmillan, New York, 1992.
[9] B. W. Jervis and E. C. Ifeachor, Digital Signal Processing: A Practical Approach, Addison-Wesley, Reading, MA, 1993.
[10] M. Keating and P. Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs, 3rd ed., Kluwer Academic Publishers, 2002.
