DESIGN AND IMPLEMENTATION OF A VHDL PROCESSOR FOR DCT BASED IMAGE COMPRESSION

This paper describes the design and implementation of a VHDL processor that performs the 2-D Discrete Cosine Transform (DCT) for use in image compression applications. The design flow runs from system specification to implementation on silicon, and the entire process is carried out in an advanced workstation-based design environment for digital signal processing. The software supports bit-true analysis to ensure that the designed VLSI processor satisfies the required specifications. The bit-true analysis is performed at all levels of abstraction (behavior, VHDL, etc.). The motivation behind the work is a smaller chip area, faster processing, and a lower chip cost.


INTRODUCTION
Image compression techniques can be divided into two classes: lossless and lossy compression. Lossless image compression is particularly useful in applications such as image archiving (as in the storage of legal or medical records) and facsimile transmission. However, most applications today use lossy image compression because it achieves a higher compression ratio than lossless compression, which is crucial for many image applications. Lossy image compression techniques have become particularly popular in systems with limited transmission bandwidth and storage capacity. There are various schemes and standards for lossy image compression. One of them, the Joint Photographic Experts Group (JPEG) standard, is the most widely used standard for image compression. The popularity of the DCT is growing rapidly, and it is being used in various video coding standards such as MPEG-1 and MPEG-2. Hence, the development and implementation of a DCT chip is very important and is the subject of study in this paper.
The DCT algorithm was first introduced by Ahmed et al. in 1972 [1] and remains one of the best known and most widely used transform techniques in digital signal processing, especially for image compression, because of its excellent energy compaction characteristic. This type of transform can be computed using the Fast Fourier Transform (FFT) [2]. The DCT process is applied to image blocks of 8 × 8 or 16 × 16 pixels and converts each block into a series of coefficients that define the spectral composition of the block. The 2-D DCT is a separable transform, and its computation comprises the Forward Discrete Cosine Transform (FDCT) and the Inverse Discrete Cosine Transform (IDCT). Simplified diagrams of the FDCT and IDCT processes are shown in Figure 1 and Figure 2, respectively.
With sophisticated processing schemes at hand and further promising advances in multimedia algorithm research to come, efficient VLSI implementation assumes enormous importance. Conventional DSPs are highly optimized for processing speech and audio and lack the high performance needed for video processing. Although programmable DSPs may reach higher performance levels for desktop computing, they are typically weak at signal processing and are too expensive and power-consuming for typical multimedia applications. Hence, developing the DCT chip outlined in this work is of great importance.
This paper describes the steps involved in the design and implementation of a VHDL processor for DCT based image compression. Section 2 describes the computation of the DCT, and section 3 describes the implementation of the DCT algorithm based on MATLAB and DSP software. Section 4 describes the simulation results of the algorithm using the same software, and section 5 describes the implementation of the DCT processor architecture. Section 6 presents the simulation results of the VHDL model of the processor. The comparison of simulation results is presented in section 7. The design synthesis is described in section 8. Discussion and conclusions are given at the end of the paper.

DCT COMPUTATION
A source image is first partitioned into blocks of 8 × 8 pixels each. The FDCT of each block is then computed [3] using equation (1).
Here, $N_1$ = total number of coefficients for the row matrix, $N_2$ = total number of coefficients for the column matrix, $W$ = weighting factor, and $\mathrm{Re}\{\cdot\}$ = the real part of a complex number (Fourier transform).
The FDCT employs the 2-D FFT algorithm for the transformation from the time domain to the frequency domain. The 2-D FFT, $V(k_1, k_2)$, of the image signal sequence $x(n_1, n_2)$ can be computed [3] using equation (2).
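The role of a weighting factor and of the real-part operator in an FFT-based DCT can be illustrated in one dimension. The following minimal Python sketch is not a reconstruction of equations (1) or (2); it assumes the unnormalized DCT-II definition and a zero-padded 2N-point transform, and it uses a naive DFT as a stand-in for an FFT. All function names are illustrative.

```python
import cmath
import math


def dft(seq):
    """Naive O(N^2) discrete Fourier transform, used here in place of an FFT."""
    n_pts = len(seq)
    return [
        sum(seq[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def dct_via_dft(x):
    """Unnormalized 1-D DCT-II computed as 2 * Re{W_k * V(k)}.

    V is the 2N-point DFT of x padded with N zeros and
    W_k = exp(-j*pi*k / (2N)) is the weighting factor (assumed form).
    """
    n = len(x)
    spectrum = dft(list(x) + [0.0] * n)
    return [
        2.0 * (cmath.exp(-1j * math.pi * k / (2 * n)) * spectrum[k]).real
        for k in range(n)
    ]


def dct_direct(x):
    """Unnormalized 1-D DCT-II evaluated directly from its cosine-sum definition."""
    n = len(x)
    return [
        sum(2.0 * x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        for k in range(n)
    ]
```

Because the 2-D transform is separable, applying such a 1-D routine first along the rows and then along the columns of a block gives the 2-D result up to the chosen scaling.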
The FDCT outputs represent a set of 64 DCT coefficients whose values are uniquely determined by the particular 64-point input signal. The DCT coefficient values can thus be regarded as the relative amounts of the 2-D spatial frequencies contained in the 64-point input signal. The coefficient with zero frequency is called the DC coefficient, and the remaining coefficients are called AC coefficients. Up to this stage, no compression has been carried out on the image blocks; only the representation of the source image has changed. However, this transformation usually compacts the energy of the image blocks into fewer coefficients. In the next step, the 64 DCT coefficients are quantized by the corresponding quantization factors from the quantization table, which is taken from the informative annex of the draft JPEG standard, Part 1 [4]. The quantization table is shown in Figure 3. The DC coefficient is treated separately from the 63 AC coefficients in the quantization table.
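The energy compaction described above can be checked numerically on a single block. The sketch below is a minimal Python illustration that evaluates the 2-D DCT-II directly from its definition; the orthonormal scaling and the function name `dct_2d` are assumptions made for this example, since the scaling used in the paper is not stated here.

```python
import math

N = 8  # block size used throughout the paper


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block, evaluated from the definition."""

    def scale(u):
        # Orthonormal scaling (an assumption of this sketch).
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for r in range(N):
                for s in range(N):
                    acc += (
                        block[r][s]
                        * math.cos((2 * r + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * s + 1) * v * math.pi / (2 * N))
                    )
            out[u][v] = scale(u) * scale(v) * acc
    return out
```

With this scaling, a flat block whose 64 samples all equal 100 yields a DC coefficient of 800 (eight times the block average) and AC coefficients of zero, so the entire energy of the block sits in one coefficient.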
The DC coefficient is a measure of the average value of the 64 image samples. To obtain the reconstructed image, the dequantization step is performed by multiplication with the quantization table, followed by the Inverse Fast Fourier Transform (IFFT). The relation in equation (3) can be used for the fast computation of the IDCT. Once the transform values are known, the decoded image signal $x(n_1, n_2)$ can be recovered easily using the above equations.
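The quantization and dequantization steps can be sketched as an element-wise division and multiplication. The following minimal Python example assumes that the table in Figure 3 is the example luminance table from Annex K of the JPEG standard; if the paper uses a different table, only the constant changes. The function names are illustrative.

```python
# Example luminance quantization table from Annex K of the JPEG standard
# (assumed here to correspond to the table of Figure 3).
JPEG_LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize(coeffs, table=JPEG_LUMA_Q):
    """Divide each DCT coefficient by its table entry and round to an integer level."""
    return [
        [int(round(c / q)) for c, q in zip(c_row, q_row)]
        for c_row, q_row in zip(coeffs, table)
    ]


def dequantize(levels, table=JPEG_LUMA_Q):
    """Recover approximate coefficients by multiplying each level by its table entry."""
    return [
        [lv * q for lv, q in zip(l_row, q_row)]
        for l_row, q_row in zip(levels, table)
    ]
```

Each reconstructed coefficient differs from the original by at most half of the corresponding table entry, which is where the loss in lossy compression is introduced; small high-frequency coefficients are typically rounded to zero.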
In this work, the DCT algorithm has first been applied in MATLAB simulation sessions to achieve image compression. The image data of a gray scale image ("Lena") with $2^8$ (= 256) levels in ASCII format has been used as the input image. This input image data file contains a total of 4096 sub-image blocks of 8 × 8 pixels (i.e., 262,144 pixels). The MATLAB-based pseudo-code is given in Figure 4.
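The partitioning of the input image into sub-image blocks can be sketched as follows. This is a minimal Python illustration, not the MATLAB pseudo-code of Figure 4; the function name and the 512 × 512 image size mentioned afterwards are assumptions.

```python
def split_into_blocks(image, size=8):
    """Split a 2-D list of pixels into size x size blocks in row-major block order."""
    rows, cols = len(image), len(image[0])
    if rows % size or cols % size:
        raise ValueError("image dimensions must be multiples of the block size")
    return [
        [row[c:c + size] for row in image[r:r + size]]
        for r in range(0, rows, size)
        for c in range(0, cols, size)
    ]
```

Under the assumption of a 512 × 512 input, this yields 64 × 64 = 4096 blocks of 64 pixels each, and every block can then be transformed and quantized independently.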

DESIGN FLOW OF THE DCT ALGORITHM
The design flow starts in the MATLAB environment and is followed by the specification of the system behavior in the DSP station from Mentor Graphics and in the Aldec Active HDL 3.5 environment, as shown in Figure 5. Initially, the DCT algorithm has been applied to the "Lena" image file on a PC using MATLAB software. Once confidence in the result was gained, the specification was written in the Data Flow Language (DFL) [5], an applicative language from Mentor Graphics suitable for describing DSP algorithms. After compilation, the EDA design tool is used to transform the DFL specification into a binary database called the Signal Flow Graph (SFG). The SFG does not describe the amount of memory or the type of architecture required to implement the algorithm. Thus, once an algorithm has been approved, it can be implemented in different hardware architectures without starting each time at the highest level of specification. Next, the SFG is synthesized and optimized using the Mistral2 (MS2) [...]. In addition, MS2 also generates the VHDL (VHSIC Hardware Description Language; VHSIC: Very High-Speed Integrated Circuit) model and a test bench of the processor architecture. Later, the MS2-generated VHDL model has been rewritten under the Aldec Active HDL 3.5 environment. The VHDL model is then simulated with the simulator tool. The generated VHDL code has been synthesized with the synthesis tool known as "Synplify." Finally, the processor netlist is mapped onto a target technology.

SIMULATION RESULTS
Fig. 5: Steps in the design flow.
PC Environment: Input file "Lena"; computation of DCT coefficients and their subsequent quantization in a MATLAB simulation session.
DSP Station, DFL Description: Develop the DFL specification of the design using the (DA) module in the DSP station from Mentor Graphics.
Architecture Synthesis: Develop a bit-parallel architecture and generate the VHDL model with MS2.
VHDL Simulation: Simulate the generated VHDL model with the simulator tool under the Aldec Active HDL 3.5 environment.
Logic Synthesis: Synthesize the gate level layout on a target technology using the "Synplify" tool and optimize the design for silicon area.

DSP algorithms must be simulated at different levels to ensure that the design works correctly. Time domain simulation has been performed on the design viewpoint at two levels, namely high-level and bit-true simulation. For high-level simulation, the signal flow graph (SFG) is translated into a piece of C code in which the signals become integer or floating-point variables. In this way, the functionality of the algorithm can be verified. To completely validate a DSP algorithm, however, the simulator must behave exactly like the real implementation of the system. To validate the quality of a future implementation, the design is therefore also tested through bit-true simulation.
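The difference between high-level and bit-true simulation can be illustrated on one DCT output computed as a multiply-accumulate. The sketch below is a minimal Python example under stated assumptions: a 16-bit word, 8 fractional bits for the cosine coefficients, integer pixel inputs and an accumulator wide enough not to overflow. None of these values are taken from the paper, and the function names are illustrative.

```python
import math


def to_fixed(value, frac_bits, word_bits=16):
    """Round value to a signed fixed-point integer of word_bits bits, saturating on overflow."""
    scaled = int(round(value * (1 << frac_bits)))
    low = -(1 << (word_bits - 1))
    high = (1 << (word_bits - 1)) - 1
    return max(low, min(high, scaled))


def dct_row_float(samples, k):
    """High-level model: k-th unnormalized DCT cosine sum in floating point."""
    n = len(samples)
    return sum(
        s * math.cos((2 * i + 1) * k * math.pi / (2 * n))
        for i, s in enumerate(samples)
    )


def dct_row_bit_true(samples, k, frac_bits=8):
    """Bit-true model: same sum with quantized coefficients and integer arithmetic."""
    n = len(samples)
    acc = 0  # integer accumulator, assumed wide enough not to overflow
    for i, s in enumerate(samples):
        coeff = to_fixed(math.cos((2 * i + 1) * k * math.pi / (2 * n)), frac_bits)
        acc += int(s) * coeff
    return acc / (1 << frac_bits)
```

The two models agree exactly when the coefficients are representable (for example k = 0) and otherwise differ by a small amount bounded by the coefficient rounding error, which is the kind of deviation a bit-true simulation is meant to expose before implementation.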
The FDCT and IDCT coefficients of the original "Lena" image have been computed first using MATLAB and then with the DSP station software. As an example, the final few IDCT output coefficients from a sub-image frame, obtained with the bit-true simulator under time domain simulation, are presented in Table 1. The minor difference between the MATLAB and DSP station results is attributed to the internal architecture of the PC (16-bit) and the workstation (64-bit).

IMPLEMENTATION
The DCT algorithm has been implemented as an application specific DSP processor using the MS2 module. The first step in the hardware implementation is the translation of the DFL module of the DCT algorithm into an architecture level description known as the Register Transfer Level (RTL). The MS2 module, which is an architecture synthesis tool, is employed to synthesize a processor architecture consisting of a controller, a data path and a micro program for the processor, as shown in Figure 6.
The controller consists of a micro-ROM and a combinatorial circuit, which controls the bit-parallel data path using the micro code stored in the micro-ROM. MS2 also generates the architecture netlist of the processor, which is a textual representation of the interconnection of the different blocks that constitute the bit-parallel architecture of the processor. The data path consists of bit-parallel Execution Units (EXUs), which are connected to the buses through the output buffers. The basic EXUs are an Arithmetic Logic Unit (ALU), an Address Calculation Unit (ACU), a parallel multiplier accumulator, a ROM/RAM, buffers and multiplexers. Finally, MS2 generates synthesizable VHDL code along with a VHDL test bench.
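The division of labor between a micro-programmed controller and a data path can be sketched behaviorally. The following minimal Python model is only an illustration of the concept: the three micro-operations (`CLR`, `MAC`, `OUT`), the class name and the memory layout are hypothetical and do not reflect the micro code or netlist that MS2 actually generates.

```python
def build_microprogram(n_terms):
    """Micro-ROM contents: clear the accumulator, n_terms multiply-accumulates, output."""
    rom = [("CLR", 0)]
    rom += [("MAC", addr) for addr in range(n_terms)]
    rom.append(("OUT", 0))
    return rom


class DataPath:
    """Data path with a sample RAM, a coefficient ROM and a multiplier accumulator."""

    def __init__(self, sample_ram, coeff_rom):
        self.sample_ram = sample_ram
        self.coeff_rom = coeff_rom
        self.acc = 0
        self.out = None

    def execute(self, opcode, addr):
        if opcode == "CLR":
            self.acc = 0
        elif opcode == "MAC":
            self.acc += self.sample_ram[addr] * self.coeff_rom[addr]
        elif opcode == "OUT":
            self.out = self.acc
        else:
            raise ValueError("unknown micro-operation: " + opcode)


def run(microprogram, datapath):
    """Controller: step through the micro-ROM and drive the data path one word per cycle."""
    for opcode, addr in microprogram:
        datapath.execute(opcode, addr)
    return datapath.out
```

In this picture, changing the algorithm means changing the micro-ROM contents, while the execution units of the data path stay the same.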

VHDL SIMULATION RESULTS
The Application Specific Integrated Circuit (ASIC) design process started from the MS2-generated VHDL code, in which the overall system was specified by a behavioral level VHDL description [7]. A test bench in VHDL was used to verify the correctness of the design concept. The design was tested by comparing the system's output data with the expected data computed by a DFL program, for all possible input vectors, under the Mentor Graphics environment. To obtain correct code, the MS2-generated VHDL code was rewritten under the Aldec Active HDL 3.5 environment, and its behavior was simulated with the simulator tool using all possible input vectors.
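The comparison performed by such a test bench, output data against expected reference data, can be sketched in a few lines. This is a minimal Python illustration with a hypothetical function name and an optional tolerance parameter; it is not the VHDL test bench generated by MS2.

```python
def compare_outputs(actual, expected, tolerance=0):
    """Return (index, actual, expected) for every sample deviating by more than tolerance."""
    if len(actual) != len(expected):
        raise ValueError("output and reference vectors differ in length")
    return [
        (i, a, e)
        for i, (a, e) in enumerate(zip(actual, expected))
        if abs(a - e) > tolerance
    ]
```

An empty result means that the simulated design reproduces the reference for the applied input vectors; a non-zero tolerance allows for the small numerical differences that can arise between tools.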
Once the model was correctly constructed, a functional model was developed to model the architecture of the system. Several different chip architectures were considered during this phase, based on driving factors such as high testability, small physical size, simple I/O handling, high performance and minimization of silicon area. To verify the correctness of the design, the simulation results of the VHDL code in Figure 7 have been compared with the simulation results of the DFL code for the VLSI processor implementation. The numerical data obtained through Aldec Active HDL 3.5 do not by themselves give the timing information of the design. Timing information is needed to fine-tune the design and to add constraints so that the timing goals are met. As an example, the delay between the input and the output signals is approximately 140 ns. The timing diagram of the design is given in Figure 8.

COMPARISON OF SIMULATION RESULTS
The second step in the design was to develop the 2-D DCT algorithm using the DFL programming language in Mentor Graphics. Once the algorithm was verified, a behavioral model of each functional block was prepared in the third step. This step makes it possible to test and modify the architecture of the system. After partitioning the behavioral VHDL model, the individual modules of the data path and the controller of the system were simulated using the Synplify simulation tool. The FDCT and IDCT coefficients of the original "Lena" image have been computed using the VHDL programming language. As an example, the final few IDCT output coefficients corresponding to a sub-image block, obtained with the VHDL simulator, are presented in Table 2. The experiments show that the two sets of output coefficients given by DFL and VHDL from the DSP station are almost equal.

DESIGN SYNTHESIS
Synthesis is the process of transforming one representation in the design abstraction hierarchy into another representation [8]. The "Synplify" tool is used to synthesize the compiled VHDL design into gate-level schematics for the components of the DCT chip, which consists of the FDCT and IDCT blocks of the whole design. During this process, the technology mapping option "Altera FLEX 10K architecture" has been chosen for the whole synthesis of the VHDL code. As an example, the synthesis of the components of the FDCT block is described in this paper. A total of 65 functional components of the FDCT sub-blocks have been implemented in this design. The input and output signal layout of the 65 components of the FDCT block is given in three parts in Figure 9(a), (b) and (c). In order to reduce the complexity of the design, all complex operations are subdivided into small components. Each component was compiled and mapped separately with the Synplify tool to obtain the Register Transfer Level (RTL) view and the technology view, as shown in Figure 10. During the synthesis process, the RTL view shows the actual input/output signal names of each block's components, which helps in understanding the VHDL code at the specification level.
"Synplify" is also used to optimize the gate level design for area by applying specified options. It first translates the VHDL input code into generic lower level logic building blocks such as multiplexers, decoders, registers and ALUs, for which it can determine whether logic blocks can be shared between functions for efficiency. The second step is the conversion of the generic functions into vendor-specific library cells, followed by optimization steps to meet the speed constraints and to minimize logic where speed is not critical. While synthesizing the design with "Synplify," the HDL library browser is used to synthesize the design in a hierarchical manner. The synthesized schematic also needs to be simulated to make sure that the synthesized design functions the same as the validated VHDL model.
As an example, one gate level view ("S_CAR_6969_8E8E") from the technology view of the FDCT block of the DCT chip is shown in Figure 11. Finally, the gate level design has to be downloaded into a Field Programmable Gate Array (FPGA) [9] to verify that the synthesized DCT design is correct before the ASIC processor design is fabricated.

DISCUSSION AND CONCLUSIONS
Digital number representation is one of the main issues in any DSP design. Fixed-point and floating-point representations are the most common choices in DSP computation. Floating-point arithmetic is suitable for computations involving a higher dynamic range, while fixed-point arithmetic offers higher precision for a given word length constraint.
During the experiments, the fixed-point format has been used to compute the coefficients of the image frame. Another issue is that the standard MS2 cannot handle the division operation in the Mentor Graphics environment. This caused some problems in deriving the processor architecture from the DFL specification of the DCT algorithm. The DFL specification consists of a few equations based on standard trigonometric functions such as cosine and sine. These trigonometric functions have been computed using a few operators: addition, subtraction, multiplication and division. During the design of the processor architecture, MS2 could not support the library of cosine and sine functions because of the division operator. Finally, the cosine and sine functions have been manipulated to obtain the processor architecture in the Mentor Graphics environment.
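One common way to evaluate a cosine without a run-time division is to replace each division by a multiplication with a precomputed reciprocal constant. The sketch below shows this for a truncated Taylor series in Horner form. It is an assumption about how such a manipulation could look, not the authors' actual rewriting of the DFL library functions, and the names `INV_FACT` and `cos_no_div` are hypothetical.

```python
import math

# Reciprocals of 2!, 4!, 6! and 8!. They are evaluated once when the constants
# are defined (in hardware they would be stored in a ROM), so the function
# below needs only addition, subtraction and multiplication.
INV_FACT = (0.5, 1.0 / 24.0, 1.0 / 720.0, 1.0 / 40320.0)


def cos_no_div(x):
    """Truncated Taylor series of cos(x) in Horner form, intended for |x| <= pi/2."""
    x2 = x * x
    c2, c4, c6, c8 = INV_FACT
    return 1.0 - x2 * (c2 - x2 * (c4 - x2 * (c6 - x2 * c8)))
```

For arguments within a quarter period the truncation error of this polynomial stays below 1e-4; larger angles would first have to be folded into that range using the symmetries of the cosine.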
MS2 also generates the VHDL code in parallel with the processor architecture. A few parts of the generated VHDL code were not synthesizable by the synthesis tool at the first attempt. To obtain a correct synthesis result, certain changes have been made to the VHDL code. The paper has presented a successful design and implementation of the DCT encoder and decoder algorithms, simulated through all steps with MATLAB, DFL, VHDL and synthesis software on a DSP station. The DCT coefficients produced by the VHDL model on the DSP station have been simulated with the simulator of the design tool for the implementation of the VLSI processor.

Fig. 1: Simplified diagram of an FDCT process along with the quantization process.

Fig. 8: The timing diagram of the design.
Figure 9 is divided into parts (a), (b) and (c). The input image signals (input1-input64) are fed to the component layout of the FDCT block in Figure 9(a), and the generated output signals (G1-G64) and (I1-I64) are connected to the component layout of the FDCT block (FDCT_coefficient_block) in Figure 9(b). Finally, the generated output signals (N1-N32) and (N33-N64) are used to complete the component layout of the FDCT block (FDCT_Quantization) in Figure 9(c), from which the reconstructed image signals are obtained using the functional components of the IDCT sub-blocks.

Table 1: Comparison of simulation results of DCT coefficients obtained by MATLAB and DFL. The experiments in this work show that the two sets of coefficients given by MATLAB and the DSP station are almost equal.

Table 2: Comparison of simulation results of DCT coefficients obtained by MATLAB, DFL and VHDL.