As the newest video coding standard, High Efficiency Video Coding (HEVC) is designed to minimize the bitrate for video data transfer and to support High Definition (HD) and Ultra HD video resolutions, at the cost of increased computational complexity relative to earlier standards such as H.264. Real-time video decoding with an HEVC decoder therefore becomes a challenging task. Dequantization and Inverse Transform (DE/IT), which are used to reconstruct the residual block, are among the computationally intensive modules in the HEVC decoder. In this paper, a unified hardware architecture is proposed to implement the HEVC DE/IT module for all Transform Unit (TU) block sizes: 4 × 4, 8 × 8, 16 × 16 and 32 × 32. The architecture is designed with both the High-Level Synthesis (HLS) and the Low-Level Synthesis (LLS) methods in order to compare them and determine the better method for a real-time implementation of the DE/IT module. For HLS, the C/C++ programming language is used to generate an optimized hardware design for the DE/IT module through the Xilinx Vivado HLS tool. The LLS hardware architecture is described in the VHSIC Hardware Description Language (VHDL) and uses the pipeline technique to decrease the processing time. The experimental results on the Xilinx XC7Z020 FPGA show that the LLS design increases the throughput in terms of frame rate by 80% relative to the HLS design, with a 4.4% increase in the number of Look-Up Tables (LUTs). Compared with related works in the literature, the proposed architectures demonstrate significant advantages in hardware cost and performance.

Nowadays, several consumer electronic devices such as television [

In standard video codecs, the Dequantization and Inverse Transform (DE/IT) play a very important role in reconstructing the compressed video sequences [

Recently, field-programmable gate arrays (FPGAs) have been gaining popularity for image and video processing. Indeed, modern FPGAs have sufficient resources to implement a complex application [

In the literature, many architectures have been proposed to implement the dequantization and inverse transform modules of the HEVC decoder. In fact, the design outlined in [

Hence, the aim of this paper is to provide a unified and optimized hardware architecture to implement the 2D-DE/IDCT/IDST module of the HEVC decoder. This architecture should support the 4 × 4, 8 × 8, 16 × 16 and 32 × 32 HEVC TU block sizes and offer a trade-off between performance, hardware cost and processing time. To this end, both the LLS and HLS design flows are explored to design a hardware architecture for the HEVC 2D-DE/IDCT/IDST module. The HLS 2D-DE/IDCT/IDST design is explored with the Xilinx Vivado HLS 2018.1 tool by adding specific directives (e.g., PIPELINE, RESOURCE, etc.) to the high-level C/C++ code. The LLS 2D-DE/IDCT/IDST design is developed in VHDL using the pipeline technique. The hardware architectures for both methods are mapped onto the Xilinx XC7Z020 FPGA and evaluated for processing time and hardware cost, in order to determine which design method (LLS or HLS) provides better design productivity when facing a complex algorithm such as the 2D-DE/IDCT/IDST module of the HEVC decoder.

The remainder of the paper is structured as follows. Section 2 gives an overview of the HEVC 2D-DE/IDCT/IDST module. Section 3 describes the hardware architectures designed for the 2D-DE/IDCT/IDST module using the HLS and LLS design flows. The implementation results and performance evaluation are reported in Section 4. Finally, Section 5 concludes the paper.

In HEVC, each frame is partitioned into coding tree block structure involving different sizes of large coding units (LCUs) up to 64 × 64. As illustrated in

Thus, the 2D-DE/IDCT/IDST module receives the coefficients of the TU block from the entropy decoder and applies the dequantization to restore the original transform coefficients. The dequantization scheme as specified by HEVC is given by

QP%6 | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
F(.) | 40 | 45 | 51 | 57 | 64 | 72 |
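As an illustration of how the F(.) table is used, the following is a minimal sketch of the dequantization of a single coefficient. It assumes flat scaling (no scaling lists), 8-bit video by default, and a right shift of `bit_depth + log2(N) - 9`, which is our reading of the HEVC specification and the HM reference software rather than something stated in this paper. The names `LEVEL_SCALE` and `dequantize` are hypothetical.

```python
# Level-scale values F(QP % 6) taken from the table above.
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]


def dequantize(level, qp, log2_tu_size, bit_depth=8):
    """Dequantize one quantized level (sketch, flat scaling assumed).

    Assumption: shift = bit_depth + log2(N) - 9, i.e. the standard shift
    with the flat scaling factor of 16 already folded in.
    """
    shift = bit_depth + log2_tu_size - 9
    # Scale by F(QP % 6), then multiply by 2^(QP // 6).
    scaled = (level * LEVEL_SCALE[qp % 6]) << (qp // 6)
    if shift > 0:
        # Rounding right shift.
        scaled = (scaled + (1 << (shift - 1))) >> shift
    else:
        scaled <<= -shift
    # Keep the result in the signed 16-bit range used for coefficients.
    return max(-32768, min(32767, scaled))
```

For example, with a 4 × 4 TU (`log2_tu_size = 2`) and QP = 0, a level of 1 is scaled by 40 and shifted right by one bit; increasing QP by 6 doubles the reconstructed value.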

After dequantization, the 2D-IDCT is performed. The IDCT module takes the dequantized coefficients and applies two separate 1D-IDCTs to produce the residual block. The HEVC decoder supports two types of inverse transform, the IDCT and the IDST; the IDST is applied only to the 4 × 4 TU block. During decoding, the transformed coefficients are converted back to the spatial domain via the inverse transform. According to HEVC, the 2D-IDCT/IDST can be expressed by

To decrease the implementation complexity of 2D-IDCT/IDST, Chen et al. [

In this section, we describe the HLS and LLS hardware architectures designed to implement the HEVC 2D-DE/IDCT/IDST algorithm on the Xilinx XC7Z020 FPGA. In this work, the HEVC Test Model (HM16.0) [

HLS is gaining more and more popularity, especially for FPGA designs. With HLS, it becomes possible to reduce the design and validation time of a hardware design, so multiple hardware architectures can be explored and simulated in a short time. However, HLS requires designers to restructure programs, change the source code and add specific directives to get a good result. In this context, Xilinx developed the Vivado HLS tool. This tool accepts as input a high-level programming language such as C/C++ and automatically generates an RTL hardware description as output. Through this tool, it is possible to add several directives (such as loop UNROLL, ALLOCATION, RESOURCE, etc.) in order to generate an RTL design optimized in terms of hardware cost and processing time.

For the HLS implementation of the HEVC 2D-DE/IDCT/IDST module, the C code of this module is extracted from HM16.0. The 2D-DE/IDCT/IDST algorithm is implemented with HLS based on the algorithm proposed in

At the beginning, the HLS architecture receives as input the TU size, the corresponding coefficients (at most 1024 coefficients) and the QP value. These coefficients are then dequantized to generate the transform coefficients. If the TU size is equal to 4 × 4, the 1D-IDCT4/IDST4 is first applied to the columns of the TU to generate the 1D-transform coefficients. In the second step, these coefficients are stored in the transpose memory to be used for the 2D transform. In the last step, the 1D-IDCT4/IDST4 is applied to the rows of the TU to reconstruct the residual block. If the TU size is equal to 8 × 8, 16 × 16 or 32 × 32, the 4-point odd, 8-point odd and 16-point odd modules are used with the 4-point even and butterfly modules to produce the 1D/2D-IDCT8/16/32 coefficients, respectively.
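The column pass, transposition and row pass described above can be sketched for the 4 × 4 IDCT case as follows. This is only an illustrative software model, not the HLS code of this work: the coefficients 64, 83 and 36 are the standard HEVC 4-point DCT constants, the stage shifts of 7 and 12 are assumed values for 8-bit video as used in the HM reference software, and the function names are hypothetical. The IDST4 path is omitted.

```python
def clip16(v):
    """Clip to the signed 16-bit range."""
    return max(-32768, min(32767, v))


def idct4_1d(src, shift):
    """4-point 1D inverse DCT with even-odd (butterfly) decomposition."""
    # Even part: built from coefficients 0 and 2.
    e0 = 64 * src[0] + 64 * src[2]
    e1 = 64 * src[0] - 64 * src[2]
    # Odd part: built from coefficients 1 and 3.
    o0 = 83 * src[1] + 36 * src[3]
    o1 = 36 * src[1] - 83 * src[3]
    rnd = 1 << (shift - 1)
    # Butterfly: combine even and odd parts, then round and shift.
    out = [e0 + o0, e1 + o1, e1 - o1, e0 - o0]
    return [clip16((v + rnd) >> shift) for v in out]


def idct4x4(coeffs, shift1=7, shift2=12):
    """2D inverse DCT of a 4x4 block given as a list of rows."""
    # Step 1: 1D transform applied to each column of the TU.
    cols = [[coeffs[r][c] for r in range(4)] for c in range(4)]
    stage1 = [idct4_1d(col, shift1) for col in cols]  # indexed [col][row]
    # Step 2: transposition, modelling the transpose memory.
    transposed = [[stage1[c][r] for c in range(4)] for r in range(4)]
    # Step 3: 1D transform applied to each row to get the residual block.
    return [idct4_1d(row, shift2) for row in transposed]
```

Under these assumptions, a block whose only non-zero coefficient is a DC value of 64 reconstructs to a flat residual block of ones, which is a quick sanity check for the two-pass structure.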

In order to improve the design performance, several directives are added incrementally to the HEVC 2D-DE/IDCT/IDST C code. A part of the C code developed and given as input to the Xilinx Vivado HLS 2018.1 tool is shown in

The hardware architecture depicted in

The transpose memory is used to store the intermediate coefficients between the column and row passes of the inverse transform. It can store the coefficients of all TU sizes. The memory access is optimized by concatenating eight 16-bit coefficients: in one clock cycle, it is possible to write and read 128 bits, which means eight coefficients at the same time.
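The concatenation of eight 16-bit coefficients into one 128-bit memory word can be modelled with the short sketch below. The lane ordering (coefficient 0 in the least significant 16 bits) and the function names are assumptions made for illustration; the actual VHDL design may order the lanes differently.

```python
def pack8(coeffs):
    """Concatenate eight signed 16-bit coefficients into one 128-bit word.

    Assumption: coefficient i occupies bits [16*i + 15 : 16*i].
    """
    if len(coeffs) != 8:
        raise ValueError("exactly eight coefficients are expected")
    word = 0
    for i, c in enumerate(coeffs):
        # Mask to 16 bits (two's complement) and place in lane i.
        word |= (c & 0xFFFF) << (16 * i)
    return word


def unpack8(word):
    """Split a 128-bit word back into eight signed 16-bit coefficients."""
    out = []
    for i in range(8):
        v = (word >> (16 * i)) & 0xFFFF
        # Restore the sign of the two's complement value.
        out.append(v - 0x10000 if v & 0x8000 else v)
    return out
```

Writing `pack8(column)` in one access and reading it back with `unpack8` corresponds to moving eight coefficients per clock cycle, so a 32-coefficient column would need four such accesses under this model.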

The control unit serves to share and synchronize data between all units in our design as shown in ^{st} column) to the dequantization units. Then, in the second step, the dequantization units receive the 2nd column after 3 clock cycles and the 1D-IDCT4/IDST4 processes the 1st column in 3 clock cycles. After that, in the third step, the 1st column is concatenated and stored in the transpose memory in one clock cycle, the 2nd column is processed by the inverse transform in 3 clock cycles and the 3rd column is processed by the dequantization units in 2 clock cycles. Thus, the pipeline technique is used between all units to optimize the processing time. For the 4 × 4 TU size, the TU is first processed column by column by the dequantization units and the 1D-IDCT4/IDST4, and the output coefficients of each column are stored in the transpose memory. This step needs 16 clock cycles. Then, the 1D-IDCT4/IDST4 is performed again, row by row, from the transpose memory. In the end, the DE/IDCT of the 4 × 4 TU size is obtained in 29 clock cycles. The same steps are used for the 8 × 8, 16 × 16 and 32 × 32 TU sizes and need 77, 280 and 938 clock cycles, respectively, as shown in

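Design | TU Blocks | LUTs | FFs | RAM_18K | DSP | Freq. (MHz) | Clock cycles |
---|---|---|---|---|---|---|---|
HLS | 32 × 32 | 22.7K | 18.8K | 24 | 44 | 100 | 3600 |
HLS | 16 × 16 | 22.7K | 18.8K | 24 | 44 | 100 | 935 |
HLS | 8 × 8 | 22.7K | 18.8K | 24 | 44 | 100 | 268 |
HLS | 4 × 4 | 22.7K | 18.8K | 24 | 44 | 100 | 81 |
LLS | 32 × 32 | 25K | 13.7K | 16 | 4 | 145 | 938 |
LLS | 16 × 16 | 25K | 13.7K | 16 | 4 | 145 | 280 |
LLS | 8 × 8 | 25K | 13.7K | 16 | 4 | 145 | 77 |
LLS | 4 × 4 | 25K | 13.7K | 16 | 4 | 145 | 29 |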
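To make the figures in the table easier to compare, the sketch below converts the reported clock cycles and frequencies into a per-block latency and a cycles-per-coefficient figure. It assumes that each design runs at the listed frequency and that blocks are processed one after another without overlap; these are simplifying assumptions of this illustration, not measurements from the paper, and the names used are hypothetical.

```python
# Clock cycles per TU block and clock frequency, copied from the table.
CYCLES = {
    "HLS": {4: 81, 8: 268, 16: 935, 32: 3600},
    "LLS": {4: 29, 8: 77, 16: 280, 32: 938},
}
FREQ_MHZ = {"HLS": 100, "LLS": 145}


def block_latency_us(design, n):
    """Time in microseconds to process one n x n TU block."""
    return CYCLES[design][n] / FREQ_MHZ[design]


def cycles_per_coeff(design, n):
    """Average clock cycles spent per coefficient of an n x n TU block."""
    return CYCLES[design][n] / (n * n)


def speedup(n):
    """Ratio of HLS to LLS latency for one n x n TU block."""
    return block_latency_us("HLS", n) / block_latency_us("LLS", n)


for n in (4, 8, 16, 32):
    print(n, round(block_latency_us("LLS", n), 3), round(speedup(n), 2))
```

Under these assumptions, the LLS design spends less than one clock cycle per coefficient on a 32 × 32 block, while the 4 × 4 block remains the most expensive case per coefficient for both designs.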

On the other hand, the performance of the HLS and LLS designs for the HEVC 2D-DE/IDCT/IDST is measured for several classes of video sequences: Class A (2560 × 1600), Class B (1920 × 1080), Class C (1280 × 720) and Class D (832 × 480). So, from

Comparing our HLS 2D-DE/IDCT/IDST design with the HLS design proposed in [

Ref | Algorithm | Technology | Design specification | Resource cost | FPS |
---|---|---|---|---|---|
[ | 2D-IDCT | HLS FPGA | XC6VLX550T@208 MHz | 50.5K LUTs | 1080p@54fps |
[ | 2D-DCT | HLS FPGA | XC7Z | 5.6K LUTs | 1080p@30fps |
[ | 2D-IDCT | LLS FPGA | XC7Z045@135 MHz | 34.6K LUTs | 4K@28fps |
[ | 2D-IDCT/IDST | LLS FPGA | Xilinx Zynq@222 MHz | 5.8K LUTs | 4K@30fps |
[ | 2D-IDCT/IDST | ASIC | TSMC 65nm@435 MHz | 183.6 Kgates | 8K@30fps |
[ | 2D-DE/IDCT/IDST | ASIC | TSMC 40nm@200 MHz | 126 Kgates | 4K@30fps |
[ | 2D-DE/IDCT/IDST | SW | GeForce GTX 780Ti@1046 MHz | - | 4K@15fps |
Our design | 2D-DE/IDCT/IDST | HLS FPGA | XC7Z020@100 MHz | 22.7K LUTs | 1080p@13fps |
Our design | 2D-DE/IDCT/IDST | LLS FPGA | XC7Z020@145 MHz | 25K LUTs | 1080p@65fps |

In this work, a unified hardware architecture is proposed to implement the HEVC 2D-DE/IDCT/IDST module for the 4 × 4, 8 × 8, 16 × 16 and 32 × 32 TU block sizes. Two design methods, the HLS and the LLS design flows, are used to design this hardware architecture. Our goal was to compare these two methods and to select the best architecture to implement the HEVC 2D-DE/IDCT/IDST module. The experimental results on the Xilinx XC7Z020 FPGA show that the LLS design performs better than the HLS design in terms of processing time and hardware cost. However, the performance of the HLS design depends on the selected directives and on the algorithm complexity, and HLS can be a good solution to shorten the design time and the time to market (TTM).