Summary: This article analyzes the H.264 algorithm from the perspective of hardware implementation. It focuses on optimizing the prediction part, which takes up most of the computing time, and proposes improvements to the intra prediction, Hadamard transform and motion estimation algorithms. Modules with low efficiency are optimized for hardware, and data dependencies between modules are reduced. Simulations on various test sequences show that the improvements are effective.

H.264 [1] was originally drafted by ITU-T and is to become a joint standard of ITU-T and MPEG. It is expected to become the next-generation video coding standard because it provides high compression efficiency and a network-friendly interface. However, the high coding efficiency comes with an algorithm that is about four times as complex, which limits its implementation to a large extent. Improvements and optimizations must therefore be made for hardware implementation.

The initial H.264 test model (JM) [2] was designed to achieve high coding performance. It contains many algorithms that require a large amount of computation without improving coding efficiency much, and many of its operations are data dependent, which limits the use of parallel processing to accelerate the hardware. Earlier articles have analyzed the complexity of this new video coding standard [3-5]. However, these studies all derived the complexity of the H.264 algorithm through software analysis. Their results are accurate for software applications, but they no longer apply to hardware designs that use parallel processing.

1 H.264 algorithm

Figure 1 is a block diagram of the encoding algorithm for an inter-predicted picture. If intra prediction is used, the inter prediction part is not evaluated. Inter prediction uses multi-frame prediction and variable block size motion estimation.
The coding mode selection stage selects the best prediction mode among all prediction modes. After prediction, the predicted frame is subtracted from the original input frame to obtain a residual data block. A 4 × 4 integer DCT transform is applied to each luma residual block, and a 2 × 2 integer transform is applied to the DC coefficients of the chroma residual blocks. The transformed coefficients are scanned and quantized, and the quantized coefficients are entropy coded to form the output bitstream. The coding mode is also passed to the entropy encoder through the mode table. The reconstruction loop consists of inverse quantization, the inverse DCT transform and deblocking filtering. Finally, the reconstructed frame is written to the frame buffer, ready for use in later motion estimation.

Because almost all of the computing power is spent on spatial and temporal prediction, the algorithm improvements to JM 4.0 are mainly in these two parts. Both parts are implemented in hardware, so they must be optimized for hardware. The hardware system used to implement the encoder is macroblock based, which means that the encoder operates on successive macroblocks. The entire coding system can be regarded as a macroblock pipeline, so when the next macroblock starts encoding, the reconstruction of the previous macroblock may not yet be complete, which stalls the pipeline. Many commercial macroblock-based encoders use this hardware implementation model, so it is very important to deal with this problem.

2 Intra prediction

The coding block diagram in Figure 1 is similar to those of H.261, H.263 and MPEG-4. H.264 adds 4 × 4 and 16 × 16 intra prediction. Intra prediction relies on reconstructed pixel values.
In a typical macroblock-based system, the reconstructed pixel values can only be obtained after the entire encoding process has completed. This data dependency makes the hardware very difficult to implement.

2.1 4 × 4 intra prediction

Figure 2 depicts the data dependencies in 4 × 4 block intra prediction. The pixel values a to p are predicted from the pixel values A to N and Q. Capital letters represent reconstructed pixel values. Because a macroblock is composed of sixteen 4 × 4 blocks, the neighbouring pixel values cannot be reconstructed until the block they belong to has finished encoding. JM uses a two-pass algorithm to encode these blocks: to predict one 4 × 4 block, JM must first run transform, quantization, inverse quantization and inverse transform on the preceding blocks. This is too complicated for hardware and cannot be achieved at the current hardware level. The algorithm is therefore modified as follows: in all predictions, the pixel values of the reconstructed frame are replaced by the original values of the input frame. With this change, 4 × 4 intra prediction and the transform can be implemented on the macroblock pipeline.

2.2 16 × 16 intra prediction

Figure 3 shows the data dependencies of 16 × 16 intra prediction. The prediction of the current macroblock is based on the 17 pixels above the current macroblock position and the 16 pixels to its left in the reconstructed frame. Because the reconstruction of the left macroblock may not be complete when the current macroblock is predicted, original pixels are used in place of the reconstructed pixels to the left of the current macroblock.

2.3 Encoding mode selection

With the improved algorithm given above, simply replacing the reconstructed pixels with the original pixels causes errors in the selection of the encoding mode.
Figure 4 shows the rate-distortion curves for intra coding; the test sequence is "Claire" at 10 fps. Figure 4 shows that the PSNR drop caused by wrong coding mode selection is obvious. The original pixels belong to the same frame, whereas the reconstructed pixels have gone through inter or intra coding to remove redundancy, so the original pixels are more strongly correlated than the reconstructed pixels. The error produced by the improved intra prediction algorithm is therefore much larger than that of the original algorithm.

To reduce the mode selection error, the error cost function needs to be modified. The approach taken here is to add an error term that reflects the difference between the original pixels and the reconstructed pixels. Because the quantization parameter (QP) affects the mismatch between original and reconstructed pixels, the error term depends on the QP value. In H.264, as QP increases linearly, the impact of quantization on the encoding increases exponentially. To follow this trend, the error term takes the basic form a / b^(51 − QP), where a and b are coefficients to be determined.

The choice of a and b is the key to eliminating the error. In H.264, the quantization step grows by 12% for each QP level, so in theory the matching value of b is 1.12. However, the error cost function is computed in the Hadamard transform domain, where each coefficient has a different weight, and the probability distribution of the transformed coefficients is uncertain. The theoretical value of b therefore cannot be used directly, and b should be determined empirically.
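The error term a / b^(51 − QP) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and adding the term directly to the distortion of a candidate mode is an assumption, since the text only says that an error term is added to the cost function.

```python
QP_MAX = 51  # highest quantization parameter in H.264

def mismatch_penalty(qp: int, a: float, b: float) -> float:
    """Error term a / b**(51 - qp); it grows exponentially with qp."""
    if not 0 <= qp <= QP_MAX:
        raise ValueError("qp must be in the range 0..51")
    return a / b ** (QP_MAX - qp)

def intra_mode_cost(distortion: float, qp: int, a: float, b: float) -> float:
    """Hypothetical mode cost: distortion plus the mismatch penalty.

    Assumption: the penalty is simply added to the distortion measured
    against original (not reconstructed) neighbouring pixels.
    """
    return distortion + mismatch_penalty(qp, a, b)

# Theoretical base: the quantization step doubles every 6 QP levels,
# which is about 12% per level.
THEORETICAL_B = 2 ** (1 / 6)
```

With the empirical values reported in the text for 4 × 4 intra prediction (a = 80, b = 1.07), the penalty equals a at QP = 51 and shrinks as QP decreases.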
Experimental simulation shows that, for 4 × 4 intra prediction, a should be set to 80 and b to 1.07. This set of parameter values worked best across the different test sequences. As Figure 4 shows, the improved intra prediction essentially eliminates the mode selection error, and its PSNR performance is close to that of the original intra prediction algorithm.

3 Motion estimation

H.264 uses variable block size, quarter-pixel and multi-reference-frame motion estimation. During motion estimation, the starting point of the full search is determined by the motion predictor. For integer-pixel search, distortion is measured by SAD. For better results, a compensation term can be added to the SAD. Although full-search motion estimation is supported by various hardware architectures, the original H.264 choice of search range and motion predictor is not practical from a hardware point of view. The corresponding improvements are described below.

In hardware motion estimation, on-chip storage is generally used to make up for insufficient off-chip memory bandwidth. A typical method for reusing data in the search area is shown in Figure 5, where the search range is -16 to +15. The 3 × 3 block on the left of Figure 5 represents the motion estimation area of the current macroblock, and the 3 × 3 block on the right represents that of the next macroblock. The data in their overlapping region can be reused for both macroblocks, so the only newly loaded data is the rightmost 1 × 3 region. To fit this data reuse scheme, the starting point of the search area should be set at (0, 0). This change degrades video quality only when the real motion vector exceeds the search range.
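The integer-pixel full search described above, with the search window fixed around (0, 0) and a range of -16 to +15, can be sketched as follows. This is an illustrative sketch only: the function names, the frame representation (lists of rows of luma samples) and the choice to skip candidates that fall outside the frame are assumptions, not details from the article.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def full_search(cur, ref, bx, by, n=16, lo=-16, hi=15):
    """Exhaustive search centred on (0, 0); returns (best_sad, dx, dy).

    Assumption: candidates reaching outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Because the window never moves with a predictor, the reference data needed for horizontally adjacent macroblocks overlaps in a fixed pattern, which is what makes the on-chip reuse of Figure 5 possible.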
3.2 Motion predictors

In H.264, the motion predictor is used to determine the number of bits needed for the motion vector data and to calculate the compensation factor for the cost of coding the motion vector. The compensation factor is used throughout motion estimation for rate-distortion optimization. Figure 6 shows the dependencies of the motion predictor, where P1 to P4 are macroblocks that precede the current macroblock. The motion predictor of the current macroblock is computed from the motion vectors of macroblocks P1 to P4. In hardware, however, when macroblock-based processing is pipelined, the motion vector of P1 may not yet be available. To solve this problem, the dependency must be removed from the calculation of the motion predictor: only the motion vectors of macroblocks P2 to P4 are used. Only the calculation of the compensation factor used during motion estimation changes, so the improved algorithm still conforms to the H.264 standard.

3.3 Motion estimation with 1/4 pixel accuracy

In H.264, half-pixel motion estimation is performed with two-dimensional 6-tap interpolation filtering. Two-dimensional filtering requires line buffers to implement the transpose operation, and line buffers are quite complex to implement in hardware. However, when motion compensation is performed in another part of the coding loop, the motion vector of the macroblock has already been determined.

3.4 Hadamard transform

The Hadamard transform is a simple transform used to estimate the number of bits that will be produced after the actual transform. In H.264 motion estimation, SATD (the sum of absolute Hadamard-transformed differences) is used in place of SAD. If low-cost hardware is required, this part can be omitted.

4 Simulation results

The software simulation is performed on the "foreman", "grandma", "salesman" and "carphone" sequences at a frame rate of 10 frames per second.
Because of hardware considerations, the rate-distortion optimization mode is not used. Rate control is not available in JM 4.0, so the rate-distortion curves are generated by varying QP. The rate-distortion curves are shown in Figures 7 and 8.

In a macroblock-based system, the improved algorithms described above make parallel processing possible. The software simulation results show that, after the improvements to intra prediction and integer-pixel motion estimation, the decrease in PSNR is almost negligible. For low-cost systems, the simplified quarter-pixel motion estimation (QME) and the changes to the Hadamard transform are further options to consider.
It can be concluded that the key point in the implementation of H.264 hardware is the prediction part, because the calculation time occupied by this module is almost 90% of the total time. Therefore, the focus of improvement is on the prediction part.
3.1 Search range
To reduce hardware cost, a simpler method can be used to generate the quarter-pixel data. The quarter-pixel data used for motion estimation and for motion compensation do not have to be identical, but any error between them still affects coding performance, so the interpolation process cannot be simplified arbitrarily. Using bilinear interpolation instead of two-dimensional 6-tap interpolation filtering is a good compromise.
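As an illustration of the bilinear alternative, the sketch below computes one sample at a quarter-pixel position from the four surrounding integer pixels. The article does not specify the exact bilinear scheme, so the weighting, the rounding, the edge clamping and the function name are all assumptions.

```python
def bilinear_qpel(ref, x4, y4):
    """Sample `ref` at (x4 / 4, y4 / 4) using bilinear interpolation.

    `ref` is a list of rows of integer pixel values; x4 and y4 are
    non-negative coordinates in quarter-pixel units.
    Assumption: neighbours beyond the frame edge are clamped to the edge.
    """
    x0, fx = divmod(x4, 4)  # integer position and quarter-pel fraction
    y0, fy = divmod(y4, 4)
    x1 = min(x0 + 1, len(ref[0]) - 1)
    y1 = min(y0 + 1, len(ref) - 1)
    top_left, top_right = ref[y0][x0], ref[y0][x1]
    bot_left, bot_right = ref[y1][x0], ref[y1][x1]
    acc = ((4 - fx) * (4 - fy) * top_left + fx * (4 - fy) * top_right
           + (4 - fx) * fy * bot_left + fx * fy * bot_right)
    return (acc + 8) >> 4  # the weights sum to 16; round to nearest
```

Each output sample needs only a 2 × 2 neighbourhood, multiplications by small constants and a shift, so no line buffer or transpose stage is required, unlike the two-dimensional 6-tap filter.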
The simulation results show that the PSNR reduction caused by the improved intra prediction algorithm is very small. For integer-pixel motion estimation on slow-motion sequences, PSNR hardly drops at all. The changes to the QME algorithm cause a PSNR drop of approximately 0.4 to 0.6 dB, which is acceptable in low-cost systems. At 64 kbps, the PSNR of each sequence decreases by no more than 0.58 dB.
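For reference, the PSNR figures quoted above follow the usual definition for 8-bit video, 10 · log10(255² / MSE). The sketch below computes it for a single plane; the function name and the list-of-rows frame representation are assumptions made for illustration.

```python
import math

def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized planes (lists of rows)."""
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical planes
    return 10 * math.log10(peak * peak * count / squared_error)
```

On this scale, a drop of 0.5 dB corresponds to an increase in mean squared error of roughly 12%.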
Improvement of H.264 video encoding algorithm for hardware implementation