Spatial-Domain Transcoder with Reused Motion Vectors in H.264/AVC

Class Report, (ICE815) Special Topics on Image Engineering, Fall 2006

Wonsang You
KAIST ICC Munjiro, Yuseong-gu, Daejeon, 305-732, Korea
[email protected]

1. Introduction

Transcoding architectures for compressed video can be classified into the spatial-domain transcoding architecture (SDTA), the frequency-domain transcoding architecture (FDTA), and the hybrid-domain transcoding architecture (HDTA). The spatial-domain transcoding architecture has the advantage of flexible transcoding: since most parts of the encoder module and the decoder module are separated from each other, this transcoder can change the quantization step size, the picture resolution, and so on. However, this type of transcoder also suffers from high computational complexity. To reduce the complexity, common information produced by the decoder module, such as the motion vectors, can be reused by the encoder module. Since motion estimation is the most time-consuming procedure, reusing its results is very effective in reducing the transcoding time.

This report describes a software implementation of an H.264/AVC spatial-domain transcoder with reused motion vectors. I introduce the spatial-domain transcoding architecture (SDTA) of the H.264/AVC transcoder and the methodology of its software implementation. It is shown that this transcoder can change the quantization parameter faster than the basic form of spatial-domain transcoding.

2. The Structure of the Transcoder

The spatial-domain transcoding architecture with reused motion vectors is the form of transcoding in which the decoder module shares its motion information with the encoder module. Since the encoder module receives all motion information from the decoder module, it does not need to perform the motion estimation procedure. Fig. 1(b) shows a spatial-domain transcoding architecture that reuses the motion information. Unlike the basic form of spatial-domain transcoding, which performs motion estimation as shown in Fig. 1(a), it contains no motion estimation procedure. Instead, it reuses the motion information among the data decoded by its decoder module to perform the motion compensation and the variable length coding in the encoder module. The motion compensation includes the prediction of motion vectors as well as the reconstruction of each picture from the motion information and the reference pictures. Motion vectors are predicted to reduce the bit-rate of the encoded bitstream; that is, only the difference between the original motion vector and the predicted motion vector is encoded by the entropy coder (a small numerical illustration of this prediction is given after Fig. 3).

Note that the motion information is encoded by the variable length coder. Its three components, the motion vector, the reference index, and the macroblock type, are encoded by different types of variable length code. For example, the macroblock types and the reference indices are encoded by unsigned exponential Golomb coding, whereas the motion vectors are encoded by signed exponential Golomb coding.

As shown in Fig. 1(b), the spatial-domain transcoding architecture with reused motion vectors can include an optional function for adjusting the spatial or temporal resolution. This optional function consists of two modules: the spatial/temporal resolution reduction (STR) module and the MV composition and refinement (MVCR) module. STR allows the input bitstream to be transcoded with reduced resolution; MVCR adjusts the motion vectors according to the transcoded resolution.
In this report, the transcoder is designed without STR and MVCR, since our purpose is to examine changes of the quantization parameter rather than of the resolution.

Fig. 1. The spatial-domain transcoding architectures: (a) the basic form and (b) the form with reused motion vectors

Like the basic form of spatial-domain transcoder, the spatial-domain transcoder with reused motion vectors requires the input bitstream to be fully decoded by the decoder module inside the transcoder; the raw data generated by the decoder module is then used again for motion compensation, along with the motion information, in the cascaded encoder module. While the basic transcoder delivers only the decoded raw data to the encoder module, the transcoder with reused motion vectors delivers the motion information as well. Nevertheless, it reduces the computation time effectively, since motion estimation, the most time-consuming procedure, is no longer necessary for motion compensation.

3. Background

A. The Structure of the Transcoder

In this section, a software implementation of the H.264/AVC transcoder, built on the reference software JM 10.2, is introduced in detail. The overall procedure of this software is shown in Fig. 2.

Fig. 2. The overall procedure of the H.264/AVC transcoder: the configuration and input files are opened, buffers are allocated, the encoder is initialized, and then each frame is decoded, written into buffers, and re-encoded until all frames are processed

The configuration file and the encoded input file are opened and read by two functions, Configure() and OpenBitstreamFile(). After the input data is read, buffers for the motion information and for the raw data to be decoded by the decoder module are allocated as four variables, YUVb, dec_ref, dec_mv, and B8mode, by the function allocate_buffer_for_encoder() (a sketch of their layout is given at the end of this subsection). YUVb is the buffer for the decoded raw data; dec_ref and dec_mv are the buffers for the reference indices and the motion vectors; B8mode is the buffer for the 8x8 sub-block modes of each macroblock. Next, the encoder module is initialized by the function initialize_encoder_main(). The main transcoding procedure is then performed for every frame by the function decode_one_frame(), which contains the procedure for decoding all slices. This function ends with the function exit_picture(), which contains the deblocking function DeblockPicture(). The deblocking function includes the buffer-writing function writeIntoFile(), which writes the decoded raw data and the motion information into temporary buffers; in particular, the motion information is stored by the function store_motion_into_buffer(). The buffer-writing function also includes the frame-encoding function encode_sequence(), which resides in the function enc_encode_single_frame(). The relationship of these functions is shown in Fig. 2.
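The following sketch shows one plausible shape for these four buffers for a QCIF (176x144) sequence, which has 44x36 4x4 blocks and 99 macroblocks per frame; the exact types and the allocation performed in allocate_buffer_for_encoder() may differ, so this is illustrative only.

/* Hypothetical declarations of the transcoder buffers for QCIF. */
#define BLK_W  44                        /* picture width in 4x4 blocks   */
#define BLK_H  36                        /* picture height in 4x4 blocks  */
#define MB_NUM 99                        /* macroblocks per frame (11x9)  */

unsigned char YUVb[176 * 144 * 3 / 2];   /* decoded raw frame, YUV 4:2:0  */
int dec_ref[BLK_H][BLK_W];               /* reference index per 4x4 block */
int dec_mv[BLK_H][BLK_W][2];             /* (x,y) motion vector per block */
int B8mode[MB_NUM][4];                   /* 8x8 sub-block modes per MB    */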
B. Variable Length Coding

This section shows how the variable length coding is performed in the JM reference software. Since this mechanism determines which variables carry the motion information, it is necessary to understand the structure of the variable length coding and the motion compensation.

The variable length coding exists in both the decoder module and the encoder module, and it is performed macroblock by macroblock. In the decoder module, the variable length decoding (VLD) is performed by the function read_one_macroblock(), which reads the macroblock information, such as the macroblock type, the intra prediction mode, and the motion vectors, from the encoded bitstream. In the encoder module, the variable length encoding (VLE) is performed by the function writeMBLayer(), which writes the syntax elements of a macroblock into the bitstream. These syntax elements include the macroblock type, intra prediction mode, coded block pattern, quantization parameter, motion information, and residual block data.

The summarized algorithm of the VLD function read_one_macroblock() is shown in Fig. 3. In this function, the macroblock type is stored in the variable currMB->mb_type after it is read from the bitstream. Then, the motion vector mode and the predicting direction are extracted from the macroblock type by the function interpret_mb_mode_I() or interpret_mb_mode_P(). The motion vector mode represents how an inter-coded macroblock is partitioned into sub-blocks such as 8x4, 4x8, and so on; it is assigned to the variable currMB->b8mode. Likewise, the predicting direction is assigned to the variable currMB->b8pdir.

Meanwhile, the motion information is read by the function readMotionInfoFromNAL(), which includes the motion vector prediction module; its abbreviated algorithm is also shown in Fig. 3. This function reads the motion vectors and the reference indices: the reference indices are assigned to the variable dec_picture->ref_idx, and the motion vectors to the variable dec_picture->mv. The motion vector information parsed from the bitstream is actually the motion vector difference between the original motion vector and the predicted motion vector. Since the decoder module also has the prediction function SetMotionVectorPredictor(), it can generate the predicted motion vector by itself; this is why the bit-rate can be reduced by encoding not the motion vectors but the motion vector differences. The original motion vector is reconstructed from the predicted motion vector and the motion vector difference, as shown in Fig. 3: the addition of the motion vector difference curr_mvd and the predicted motion vector pmv[k] yields the reconstructed motion vector vec, which is stored in the variable dec_picture->mv.

// read MB mode: read_one_macroblock()
dP->readSyntaxElement(&currSE, img, inp, dP);
currMB->mb_type = currSE.value1;
if (img->type == P_SLICE) interpret_mb_mode_P(img);
else if (img->type == I_SLICE) interpret_mb_mode_I(img);

// read the reference indices: readMotionInfoFromNAL()
readSyntaxElement_FLC(&currSE, dP->bitstream);
refframe = 1 - currSE.value1;
dec_picture->ref_idx[LIST_0][img->block_y + j][img->block_x + i] = refframe;

// read the motion vectors: readMotionInfoFromNAL()
SetMotionVectorPredictor();
dP->readSyntaxElement(&currSE, img, inp, dP);
curr_mvd = currSE.value1;
vec = curr_mvd + pmv[k];
dec_picture->mv[LIST_0][j4+jj][i4+ii][k] = vec;

Fig. 3. The brief algorithm for reading the motion information in the decoder module
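As a small, self-contained illustration of the prediction performed by SetMotionVectorPredictor(): in the common case, H.264/AVC predicts each motion vector component as the median of the left, top, and top-right neighboring vectors, and only the difference is entropy-coded. The sketch below is simplified (it ignores unavailable neighbors and the 16x8/8x16 special cases) and uses made-up example vectors, not JM code.

#include <stdio.h>

/* component-wise median of three values */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }    /* now a <= b */
    return (c < a) ? a : (c > b) ? b : c;      /* clamp c into [a, b] */
}

int main(void)
{
    int mv_a[2] = { 4, -2 };   /* left neighbor (A)      */
    int mv_b[2] = { 6,  0 };   /* top neighbor (B)       */
    int mv_c[2] = { 2, -1 };   /* top-right neighbor (C) */
    int mv[2]   = { 5, -1 };   /* current block's vector */
    int k;
    for (k = 0; k < 2; k++) {
        int pmv = median3(mv_a[k], mv_b[k], mv_c[k]);
        /* only mvd = mv - pmv is written to the bitstream */
        printf("component %d: pmv = %d, mvd = %d\n", k, pmv, mv[k] - pmv);
    }
    return 0;
}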
Recall that inter-coded macroblocks can have various sub-block types. Nevertheless, to simplify the decoding procedure, the motion vectors are assigned to all 4x4 blocks in a macroblock; this concept is shown in Fig. 5. We may call this process the uniformization of motion vectors.

In a similar way to the variable length decoding, the variable length encoding (VLE) is performed by the function writeMBLayer(), which is called from the function write_one_macroblock(). It includes two modules: the motion information encoding module and the macroblock type encoding module.

The motion information encoding is performed by the function writeMotionInfo2NAL(), which writes the motion information into the bitstream. It consists of two functions: the reference encoding function writeReferenceFrame() and the motion vector encoding function writeMotionVector8x8(). The function writeReferenceFrame() performs the variable length encoding of the reference indices stored in the variable enc_picture->ref_idx. If the reference frame is the first preceding frame, its number is not encoded at all; if it is the second preceding frame, it is encoded with just one bit; if it is earlier than the second preceding frame, its number is encoded with the unsigned exponential Golomb method. The function writeMotionVector8x8() performs the variable length encoding of the motion vectors stored in the variable img->all_mv, which is accumulated from enc_picture->mv for every macroblock. Motion vectors are encoded not as original motion vectors but as differences between the original and the predicted motion vectors; Fig. 4 shows this relationship. In Fig. 4, curr_mvd represents the motion vector difference, while all_mv and pred_mv denote the original and the predicted motion vector. The motion vector difference is encoded with the signed exponential Golomb method.

Meanwhile, the macroblock type encoding is performed by the function writeMBLayer() itself. If the current macroblock is not in an I-slice, the macroblock type is encoded with a run-length scheme: the number of skipped macroblocks between the previous non-skipped macroblock and the current non-skipped macroblock is counted and stored in the variable img->cod_counter. Then this count img->cod_counter and the macroblock type currMB->mb_type are encoded with the unsigned exponential Golomb method (both Golomb mappings are sketched after Fig. 5).

// write the macroblock type: writeMBLayer()
currSE->value1 = img->cod_counter;
dataPart->writeSyntaxElement(currSE, dataPart);
currSE->value1 = MBType2Value(currMB->mb_type);
currSE->value1--;
dataPart->writeSyntaxElement(currSE, dataPart);

// write the reference indices: writeReferenceFrame()
ref = enc_picture->ref_idx[LIST_0][j][i];
currSE->value1 = ref;
dataPart->writeSyntaxElement(currSE, dataPart);

// write the motion vectors: writeMotionInfo2NAL()
curr_mvd = all_mv[j][i][list_idx][refindex][mv_mode][k]
         - pred_mv[j][i][list_idx][refindex][mv_mode][k];
currSE->value1 = curr_mvd;
dataPart->writeSyntaxElement(currSE, dataPart);

Fig. 4. The brief algorithm for writing the motion information in the encoder module

Fig. 5. The uniformization of motion vectors for the coding process
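For reference, the exponential Golomb mappings mentioned above can be sketched as follows. This is a minimal illustration of the standard ue(v)/se(v) mappings, not the JM entropy coder: ue(v) writes codeNum k as floor(log2(k+1)) zero bits followed by the binary representation of k+1, and se(v) first maps a signed value onto a codeNum (0, 1, -1, 2, -2, ... onto 0, 1, 2, 3, 4, ...) and then codes it as ue(v).

#include <stdio.h>

/* length in bits of the ue(v) codeword for a given codeNum */
static int ue_bits(unsigned codenum)
{
    int lead = 0;
    unsigned x = codenum + 1;
    while (x > 1) { x >>= 1; lead++; }   /* lead = floor(log2(codenum+1)) */
    return 2 * lead + 1;                 /* lead zeros + (lead+1) info bits */
}

/* signed-to-unsigned mapping used by se(v) */
static unsigned se_to_codenum(int v)
{
    return (v > 0) ? (unsigned)(2 * v - 1) : (unsigned)(-2 * v);
}

int main(void)
{
    int mvd;
    /* small motion vector differences map to short codewords */
    for (mvd = -2; mvd <= 2; mvd++)
        printf("mvd %+d -> codeNum %u (%d bits)\n",
               mvd, se_to_codenum(mvd), ue_bits(se_to_codenum(mvd)));
    return 0;
}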
C. Motion Compensation

In the encoder module, the motion information generated by the variable length decoding in the decoder module is provided to the motion compensation function as well as to the variable length encoder. The motion compensation is performed by the function LumaResidualCoding(), which obtains the inter-predicted frame and calculates the displaced frame difference (DFD). Fig. 8 shows the overall flow diagram of this function. First, the function SetModesAndRefframe() sets the mode parameters and the reference frames as shown in Fig. 6; here, the motion vector mode currMB->b8mode is extracted from the macroblock mode by the function SetModesAndRefframeForBlocks() before the residual coding is executed. Then, the residual data of the 8x8 sub-macroblocks is encoded by the function LumaResidualCoding8x8(), which calls the function LumaPrediction4x4() for each 4x4 block. This function requires the motion vectors img->all_mv as well as the mode parameters and the reference frames; the final predicted frame is stored in the variable img->mpr[y][x]. Finally, the displaced frame difference is transformed and quantized by the function dct_luma() (the DFD computation is sketched after Fig. 8).

In summary, the motion information used in the decoder module resides in the macroblock type currMB->mb_type, the reference indices dec_picture->ref_idx, the motion vector mode currMB->b8mode, and the motion vectors dec_picture->mv. In the encoder module, these correspond to the variables e_currMB->mb_type, enc_picture->ref_idx, e_currMB->b8mode, and enc_picture->mv. Accordingly, we need mediating buffers that deliver these data from the decoder module to the encoder module; these buffers are listed in Fig. 7.

*fw_ref  = enc_picture->ref_idx[LIST_0][img->block_y+j][img->block_x+i];
*bw_ref  = 0;
*fw_mode = currMB->b8mode[b8];
*bw_mode = 0;

Fig. 6. The setting of the mode parameters and the reference frame

Variable              Decoder Module         Buffer    Encoder Module
Macroblock type       currMB->mb_type        MBmode    e_currMB->mb_type
Motion vector mode    currMB->b8mode         B8mode    e_currMB->b8mode
Predicting direction  currMB->b8pdir         B8pdir    e_currMB->b8pdir
Reference index       dec_picture->ref_idx   dec_ref   enc_picture->ref_idx
Motion vector         dec_picture->mv        dec_mv    enc_picture->mv

Fig. 7. The buffers for delivering motion information from the decoder module to the encoder module

Fig. 8. The flow diagram of the function LumaResidualCoding(): the modes and reference frames are set, each 4x4 block is luma-predicted, and the displaced frame differences are passed through DCT, quantization, inverse quantization, and IDCT for reconstruction, looping over all 4x4 and 8x8 blocks
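The core of the residual coding is simple: the prediction is subtracted from the original samples before the transform. The following is a minimal sketch of that DFD computation for one 4x4 block, assuming 8-bit samples; it is illustrative only, and the function name and signature are hypothetical, not the actual loop inside LumaResidualCoding8x8().

/* Displaced frame difference for one 4x4 luma block: the values in
   dfd are what dct_luma() would subsequently transform and quantize. */
void get_dfd_4x4(const unsigned char org[4][4],
                 const unsigned char pred[4][4],
                 int dfd[4][4])
{
    int x, y;
    for (y = 0; y < 4; y++)
        for (x = 0; x < 4; x++)
            dfd[y][x] = (int)org[y][x] - (int)pred[y][x];
}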
4. Software Implementation

A. Getting the Motion Information from the Decoder Module

The motion information can be classified into five components: macroblock type, motion vector mode, predicting direction, reference indices, and motion vectors. The first three, macroblock type, motion vector mode, and predicting direction, can be obtained as each macroblock is decoded; the remaining two, reference indices and motion vectors, are stored into the buffers after each picture is decoded. The former, which we call the "macroblock information", is extracted by the function store_mbinfo_into_buffer(), which is inserted into the slice decoding function decode_one_slice() as shown in Fig. 9. In this figure, after one macroblock is decoded by the function decode_one_macroblock(), the macroblock information, namely the macroblock type currMB->mb_type, the motion vector mode currMB->b8mode, and the predicting direction currMB->b8pdir, is stored into the buffers. Note that a macroblock has four motion vector modes and four predicting directions; that is, each 8x8 sub-block of a macroblock has its own motion vector mode and predicting direction, since these are decided per 8x8 sub-block.

The latter, which we call the "motion vector information", is extracted by the function store_motion_into_buffer(), which is inserted into the buffer-writing function writeIntoFile() as shown in Fig. 9. As noted before, the function writeIntoFile() not only stores the decoded raw data and the motion information into the buffers but also triggers the encoding with different parameters through the function enc_encode_single_frame(). After a picture is decoded, the function store_motion_into_buffer() stores the motion vectors dec_picture->mv and the reference indices dec_picture->ref_idx into the buffers as shown in Fig. 9. For a QCIF picture, the horizontal and vertical sizes in 4x4 block units are 44 and 36, so the number of motion vectors per picture is 1584 (44x36).

void decode_one_slice(struct img_par *img, struct inp_par *inp)
{
  while (end_of_slice == FALSE)  // loop over macroblocks
  {
    start_macroblock(img, inp, img->current_mb_nr);
    read_flag = read_one_macroblock(img, inp);
    decode_one_macroblock(img, inp);
    store_mbinfo_into_buffer(img);
  }
  exit_slice();
}

void writeIntoFile(StorablePicture *p, struct img_par *img)
{
  …
  store_motion_into_buffer();
  enc_encode_single_frame();
  frameNumberToWrite++;
}

void store_mbinfo_into_buffer(struct img_par *img)
{
  struct macroblock *currMB = &img->mb_data[img->current_mb_nr];
  for (int k = 0; k < 4; k++)
  {
    B8mode[img->current_mb_nr][k] = currMB->b8mode[k];
    B8pdir[img->current_mb_nr][k] = currMB->b8pdir[k];
  }
  MBmode[img->current_mb_nr] = currMB->mb_type;
}

void store_motion_into_buffer()
{
  for (int by = 0; by < 36; by++)
    for (int bx = 0; bx < 44; bx++)
    {
      dec_ref[by][bx] = dec_picture->ref_idx[LIST_0][by][bx];
      dec_mv[by][bx][0] = dec_picture->mv[LIST_0][by][bx][0];
      dec_mv[by][bx][1] = dec_picture->mv[LIST_0][by][bx][1];
    }
}

Fig. 9. The algorithm for getting the motion information from the decoder module

B. Putting the Motion Information into the Encoder Module

After the motion information is obtained from the decoder module and stored in the temporary buffers, the encoder module can take it from the buffers in order to perform the variable length encoding and the motion compensation without motion estimation. The data is uploaded into the encoder module whenever a macroblock is encoded, through the function encode_one_macroblock(), as shown in Fig. 10. In this macroblock encoding function, the inter-coding part must be removed to suppress the motion estimation procedure. Notice that the cost of an inter mode should then be set to the minimum value whenever the current macroblock belongs to a P-slice or B-slice: the H.264/AVC encoder compares all possible modes and selects the one with the minimum cost as the best mode, but since a macroblock in a P-slice or B-slice already has its own motion information from the decoder module, forcing it to be inter-coded has little effect on the bit-rate efficiency.
Although a macroblock in a P-slice or B-slice is forced to be inter-coded, this is computationally cheaper than performing the full mode decision procedure. In this case, the best mode is simply set to the macroblock mode obtained from the decoder module.

void encode_one_macroblock()
{
  if (!intra)
  {
    // (removed part of inter-coding)
    init_motion_info(e_img);
    best_mode = currMB->mb_type;
    min_cost = 0;
  }
  …
}

void init_motion_info(e_ImageParameters *e)
{
  struct e_macroblock *e_currMB = &e->mb_data[e->current_mb_nr];
  int bx, by, b8, x, y;
  int nr = e->current_mb_nr;
  int block_x = (nr % 11) * 4;
  int block_y = (nr / 11) * 4;

  e_currMB->mb_type = MBmode[nr];
  for (b8 = 0; b8 < 4; b8++)
  {
    e_currMB->b8mode[b8] = B8mode[nr][b8];
    e_currMB->b8pdir[b8] = B8pdir[nr][b8];
  }
  for (by = 0; by < 4; by++)
    for (bx = 0; bx < 4; bx++)
    {
      y = block_y + by;
      x = block_x + bx;
      if (x < 44 && y < 36)
      {
        enc_picture->ref_idx[LIST_0][y][x] = dec_ref[y][x];
        enc_picture->mv[LIST_0][y][x][0] = dec_mv[y][x][0];
        enc_picture->mv[LIST_0][y][x][1] = dec_mv[y][x][1];
      }
    }
}

Fig. 10. The algorithm for putting the motion information into the encoder module

Analyzing the macroblock encoding function encode_one_macroblock() shows why the inter-coding part can be removed safely by setting the cost to the minimum value. Fig. 11 shows the brief flow diagram of this function. According to this flow diagram, if the inter-coding part were simply removed, every macroblock would be intra-coded, since an intra mode would always be selected as the best mode. If we instead set the cost of the inter mode to the minimum value, the final mode of the corresponding macroblock is decided to be the inter mode. The function SetModesAndRefframeForBlocks() sets the final mode and the reference information for a macroblock after the best mode is decided.

Fig. 11. The brief flow chart of the function encode_one_macroblock(): after initialization (RandomIntra(), init_enc_mb_params()), the best mode is chosen among inter prediction and low- or high-complexity intra prediction; then the modes and reference frames are set by SetModesAndRefframeForBlocks(), and the macroblock parameters, coefficients, and reconstruction are set by set_stored_macroblock_parameters() and SetCoeffAndReconstruction8x8() before the residual coding

5. Experiments

To test the transcoder described so far, I used part of the Foreman sequence in QCIF format, shown in Fig. 12(a). When this video is encoded directly with the quantization parameters QPISlice and QPPSlice equal to 36 and then decoded, without transcoding, we get a video like Fig. 12(b). If, instead, the encoded video is transcoded with the same quantization parameter, we get the output video shown in Fig. 12(c); this transcoding result is the same as the video encoded directly without transcoding. Fig. 12(d) shows the result of transcoding with the same quantization parameter when the motion information is reused in the encoder module; this picture is also the same as the transcoding result of Fig. 12(c).

Table 1. The characteristics of the directly encoded videos and the transcoded videos

                   QP=12       QP=36       QP=36 (with ME)   QP=36 (without ME)
  Bitstream size   21 KB       1.80 KB     1.77 KB           2.18 KB
  SNR Y (dB)       48.91       31.60       31.71             31.03
  SNR U (dB)       49.68       38.71       38.60             38.65
  SNR V (dB)       50.66       38.89      39.14             38.96
  Encoding time    3.468 sec   2.578 sec   2.547 sec         1.516 sec
  ME time          0.980 sec   1.202 sec   0.967 sec         0.000 sec
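As background for the strong dependence of bitstream size and SNR on QP in Table 1 (this relation comes from the H.264/AVC standard, not from this report): the quantizer step size doubles for every increase of 6 in QP, approximately Qstep(QP) = 0.625 * 2^(QP/6). A quick check over the QP range used here:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int qp;
    /* the step size grows by a factor of 2 every 6 QP units, i.e. by
       a factor of 16 from QP=12 to QP=36 */
    for (qp = 12; qp <= 36; qp += 6)
        printf("QP = %2d  ->  Qstep = %6.3f\n", qp, 0.625 * pow(2.0, qp / 6.0));
    return 0;
}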
Fig. 12. The comparison of the directly encoded and transcoded videos: (a) the original Foreman video, (b) a video encoded directly with quantization parameter 36, (c) a video transcoded with quantization parameter 36, and (d) a video transcoded with quantization parameter 36 with the motion information reused

Some characteristics of the encoded bitstreams in each case are shown in Table 1. The size of the bitstream transcoded with reused motion vectors is similar to that of the bitstream transcoded in the basic way, although a small difference exists. This difference appears to come from forcing the macroblocks in P-slices or B-slices to be inter-coded: even a macroblock inside a P-slice or B-slice may actually be intra-coded, and in that case a transcoding error is introduced. The SNR values are nearly identical, so the above three results with the same quantization parameter have the same distortion performance and quality, both visually and statistically. The encoding time, however, differs greatly: with reused motion vectors it is 1.516 sec versus 2.547 sec for the basic transcoding, because the very time-consuming motion estimation procedure is removed. While the motion estimation time is zero when the motion vectors are reused, the other cases take nearly one second for motion estimation. This shows that the computational complexity of the transcoder with reused motion vectors is much lower than that of the basic transcoder.

6. Conclusion

Reducing the computational complexity is an important issue in designing a transcoder so that it can serve a wider range of applications. As a kind of spatial-domain transcoding, the transcoder with reused motion vectors can reduce the total transcoding time considerably. For this purpose, the motion information is delivered from the decoder module to the encoder module through temporary buffers. This motion information consists of five components: macroblock type, motion vector mode, predicting direction, reference indices, and motion vectors. These data must be updated whenever a macroblock is decoded and re-encoded. The experiments show that the spatial-domain transcoder with reused motion vectors has the same rate-distortion performance as the basic transcoder while having lower computational complexity. Future work is the development of a frequency-domain transcoder, which may be even faster and more effective than spatial-domain transcoders.