Audio data structuring in the cook codec
The rm file header parameters of interest to an audio decoder include frame_size (also known as avg_packet_size), block_align, sub_packet_size and sub_packet_h.
The mystery of frames, packets and sub-packets
In cook, a packet is a logical unit for storing audio frames. One packet typically contains several mixed (interleaved) frames, which rm calls sub_packets.
For almost any rm audio file, block_align == avg_packet_size, which is also synonymous with frame_size in the rm header. The 'regular' audio frame, that is, an audio buffer that could be sent to a decoder, is called a sub_packet. In this context, then, rm's frame_size is the size of one logical unit of multiple frames (a packet), and sub_packet_size is the size of a regular audio frame.
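The relationship between the two sizes can be expressed as a one-line calculation. The following is a minimal C sketch; the function name is hypothetical, and it assumes that frame_size is a whole multiple of sub_packet_size.

```c
#include <assert.h>

/* Number of regular audio frames (sub_packets) stored in one rm packet.
 *   frame_size      : rm frame_size (== avg_packet_size), bytes per packet
 *   sub_packet_size : bytes per audio frame
 * Assumption: frame_size is a whole multiple of sub_packet_size, so the
 * integer division below is exact. */
static int sub_packets_per_packet(int frame_size, int sub_packet_size)
{
    assert(frame_size > 0 && sub_packet_size > 0);
    return frame_size / sub_packet_size;
}
```

For example, with made-up sizes frame_size = 600 and sub_packet_size = 100, each packet would carry 6 audio frames.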
As stated in the previous paragraph, in an rm file the value of block_align is equal to frame_size (or avg_packet_size), which is the size of one unit of packed frames. That is not exactly the case in cook, however. For cook, block_align == sub_packet_size, which is the size of an actual audio frame. This has to be done manually, though: the rm header only provides the values of the parameters, and the parser has to handle the rest. This means that the parser checks whether a file contains cook audio and, if so, assigns the value of sub_packet_size to block_align.
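That parser check can be sketched in a few lines of C. The struct, its field names and the function name are hypothetical and only mirror the parameters discussed here. The codec is assumed to be identified by its FourCC "cook", and a real parser would fill the fields from the rm header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the rm audio header fields discussed above. */
struct rm_audio_params {
    char     fourcc[4];       /* codec FourCC, e.g. "cook" (not NUL-terminated) */
    uint32_t frame_size;      /* same value as avg_packet_size */
    uint32_t sub_packet_size; /* size of one regular audio frame */
    uint32_t sub_packet_h;    /* packets per scrambling unit */
    uint32_t block_align;     /* as read from the header */
};

/* For cook, overwrite block_align with the size of a single audio frame;
 * for any other codec, leave the header value untouched. */
static void fix_block_align(struct rm_audio_params *p)
{
    assert(p != NULL);
    if (memcmp(p->fourcc, "cook", 4) == 0)
        p->block_align = p->sub_packet_size;
}
```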
The sub_packet_h parameter is described in ffmpeg as a 'descrambling parameter'. After the frames (sub_packets) are packed into packets, the packets are further packed into scrambling units, each containing sub_packet_h packets, not sub_packets. So, for a parser to construct proper audio frames that the decoder can handle, it must first loop through the packets, 'descrambling' them. In this process, the parser has to determine the position of each audio frame in the scrambling unit according to a crazy mathematical formula. Luckily, the ffmpeg developers managed to figure out this formula. The byte offset of a frame within its scrambling unit is (in C integer arithmetic):
offset = sps * (h*x + ((h+1)/2)*(y&1) + (y>>1))
where:
- sps = sub_packet_size;
- h = sub_packet_h;
- x = the position of the current frame in its parent packet;
- y = sub_packet_count; a counter, reset for each scrambling unit, of the packets read so far into that unit (0 to h-1).
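The descrambling step can be sketched as a small C routine. It is modelled on the cook branch of ffmpeg's RealMedia demuxer (libavformat/rmdec.c), which places the x-th sub_packet of packet y at byte offset sps * (h*x + ((h+1)/2)*(y&1) + (y>>1)) in the unit. ffmpeg reads each sub_packet from the file straight into that position. This sketch instead assumes that the packet is already in memory. The function name descramble_packet is hypothetical. The unit buffer is assumed to hold h * frame_size bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one scrambled packet into its scrambling unit, writing each
 * sub_packet to the offset given by the formula above.
 *   unit       : buffer of h * frame_size bytes (one scrambling unit)
 *   packet     : frame_size bytes as read from the file
 *   frame_size : rm frame_size (bytes per packet)
 *   sps        : sub_packet_size (bytes per audio frame)
 *   h          : sub_packet_h (packets per scrambling unit)
 *   y          : index of this packet within the unit (0 .. h-1)
 */
static void descramble_packet(uint8_t *unit, const uint8_t *packet,
                              int frame_size, int sps, int h, int y)
{
    assert(sps > 0 && h > 0 && y >= 0 && y < h);
    /* x walks over the sub_packets of this packet, in file order */
    for (int x = 0; x < frame_size / sps; x++) {
        size_t dst = (size_t)sps * (h * x + ((h + 1) / 2) * (y & 1) + (y >> 1));
        memcpy(unit + dst, packet + (size_t)x * sps, (size_t)sps);
    }
}
```

Calling this for y = 0 .. h-1 fills every sub_packet slot of the unit exactly once. Even-numbered packets go to the first half of each group of h slots, and odd-numbered packets go to the second half.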
Once a scrambling unit has been constructed, its audio frames are sent to the decoder one at a time. The decoder takes an input buffer of uint8_t* and produces an output buffer of int16_t*.
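The hand-off to the decoder can be sketched as slicing the assembled unit into frames. This sketch assumes that frames are passed on in buffer order, block_align bytes at a time, which is how ffmpeg's RealMedia demuxer slices the unit. It also assumes that block_align equals sub_packet_size for cook. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of audio frames in a descrambled scrambling unit. unit_size is
 * h * frame_size bytes; block_align equals sub_packet_size for cook. */
static size_t unit_frame_count(size_t unit_size, size_t block_align)
{
    assert(block_align > 0);
    return unit_size / block_align;
}

/* Pointer to the i-th audio frame (block_align bytes) in the unit, in the
 * order the frames are handed to the decoder, or NULL if i is out of range. */
static const uint8_t *unit_frame(const uint8_t *unit, size_t unit_size,
                                 size_t block_align, size_t i)
{
    if (i >= unit_frame_count(unit_size, block_align))
        return NULL;
    return unit + i * block_align;
}
```

A player loop would call unit_frame for i = 0, 1, ... and pass each returned uint8_t* (block_align bytes) to the cook decoder, which writes decoded int16_t samples to its output buffer. The decoder call itself is left out because its exact signature depends on the implementation.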
Copyright © by the contributing authors.