view doc/AMR-library-API @ 479:616b7ba1135b

doc/AMR-library-API: document AMR-EFR hybrid decoder
author Mychaela Falconia <falcon@freecalypso.org>
date Sun, 19 May 2024 22:22:40 +0000

Libtwamr general usage
======================

The external public interface to Themyscira libtwamr consists of a single
header file <tw_amr.h>; it should be installed in some system include
directory.

The dialect of C used by all Themyscira GSM codec libraries is ANSI C (function
prototypes); the const qualifier is used where appropriate, and the interface
is defined in terms of <stdint.h> types.  <tw_amr.h> includes <stdint.h>.

Public #define constant definitions
===================================

Libtwamr public API header file <tw_amr.h> defines these constants:

#define	AMR_MAX_PRM		57	/* max. num. of params      */
#define	AMR_IETF_MAX_PL		32	/* max bytes in RFC 4867 frame */
#define	AMR_IETF_HDR_LEN	6	/* .amr file header bytes */
#define	AMR_COD_WORDS		250	/* # of words in 3GPP test seq format */

Explanation:

* AMR_MAX_PRM is the maximum number of broken-down speech parameters in the
  highest 12k2 mode of AMR; this definition is needed for struct amr_param_frame
  covered later in this document.

* AMR_IETF_MAX_PL is the size of the output buffer that must be provided for
  amr_frame_to_ietf(), and also most commonly the size of the staging buffer
  which most applications will likely use for gathering the input to
  amr_frame_from_ietf().

* AMR_IETF_HDR_LEN is the size of amr_file_header_magic[] public const datum
  covered later in this document, and this constant will also be needed by any
  application that needs to read or write the fixed header at the beginning of
  .amr files.

* AMR_COD_WORDS is the number of 16-bit words in one encoded frame in 3GPP test
  sequence format (.cod); the public definition is needed for sizing the arrays
  used with amr_frame_to_tseq() and amr_frame_from_tseq() API functions.
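
As an illustration of where AMR_IETF_HDR_LEN comes in, the following minimal
sketch checks whether a buffer begins with the single-channel .amr magic string
"#!AMR\n" defined by RFC 4867.  The names sketch_amr_magic and
sketch_is_amr_header() are hypothetical stand-ins made up for this example; an
application linked against libtwamr would presumably use amr_file_header_magic[]
and AMR_IETF_HDR_LEN from <tw_amr.h> instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for AMR_IETF_HDR_LEN and amr_file_header_magic[] */
#define	SKETCH_AMR_HDR_LEN	6

const uint8_t sketch_amr_magic[SKETCH_AMR_HDR_LEN] =
	{ '#', '!', 'A', 'M', 'R', '\n' };

/* Returns 1 if buf starts with the .amr file header, 0 otherwise. */
int sketch_is_amr_header(const uint8_t *buf, size_t len)
{
	if (len < SKETCH_AMR_HDR_LEN)
		return 0;
	return memcmp(buf, sketch_amr_magic, SKETCH_AMR_HDR_LEN) == 0;
}
```

A file reader would typically fread() the first 6 bytes and pass them to such a
check before entering its frame-reading loop.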

Libtwamr enumerated types
=========================

Libtwamr public API header file <tw_amr.h> defines these 3 enums:

enum RXFrameType {
	RX_SPEECH_GOOD = 0,
	RX_SPEECH_DEGRADED,
	RX_ONSET,
	RX_SPEECH_BAD,
	RX_SID_FIRST,
	RX_SID_UPDATE,
	RX_SID_BAD,
	RX_NO_DATA,
	RX_N_FRAMETYPES		/* number of frame types */
};

enum TXFrameType {
	TX_SPEECH_GOOD = 0,
	TX_SID_FIRST,
	TX_SID_UPDATE,
	TX_NO_DATA,
	TX_SPEECH_DEGRADED,
	TX_SPEECH_BAD,
	TX_SID_BAD,
	TX_ONSET,
	TX_N_FRAMETYPES		/* number of frame types */
};

enum Mode {
	MR475 = 0,
	MR515,
	MR59,
	MR67,
	MR74,
	MR795,
	MR102,
	MR122,
	MRDTX
};

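Rx and Tx frame types are as defined by 3GPP, and the numeric values assigned
to each type are the same as those used by the official TS 26.073 encoder and
decoder programs.  Note that Rx and Tx frame types are NOT equal: the two enums
assign different numeric values to same-named types, e.g., TX_SID_FIRST is 1
but RX_SID_FIRST is 4.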

enum Mode should be self-explanatory: it covers the 8 possible codec modes of
AMR, plus the pseudo-mode of MRDTX used for packing and format manipulation of
SID frames.

State allocation and freeing
============================

In order to use the AMR encoder, you will need to allocate an encoder state
structure, and to use the AMR decoder, you will need to allocate a decoder state
structure.  The necessary state allocation functions are:

struct amr_encoder_state *amr_encoder_create(int dtx, int use_vad2);
struct amr_decoder_state *amr_decoder_create(void);

struct amr_encoder_state and struct amr_decoder_state are opaque structures to
library users: you only get pointers which you remember and pass around, but
<tw_amr.h> does not give you full definitions of these structs.  As a library
user, you don't even get to know the size of these structs, hence the necessary
malloc() operation happens inside amr_encoder_create() and amr_decoder_create().
However, each structure is malloc'ed as a single chunk, hence when you are done
with it, simply call free() to relinquish each encoder or decoder state
instance.

amr_encoder_create() and amr_decoder_create() functions can fail if the malloc()
call inside fails, in which case the two libtwamr functions in question return
NULL.

The dtx argument to amr_encoder_create() is a Boolean flag represented as an
int; it tells the AMR encoder whether it should operate with DTX enabled or
disabled.  (Note that DTX is also called SCR for Source-Controlled Rate in some
AMR specs.)  The use_vad2 argument is another Boolean flag, also represented as
an int; it tells the AMR encoder to use the VAD2 algorithm instead of VAD1.
A novel feature of libtwamr is that both VAD versions are included and
selectable at run time; see AMR-library-desc article for the details.

State reset functions
---------------------

The state of an already-allocated AMR encoder or AMR decoder can be reset at
any time with these functions:

void amr_encoder_reset(struct amr_encoder_state *st, int dtx, int use_vad2);
void amr_decoder_reset(struct amr_decoder_state *st);

Note that the two extra arguments to amr_encoder_reset() are the same as the
arguments to amr_encoder_create() - the reset operation is complete.
amr_encoder_create() is a wrapper around malloc() followed by
amr_encoder_reset(), and amr_decoder_create() is a wrapper around malloc()
followed by amr_decoder_reset().

Using the AMR encoder
=====================

To encode one 20 ms audio frame per AMR, call amr_encode_frame():

void amr_encode_frame(struct amr_encoder_state *st, enum Mode mode,
			const int16_t *pcm, struct amr_param_frame *frame);

You need to provide an encoder state structure allocated earlier with
amr_encoder_create(), the selection of which codec mode to use, and a block of
160 linear PCM samples.  Only modes MR475 through MR122 are valid for 'mode'
argument to amr_encode_frame(); MRDTX is not allowed in this context.

The output from amr_encode_frame() is written into this structure:

struct amr_param_frame {
	uint8_t	type;
	uint8_t	mode;
	int16_t	param[AMR_MAX_PRM];
};

This structure is public, but it is defined by libtwamr (not by any external
standard), and it is generally intended to be an intermediate stage before
output encoding.  Library functions exist for generating 3 output formats: 3GPP
AMR test sequence format, IETF RFC 4867 format, and AMR-EFR hybrid.

Native encoder output
---------------------

The output structure is filled as follows:

type:	Set to one of TX_SPEECH_GOOD, TX_SID_FIRST, TX_SID_UPDATE or TX_NO_DATA,
	as defined by 3GPP.  The last 3 are possible only when the encoder
	operates with DTX enabled.

mode:	One of MR475 through MR122, same as the 'mode' argument to
	amr_encode_frame().

param:	Array of codec parameters, from 17 to 57 of them for modes MR475 through
	MR122 in the case of TX_SPEECH_GOOD output, or 5 parameters for MRDTX
	in the case of TX_SID_FIRST, TX_SID_UPDATE or TX_NO_DATA DTX output.
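
To make the "17 to 57" range concrete, here is a small lookup sketch giving the
number of valid param[] entries for each mode.  The end points (17, 57 and 5)
are stated above; the intermediate counts are taken from the 3GPP reference
code as an assumption of this example and should be verified before being
relied upon.  The function name sketch_param_count() is hypothetical and not
part of libtwamr.

```c
#include <stdint.h>

/*
 * Number of codec parameters per mode, indexed by enum Mode value
 * (MR475 = 0 ... MR122 = 7, MRDTX = 8).  Intermediate values are
 * assumed from the 3GPP reference code, not from libtwamr docs.
 */
const uint8_t sketch_prm_count[9] = { 17, 19, 19, 19, 19, 23, 39, 57, 5 };

/* Returns the parameter count for the given mode, or -1 if out of range. */
int sketch_param_count(unsigned mode)
{
	if (mode > 8)
		return -1;
	return sketch_prm_count[mode];
}
```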

3GPP AMR test sequence output
-----------------------------

The following function exists to convert the above encoder output into the test
sequence format which 3GPP defined for AMR, the insanely inefficient one with
250 (AMR_COD_WORDS) 16-bit words per frame:

void amr_frame_to_tseq(const struct amr_param_frame *frame, uint16_t *cod);

This function allows libtwamr encoder to be tested for correctness against the
set of test sequences in 3GPP TS 26.074.  The output is in the local machine's
native byte order.
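
Because the words come out in native byte order, an application that needs a
fixed on-disk byte order has to serialize them itself.  The following sketch
shows one way to do so, picking little-endian purely as an assumption of this
example; sketch_cod_to_le_bytes() is a hypothetical helper, not a libtwamr
function.

```c
#include <stdint.h>

#define	SKETCH_COD_WORDS	250	/* same value as AMR_COD_WORDS */

/*
 * Serialize one frame of native-order 16-bit words into
 * 2 * 250 = 500 bytes, low byte first, regardless of host endianness.
 */
void sketch_cod_to_le_bytes(const uint16_t *cod, uint8_t *out)
{
	unsigned i;

	for (i = 0; i < SKETCH_COD_WORDS; i++) {
		out[2 * i] = cod[i] & 0xFF;	/* low byte */
		out[2 * i + 1] = cod[i] >> 8;	/* high byte */
	}
}
```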

RFC 4867 output
---------------

To turn libtwamr encoder output into an octet-aligned RFC 4867 single-frame
payload or storage-format frame (ToC octet followed by speech or SID data, but
no CMR payload header), call this function:

unsigned amr_frame_to_ietf(const struct amr_param_frame *frame, uint8_t *bytes);

The output buffer must have room for up to 32 bytes (AMR_IETF_MAX_PL); the
return value is the actual number of bytes used.  The shortest possible output
is 1 byte in the case of TX_NO_DATA; the longest possible output is 32 bytes in
the case of TX_SPEECH_GOOD, mode MR122.

Additional notes regarding output conversion functions
------------------------------------------------------

The struct amr_param_frame that is input to amr_frame_to_ietf() or
amr_frame_to_tseq() is expected to be a valid output from amr_encode_frame().
These output conversion functions contain no guards against invalid input
(anything that cannot occur in the output from amr_encode_frame()), and are
thus allowed to segfault or corrupt memory etc if fed such invalid input.

This lack of guard is justified in the present instance because struct
amr_param_frame is not intended to ever function as an external interface to
untrusted entities; instead, this struct is intended to be only an intermediate
staging buffer between the call to amr_encode_frame() and an immediately
following call to one of the provided output conversion functions.

AMR-EFR hybrid encoder
======================

To use libtwamr as an AMR-EFR hybrid encoder, follow these constraints:

* 'dtx' argument must be 0 (no DTX) on the call to amr_encoder_create() or
  amr_encoder_reset() that establishes the state for the encoder session.

* 'mode' argument to amr_encode_frame() must be MR122 on every frame.

After getting struct amr_param_frame out of amr_encode_frame(), call one of
these functions to generate the correct EFR decoder homing frame (DHF) under
the right conditions:

void amr_dhf_subst_efr(struct amr_param_frame *frame);
void amr_dhf_subst_efr2(struct amr_param_frame *frame, const int16_t *pcm);

Both functions check if the encoded frame is MR122 DHF (type equals
TX_SPEECH_GOOD, mode equals MR122, param array equals the fixed bit pattern of
MR122 DHF), and if so, overwrite param[] array in the structure with the
different bit pattern of EFR DHF.  The difference between the two functions is
that amr_dhf_subst_efr() performs the just-described substitution whenever the
encoded frame matches, whereas amr_dhf_subst_efr2() applies it only if the PCM
input is also the encoder homing frame (EHF).  The latter function matches the
observed behavior of T-Mobile USA, but perhaps some others implemented the
simpler logic equivalent to our first function.

After this transformation, call EFR_params2frame() from libgsmefr (see
EFR-library-API) with param[] array in struct amr_param_frame as input.

Using the AMR decoder: native interface
=======================================

The internal native form of the stateful AMR decoder engine is:

void amr_decode_frame(struct amr_decoder_state *st,
			const struct amr_param_frame *frame, int16_t *pcm);

The input frame is given as struct amr_param_frame, same structure as is used
for the output of the encoder.  However, the required input to
amr_decode_frame() is different from amr_encode_frame() output:

* The 'type' member of the struct must be a code from enum RXFrameType, *not*
  enum TXFrameType!

* All 3GPP-defined Rx frame types are allowed.

* The 'mode' member of the input struct is ignored if the Rx frame type is
  RX_NO_DATA, but must be valid for every other frame type.

If frame->type is not RX_NO_DATA, frame->mode is interpreted as follows:

* The 3 least significant bits (mask 0x07) are taken to indicate the codec mode
  used for this frame;

* The most significant bit (mask 0x80) has meaning only if the mode is MR122
  and frame->type is RX_SPEECH_GOOD.  Under these conditions, if this bit is
  set, the DHF check is modified to match against the bit pattern of EFR DHF
  instead of regular MR122 DHF.
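
The interpretation of frame->mode described above can be sketched as follows.
This is only an illustration of the bit layout, not the library's own code; the
macro and function names are hypothetical, and the numeric values of MR122 (7)
and RX_SPEECH_GOOD (0) are copied from the enum definitions earlier in this
document.

```c
#include <stdint.h>

#define	SKETCH_MODE_MASK	0x07	/* codec mode bits */
#define	SKETCH_EFR_DHF_FLAG	0x80	/* match EFR DHF instead of MR122 DHF */
#define	SKETCH_MR122		7	/* enum Mode value of MR122 */
#define	SKETCH_RX_SPEECH_GOOD	0	/* enum RXFrameType value */

/* Extract the codec mode from the low 3 bits. */
unsigned sketch_codec_mode(uint8_t mode_byte)
{
	return mode_byte & SKETCH_MODE_MASK;
}

/*
 * The 0x80 flag is meaningful only for MR122 frames of type
 * RX_SPEECH_GOOD; in every other case it has no effect.
 */
int sketch_wants_efr_dhf(uint8_t mode_byte, unsigned rx_type)
{
	return (mode_byte & SKETCH_EFR_DHF_FLAG) != 0 &&
		sketch_codec_mode(mode_byte) == SKETCH_MR122 &&
		rx_type == SKETCH_RX_SPEECH_GOOD;
}
```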

amr_decode_frame() contains no guards against invalid (undefined) frame types
in frame->type, or against any of the codec parameters being out of range.
struct amr_param_frame coming into this function must come only from trusted
sources inside the application program, usually from one of the provided input
format conversion functions.

Decoder homing frame check
--------------------------

The definition of AMR decoder per 3GPP includes two mandatory checks for the
possibility of the input frame being one of the defined per-mode decoder homing
frames (DHFs): one check at the beginning of the decoder, checking only up to
the first subframe and acting only when the current state is homed, and the
second check at the end of the decoder, checking all parameters (the full frame)
and resetting the decoder on match.

This DHF check operation, called from those two places in the stateful decoder
as just described, is factored out into its own function that is exported as
part of the public API:

int amr_check_dhf(const struct amr_param_frame *frame, int first_sub_only);

struct amr_param_frame must be prepared for amr_check_dhf() exactly as it would
be for amr_decode_frame(); the latter function in fact calls amr_check_dhf() on
its input.  The Boolean flag argument (first_sub_only) tells the function to
check only to the end of the first subframe if nonzero, or check the entire
frame if zero.  The return value is 1 if the input matches DHF, 0 otherwise.

frame->type must be RX_SPEECH_GOOD for the frame to be a DHF candidate, and the
interpretation of frame->mode, including the special mode of matching against
EFR DHF, is implemented in this function.

Using the AMR decoder: input preparation
========================================

Stateless utility functions are provided for preparing decoder inputs,
converting from RFC 4867 or 3GPP test sequence format into the internal form
described above.

Decoding RFC 4867 input
-----------------------

If the entire RFC 4867 frame (read from .amr storage format or received in RTP
as an octet-aligned payload) is already in memory, decode it with this function:

int amr_frame_from_ietf(const uint8_t *bytes, struct amr_param_frame *frame);

The string of bytes input to this function must begin with the ToC octet.  Out
of this ToC octet, only bits falling under the mask 0x7C (FT and Q bit fields)
are checked.  The remaining 3 bits are not checked: in the case of .amr storage
format, RFC 4867 describes these bits as "padding" (P bits) and stipulates that
they MUST be ignored by readers.  However, in the case of RTP payloads received
in a live session, the uppermost bit of the ToC octet becomes F rather than P,
and it is the responsibility of the application to ensure that F=0: multiframe
payloads are NOT supported.
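
For reference, the ToC octet fields discussed above can be picked apart as in
the following sketch, which an RTP application might use to reject payloads
with F=1 before calling amr_frame_from_ietf().  The helper names are
hypothetical and exist only in this example; the bit positions follow the
octet-aligned ToC layout of RFC 4867.

```c
#include <stdint.h>

/*
 * Octet-aligned ToC layout:
 * bit 7 = F (or P in storage format), bits 6..3 = FT,
 * bit 2 = Q, bits 1..0 = padding.
 */
unsigned sketch_toc_f(uint8_t toc)
{
	return (toc >> 7) & 0x01;
}

unsigned sketch_toc_ft(uint8_t toc)
{
	return (toc >> 3) & 0x0F;
}

unsigned sketch_toc_q(uint8_t toc)
{
	return (toc >> 2) & 0x01;
}

/* Returns 1 if no further frames follow in this RTP payload (F=0). */
int sketch_toc_is_single_frame(uint8_t toc)
{
	return sketch_toc_f(toc) == 0;
}
```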

FT in the input frame may be [0,7] (MR475 through MR122), 8 (MRDTX) or 15
(AMR_FT_NODATA).  In all of these cases amr_frame_from_ietf() succeeds and
returns 0 to indicate so; the resulting struct amr_param_frame is then good to
be passed to amr_decode_frame().  On the other hand, if FT falls into the
invalid range of [9,14], amr_frame_from_ietf() returns -1 to indicate invalid
input.

Applications that read from a .amr file will need to read just the ToC (aka
frame header) octet and decode it to determine how many additional octets need
to be read to absorb one frame.  Similarly, RTP applications may need to
validate incoming payloads by cross-checking between the FT indicated in the
ToC octet and the received payload length.  Both applications can use this
function:

int amr_ietf_grok_first_octet(uint8_t fo);

The argument is the first octet, and the function only considers the FT field
thereof.  The return value is:

-1 for invalid FT [9,14]
0 for FT=15 (the ToC octet is the entirety of the payload)
>0 for valid FT [0,8], indicating the number of additional bytes to be read
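
The return values listed above follow directly from the number of payload bits
in each frame type, rounded up to whole octets.  The following standalone
sketch reproduces that arithmetic; it is not the library's implementation, the
name sketch_grok_first_octet() is hypothetical, and the per-FT bit counts are
taken from RFC 4867 as an assumption of this example.

```c
#include <stdint.h>

/*
 * Payload bits per frame type: FT 0..7 are the speech modes
 * MR475 through MR122, FT 8 is the AMR SID frame.
 */
const uint16_t sketch_ft_bits[9] =
	{ 95, 103, 118, 134, 148, 159, 204, 244, 39 };

/* Same return convention as described above: -1, 0 or byte count. */
int sketch_grok_first_octet(uint8_t fo)
{
	unsigned ft = (fo >> 3) & 0x0F;

	if (ft == 15)
		return 0;	/* no data: ToC octet only */
	if (ft > 8)
		return -1;	/* invalid FT [9,14] */
	return (sketch_ft_bits[ft] + 7) / 8;
}
```

With this table, the longest frame (FT=7) needs 31 additional bytes, which
together with the ToC octet gives the 32 bytes of AMR_IETF_MAX_PL.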

Decoding 3GPP test sequence input
---------------------------------

To decode a frame from 3GPP .cod file format, call this function:

int amr_frame_from_tseq(const uint16_t *cod, int use_rxtype,
			struct amr_param_frame *frame);

The argument 'use_rxtype' should be 1 if the input uses Rx frame types (enum
RXFrameType) or 0 if it uses Tx frame types (enum TXFrameType); this argument
directly corresponds to -rxframetype command line option in the reference
decoder program from 3GPP.

Unlike raw amr_decode_frame(), amr_frame_from_tseq() does guard against invalid
input.  The return value from this function is:

0 means the input was good and the output is good to pass to amr_decode_frame();
-1 means the frame type field in the input is invalid;
-2 means the mode field in the input is invalid.

Frame type conversion
---------------------

The operation of mapping from enum TXFrameType to enum RXFrameType, optionally
but very commonly invoked from amr_frame_from_tseq(), is factored out into its
own function, exported as part of the public API:

int amr_txtype_to_rxtype(enum TXFrameType tx_type, enum RXFrameType *rx_type);

The return value is 0 if tx_type is valid and *rx_type has been filled
accordingly, or -1 if tx_type is invalid.
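
Since the two enums number their members differently, the conversion cannot be
a simple cast.  The sketch below shows what such a mapping could look like,
assuming that each Tx type maps to the same-named Rx type as in the 3GPP
reference decoder; the actual libtwamr implementation may differ in detail.
The enums are repeated from earlier in this document so that the example
stands alone, and sketch_tx_to_rx() is a hypothetical name with a simplified
return convention (Rx code or -1).

```c
enum RXFrameType {
	RX_SPEECH_GOOD = 0, RX_SPEECH_DEGRADED, RX_ONSET, RX_SPEECH_BAD,
	RX_SID_FIRST, RX_SID_UPDATE, RX_SID_BAD, RX_NO_DATA
};

enum TXFrameType {
	TX_SPEECH_GOOD = 0, TX_SID_FIRST, TX_SID_UPDATE, TX_NO_DATA,
	TX_SPEECH_DEGRADED, TX_SPEECH_BAD, TX_SID_BAD, TX_ONSET
};

/* Returns the same-named Rx frame type, or -1 if tx_type is invalid. */
int sketch_tx_to_rx(int tx_type)
{
	switch (tx_type) {
	case TX_SPEECH_GOOD:		return RX_SPEECH_GOOD;
	case TX_SID_FIRST:		return RX_SID_FIRST;
	case TX_SID_UPDATE:		return RX_SID_UPDATE;
	case TX_NO_DATA:		return RX_NO_DATA;
	case TX_SPEECH_DEGRADED:	return RX_SPEECH_DEGRADED;
	case TX_SPEECH_BAD:		return RX_SPEECH_BAD;
	case TX_SID_BAD:		return RX_SID_BAD;
	case TX_ONSET:			return RX_ONSET;
	default:			return -1;
	}
}
```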

AMR-EFR hybrid decoder
======================

To use libtwamr as an AMR-EFR hybrid decoder, follow these steps:

* Turn the input frame from EFR RTP format into array-of-parameters form with
  libgsmefr function EFR_frame2params(), writing the output into the param[]
  array in struct amr_param_frame.

* Set 'type' in the struct to RX_SPEECH_GOOD for good frames, RX_SPEECH_BAD for
  BFI with payload bits present, or RX_NO_DATA for BFI without payload.

* Set 'mode' to 0x87 always, indicating a variation of MR122 with EFR DHF
  instead of the different native MR122 DHF.

* Call amr_decode_frame() with this input.
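
The choice of 'type' and 'mode' in the steps above can be sketched as a small
helper.  The inputs bfi and have_payload are hypothetical flags that the
application would derive from its own EFR RTP input handling, and the names
used here are made up for this example; only the resulting values come from
the description above.

```c
/* Rx frame type codes, repeated from the enum earlier in this document */
enum RXFrameType {
	RX_SPEECH_GOOD = 0, RX_SPEECH_DEGRADED, RX_ONSET, RX_SPEECH_BAD,
	RX_SID_FIRST, RX_SID_UPDATE, RX_SID_BAD, RX_NO_DATA
};

/* MR122 (7) with the EFR DHF flag (0x80) set */
#define	SKETCH_EFR_HYBRID_MODE	0x87

/*
 * bfi: nonzero if the frame is marked bad;
 * have_payload: nonzero if payload bits are present.
 * Returns the value to store in the 'type' member.
 */
int sketch_efr_rx_type(int bfi, int have_payload)
{
	if (!bfi)
		return RX_SPEECH_GOOD;
	return have_payload ? RX_SPEECH_BAD : RX_NO_DATA;
}
```

The application would store this return value in frame->type and
SKETCH_EFR_HYBRID_MODE in frame->mode before calling amr_decode_frame().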

Fundamental limitation: the AMR decoder in libtwamr, derived from 3GPP AMR
reference source and only minimally extended to support EFR DHF, does not
support EFR SID frames.  Therefore, the option of AMR-EFR hybrid emulation via
libtwamr is limited to lab experiments where the input to the decoder can be
ensured to be SID-free, and is not suitable for production use.  See
AMR-EFR-philosophy article for more information.