view doc/TFO-xform/HRv1 @ 56:b32b644b7d96

d144/nokia-tcsm2-atrau.bin: captured A-TRAU output from Nokia TCSM2, fed with ul-input from Ater
author Mychaela Falconia <falcon@freecalypso.org>
date Wed, 25 Sep 2024 07:42:04 +0000

HRv1: relation between regular end decoder and TFO transform
============================================================

The reference decoder source published by ETSI in GSM 06.06 exhibits a nearly
modular design: the Rx DTX handler front-end is almost a separable piece.
Breaking it down more precisely, we can make these observations:

0) Most aspects of bad frame handling and comfort noise generation are done by
   generating new coded speech parameters, such that the output of those
   algorithms can be packaged into new HRv1 codec frames to be sent to a distant
   decoder.  There are only two exceptions to this modularity:

1) Handling of unreliable speech frames (BFI=0 UFI=1 in speech rather than CN
   state) has a modular and a non-modular aspect:

   1a) Modular aspect: if the R0 increment from the last good frame to the
       unreliable frame exceeds a certain threshold, UFI is turned into BFI,
       which is then handled in a fully modular fashion.

   1b) Non-modular aspect: if the R0 increment does not meet the threshold for
       turning UFI into BFI but meets another slightly lower threshold, a flag
       is set that is passed into the guts of the speech decoder.  That flag
       effects speech muting on the decoder output level.

2) GSM 06.22 section 6.2 (Comfort noise generation and updating) says in the
   very last sentence:

   "When updating the comfort noise parameters (frame energy and LPC
    coefficients), these parameters shall be interpolated over the SID update
    period to obtain smooth transitions."

   Note the change in language: the corresponding spec for FRv1 says "should
   preferably", but the HRv1 spec says "shall".  Furthermore, the bit-exact
   implementation in the reference C code is considered normative in this
   aspect, and is exercised by the test sequences of GSM 06.07.

   This CN interpolation aspect is non-modular: R0 and the set of LPC
   coefficients are decoded from bit parameters into linear form when CN frames
   (initial and updates) are received, interpolation is done on this linear
   form, and the interpolated values are passed to the main body of the speech
   decoder.
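
The UFI decision of item 1 above can be sketched as follows.  This is only an
illustration of the three possible outcomes, not a transcription of the
reference C code: the function name, the enum and both threshold parameters
are hypothetical, and the exact comparison sense (>= versus >) is an
assumption.

```python
from enum import Enum


class UfiAction(Enum):
    ACCEPT = "accept"        # frame is used as if it were good
    MUTE_FLAG = "mute_flag"  # item 1b: flag handed to the decoder internals
    TREAT_AS_BFI = "bfi"     # item 1a: UFI is upgraded to BFI


def classify_ufi_frame(r0_last_good, r0_new, bfi_threshold, mute_threshold):
    """Decide what to do with a BFI=0 UFI=1 frame received in speech state.

    Both thresholds are hypothetical parameters; all that is assumed here
    is that the muting threshold is not higher than the BFI threshold.
    """
    if mute_threshold > bfi_threshold:
        raise ValueError("mute_threshold must not exceed bfi_threshold")
    increment = r0_new - r0_last_good
    if increment >= bfi_threshold:
        return UfiAction.TREAT_AS_BFI
    if increment >= mute_threshold:
        return UfiAction.MUTE_FLAG
    return UfiAction.ACCEPT


# The numeric thresholds in this call are placeholders, not spec values.
action = classify_ufi_frame(r0_last_good=2, r0_new=31,
                            bfi_threshold=10, mute_threshold=6)
```

A parameter-level transform can act on TREAT_AS_BFI by itself, but it has
nowhere to deliver MUTE_FLAG, which is what makes aspect 1b non-modular.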

Based on these observations, we can conclude that if we wish to detach this
reference Rx DTX handler for HRv1 from the reference decoder and make it into
an implementation of TFO transform for this codec, we have to solve two
problems:

1) Decide how to handle those UFI frames that aren't being turned into BFI;

2) Decide how to handle R0 and LPC parameters during CN insertion.

Nokia TCSM2 TRAU implementation
===============================

Now that we have a working historical bank-of-TRAUs apparatus in our lab, let's
take a look at how this vendor (Nokia) implemented the TFO transform for HRv1
in their TRAU.  Here are our findings:

* Handling of BFI=1 frames in speech state (not in DTX) exhibits a
  simplification relative to GSM 06.06 reference code.  The reference code
  checks to see if the last saved frame and the received errored frame have the
  same voiced vs unvoiced mode: if this mode matches, codevector parameters are
  taken from the errored frame, otherwise the last saved frame is regurgitated
  without taking any bits from the errored frame.  Nokia's TFO transform always
  does the latter (no bits are taken from the errored frame) irrespective of
  voiced vs unvoiced mode matching or not.

* Aside from this just-described simplification, all other aspects of BFI=1
  handling for speech frames appear to match the reference code.

* UFI handling appears to have been taken out altogether: even the part that
  "upgrades" UFI to BFI when R0 increment is huge appears to have been omitted.
  I fed a test sequence from TFO side that has a good speech frame with R0=2
  followed by a UFI frame with R0=31, and the TRAU happily passed the latter
  frame (now treated as perfectly good) to the DL output.

* Comfort noise generation (DTXd=0) is done exactly as the reference code would
  do it, except that neither R0 nor LPC parameters are interpolated.  During
  each CN output interval between SID updates, R0 and LPC parameters in every
  emitted CN frame are exactly equal to those received in the most recent SID
  frame, as simple as that.  When a new SID update comes in, the change in
  emitted R0 and LPC is abrupt.

* The lost SID criterion for CN muting appears to be slightly different between
  Nokia's TFO implementation and my reading of the spec and the reference C
  code.  My interpretation of GSM 06.22 spec sections 5.2.3 and 5.2.4 is that
  unlike FR and EFR, in the case of the HR codec the second lost SID (second
  occurrence of BFI instead of SID update in TAF position) does _not_ trigger
  CN muting; instead this muting is supposed to kick in on the _third_ lost SID
  occurrence.  (The difference in the spec was likely motivated by TAF positions
  occurring every 240 ms with HR instead of every 480 ms with FR & EFR.)  My
  reading of the reference C code agrees with my reading of the spec - yet
  Nokia's TFO implementation initiates CN muting in the frame following the
  second lost SID, not third.

* Aside from the criterion for its initiation, the actual CN muting logic
  behaves exactly like the reference C code: R0 is decremented by 2 on each
  output frame following the TAF that initiates this sequence, and once R0
  reaches 0, it stays there while this zero-magnitude CN output continues
  indefinitely.

* With DTXd=1 CN output is replaced with repeated retransmission of the same
  SID whose parameters would have been used for non-interpolated CN with DTXd=0,
  which also agrees with the rules of GSM 08.62 section 8.2.2 paragraph 2.

* CN muting with DTXd=1 is implemented poorly.  The TRAU emits SID frames with
  R0 decrementing by 2 on each frame, just as it does for generated CN output
  that's in the process of being slowly muted, but this design is a poor
  choice: because the BTS will only transmit one of every 12 SID update frames
  and the TRAU has no way of knowing which SID will be transmitted, slow
  decrement cadence on SID frames themselves (not on CN output) makes no sense.
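
The CN behaviour observed above with DTXd=0 (hold the last SID parameters
without interpolation, then mute after lost SIDs) can be sketched as a small
state machine.  This is a model of the observations, not of Nokia's actual
code: the class name, the event names and the lost_sid_limit parameter are
hypothetical, the LPC parameters are treated as an opaque tuple, and the
assumption that a fresh valid SID ends the muting is mine.

```python
class HoldLastSidCn:
    """Emit R0 and LPC for each CN output frame, without interpolation."""

    def __init__(self, r0, lpc, lost_sid_limit=2):
        self.r0 = r0      # R0 from the most recent SID frame
        self.lpc = lpc    # LPC parameters from the same SID, opaque here
        # 2 models the observed Nokia behaviour; 3 would model the reading
        # of GSM 06.22 given in the text.
        self.lost_sid_limit = lost_sid_limit
        self.lost_sids = 0
        self.muting = False

    def next_frame(self, event, sid=None):
        """event: "sid" (valid update), "lost_sid" (BFI at TAF), "none"."""
        # Muting acts on frames following the TAF that initiated it:
        # R0 drops by 2 per output frame and then stays at 0.
        if self.muting:
            self.r0 = max(self.r0 - 2, 0)
        if event == "sid":
            self.r0, self.lpc = sid  # abrupt change, no smoothing
            self.lost_sids = 0
            self.muting = False      # assumption: a good SID ends muting
        elif event == "lost_sid":
            self.lost_sids += 1
            if self.lost_sids >= self.lost_sid_limit:
                self.muting = True
        return self.r0, self.lpc


cn = HoldLastSidCn(r0=5, lpc=(1, 2, 3))
events = ["none", "lost_sid", "none", "lost_sid",
          "none", "none", "none", "none"]
r0_trace = [cn.next_frame(ev)[0] for ev in events]
```

In this run R0 stays at 5 up to and including the frame of the second lost
SID, then steps down by 2 per frame until it reaches 0 and remains there.
Constructing the object with lost_sid_limit=3 instead delays the start of
muting to the third lost SID.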

Thoughts for Themyscira implementation
======================================

Prior to getting Nokia TCSM2 working in our lab and being able to experiment
with this TRAU, when I was contemplating the idea of potentially implementing
TFO transform for HRv1 in Themyscira libraries, my main trepidation was how to
produce comfort noise in the form of "speech" parameter output.  For endpoint
decoders GSM 06.22 prescribes a bit-exact algorithm with interpolation, but
that smoothly interpolated CN cannot be readily expressed in terms of parameter
bits that can be packed into a new HRv1 codec frame.  I thought about
requantizing the interpolated LPC reflection coefficients on every CN output
frame, using the same computationally intensive vector quantization algorithm
as in speech encoding - but because I am not an expert in codec design, it is
not obvious to me whether or not such an approach would produce good results.

However, seeing that Nokia got away with simply passing R0 and LPC parameters
along from incoming SID frames to CN output without any interpolation or other
transformation gives us a huge confidence boost - if Nokia did it, so can we!
This approach is of course simple, and lends itself readily to elegant
implementation.

Seeing that Nokia got away with effectively discarding UFI in their TFO
transform is also a confidence boost - once again if Nokia did it, so can we.
I plan on keeping the logic that "upgrades" UFI to BFI under certain conditions
(not sure why Nokia omitted it), but the effect of potentially muting speech in
the guts of the decoder (past parameter-level manipulation) is not really
feasible to implement in a TFO transform.

Finally, regarding the logic that takes codevector parameters from errored
(BFI) frames when the voicing mode matches between the last saved frame and the
errored frame, the logic that exists in the reference C code but not in Nokia's
TFO transform: I plan on keeping this logic in our version, but Nokia's approach
will come in handy for handling BFI-no-data frames, a condition that does not
exist in TDM-based Abis transport or in TFO, but does unfortunately exist in
IP-based GSM RAN.
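
A minimal sketch of the codevector selection described in the last paragraph,
with both behaviours in one function.  It covers only the choice of where the
codevector parameters come from and none of the other BFI handling steps; the
function name, the dict keys and the use_errored_codevectors switch are
hypothetical, and a BFI-no-data condition is represented by passing None.

```python
def substitute_bad_frame(last_saved, errored, use_errored_codevectors=True):
    """Build the parameter set that replaces a BFI=1 speech frame.

    last_saved: dict for the last saved frame, with keys "voiced" and
                "codevectors" (any other keys are copied unchanged)
    errored:    dict for the received errored frame, or None when no bits
                at all were received (BFI-no-data)
    use_errored_codevectors: True models the reference-code behaviour,
                False models the simplification observed in Nokia's TRAU
    """
    out = dict(last_saved)  # start by repeating the last saved frame
    if (use_errored_codevectors
            and errored is not None
            and errored["voiced"] == last_saved["voiced"]):
        # Voicing mode matches: take codevector parameters from the
        # errored frame, keep everything else from the saved frame.
        out["codevectors"] = errored["codevectors"]
    return out


# Placeholder parameter values, for illustration only.
saved = {"voiced": True, "r0": 10, "lpc": (1, 2), "codevectors": (3, 4)}
bad = {"voiced": True, "r0": 31, "lpc": (7, 7), "codevectors": (9, 9)}

mixed = substitute_bad_frame(saved, bad)
repeated = substitute_bad_frame(saved, None)
```

With a single function of this shape, the BFI-no-data case falls through to
plain repetition of the last saved frame without needing a separate code path.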