diff doc/TFO-xform/HRv1 @ 35:0979407719f0

doc/TFO-xform/HRv1: article written
author Mychaela Falconia <falcon@freecalypso.org>
date Mon, 02 Sep 2024 07:32:09 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/TFO-xform/HRv1	Mon Sep 02 07:32:09 2024 +0000
@@ -0,0 +1,147 @@
+HRv1: relation between regular end decoder and TFO transform
+============================================================
+
+The reference decoder source published by ETSI in GSM 06.06 exhibits a nearly
+modular design: the Rx DTX handler front-end is almost a separable piece.
+Breaking it down more precisely, we can make these observations:
+
+0) Most aspects of bad frame handling and comfort noise generation are done by
+   generating new coded speech parameters, such that the output of those
+   algorithms can be packaged into new HRv1 codec frames to be sent to a distant
+   decoder.  There are only two exceptions to this modularity:
+
+1) Handling of unreliable speech frames (BFI=0 UFI=1 in speech rather than CN
+   state) has a modular and a non-modular aspect:
+
+   1a) Modular aspect: if the R0 increment from the last good frame to the
+       unreliable frame exceeds a certain threshold, UFI is turned into BFI,
+       which is then handled in a fully modular fashion.
+
+   1b) Non-modular aspect: if the R0 increment does not meet the threshold for
+       turning UFI into BFI but meets another slightly lower threshold, a flag
+       is set that is passed into the guts of the speech decoder.  That flag
+       effects speech muting on the decoder output level.
+
+2) GSM 06.22 section 6.2 (Comfort noise generation and updating) says in the
+   very last sentence:
+
+   "When updating the comfort noise parameters (frame energy and LPC
+    coefficients), these parameters shall be interpolated over the SID update
+    period to obtain smooth transitions."
+
+   Note the change in language: the corresponding spec for FRv1 says "should
+   preferably", but the HRv1 spec says "shall".  Furthermore, the bit-exact
+   implementation in the reference C code is considered normative in this
+   aspect, and is exercised by the test sequences of GSM 06.07.
+
+   This CN interpolation aspect is non-modular: R0 and the set of LPC
+   coefficients are decoded from bit parameters into linear form when CN frames
+   (initial and updates) are received, interpolation is done on this linear
+   form, and the interpolated values are passed to the main body of the speech
+   decoder.
+
+Based on these observations, we can conclude that if we wish to detach this
+reference Rx DTX handler for HRv1 from the reference decoder and make it into
+an implementation of TFO transform for this codec, we have to solve two
+problems:
+
+1) Decide how to handle those UFI frames that aren't being turned into BFI;
+
+2) Decide how to handle R0 and LPC parameters during CN insertion.
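+
+As a rough illustration of observation 1, the R0-increment check could be
+sketched in C as follows.  All names here are hypothetical, and the two
+thresholds are left as parameters because their values are not given in this
+article; whether each comparison is strict is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; not taken from the GSM 06.06 reference code. */
enum ufi_action {
	UFI_PASS,	/* treat the frame as good */
	UFI_MUTE_FLAG,	/* case 1b: flag passed into the speech decoder */
	UFI_TO_BFI,	/* case 1a: handle the frame as BFI */
};

/*
 * Classify an unreliable frame (BFI=0 UFI=1, speech state) by how much
 * its R0 rose relative to the last good frame.  Threshold values are
 * supplied by the caller because this sketch does not know them.
 */
enum ufi_action classify_ufi(uint8_t last_good_r0, uint8_t new_r0,
			     int bfi_thresh, int mute_thresh)
{
	int incr = (int) new_r0 - (int) last_good_r0;

	/* the text describes the muting threshold as the lower one */
	assert(mute_thresh < bfi_thresh);

	if (incr > bfi_thresh)
		return UFI_TO_BFI;
	if (incr > mute_thresh)
		return UFI_MUTE_FLAG;
	return UFI_PASS;
}
```

+Only the UFI_TO_BFI outcome can be acted upon at the parameter level; the
+UFI_MUTE_FLAG outcome is the non-modular part that a TFO transform has to
+decide how to handle.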
+
+Nokia TCSM2 TRAU implementation
+===============================
+
+Now that we have a working historical bank-of-TRAUs apparatus in our lab, let's
+take a look at how this vendor (Nokia) implemented the TFO transform for HRv1
+in their TRAU.  Here are our findings:
+
+* Handling of BFI=1 frames in speech state (not in DTX) exhibits a
+  simplification relative to GSM 06.06 reference code.  The reference code
+  checks to see if the last saved frame and the received errored frame have the
+  same voiced vs unvoiced mode: if this mode matches, codevector parameters are
+  taken from the errored frame, otherwise the last saved frame is regurgitated
+  without taking any bits from the errored frame.  Nokia's TFO transform always
+  does the latter (no bits are taken from the errored frame) irrespective of
+  voiced vs unvoiced mode matching or not.
+
+* Aside from this just-described simplification, all other aspects of BFI=1
+  handling for speech frames appear to match the reference code.
+
+* UFI handling appears to have been taken out altogether: even the part that
+  "upgrades" UFI to BFI when the R0 increment is huge appears to have been
+  omitted.  I fed a test sequence from the TFO side that has a good speech
+  frame with R0=2 followed by a UFI frame with R0=31, and the TRAU happily
+  passed the latter frame (now treated as perfectly good) to the DL output.
+
+* Comfort noise generation (DTXd=0) is done exactly as the reference code would
+  do it, except that neither R0 nor LPC parameters are interpolated.  During
+  each CN output interval between SID updates, R0 and LPC parameters in every
+  emitted CN frame are exactly equal to those received in the most recent SID
+  frame, as simple as that.  When a new SID update comes in, the change in
+  emitted R0 and LPC is abrupt.
+
+* The lost SID criterion for CN muting appears to be slightly different between
+  Nokia's TFO implementation and my reading of the spec and the reference C
+  code.  My interpretation of GSM 06.22 spec sections 5.2.3 and 5.2.4 is that
+  unlike FR and EFR, in the case of the HR codec the second lost SID (second
+  occurrence of BFI instead of SID update in TAF position) does _not_ trigger
+  CN muting; instead this muting is supposed to kick in on the _third_ lost SID
+  occurrence.  (The difference in the spec was likely motivated by TAF positions
+  occurring every 240 ms with HR instead of every 480 ms with FR & EFR.)  My
+  reading of the reference C code agrees with my reading of the spec - yet
+  Nokia's TFO implementation initiates CN muting in the frame following the
+  second lost SID, not the third.
+
+* Aside from the criterion for its initiation, the actual CN muting logic
+  behaves exactly like the reference C code: R0 is decremented by 2 on each
+  output frame following the TAF that initiates this sequence, and once R0
+  reaches 0, it stays there while this zero-magnitude CN output continues
+  indefinitely.
+
+* With DTXd=1 CN output is replaced with repeated retransmission of the same
+  SID whose parameters would have been used for non-interpolated CN with DTXd=0,
+  which also agrees with the rules of GSM 08.62 section 8.2.2 paragraph 2.
+
+* CN muting with DTXd=1 is implemented poorly.  The TRAU emits SID frames with
+  R0 decrementing by 2 on each frame, just as it does for generated CN output
+  that is in the process of being slowly muted, but this design is a poor
+  choice: because the BTS will only transmit one of every 12 SID update frames
+  and the TRAU has no way of knowing which SID will be transmitted, a slow
+  decrement cadence on SID frames themselves (not on CN output) makes no sense.
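+
+The lost-SID criterion and the R0 ramp-down described in this list can be
+sketched as below.  This is only an illustration with hypothetical names: the
+limit of 2 versus 3 lost SIDs is a field so that both behaviours can be
+compared, and clamping an odd R0 to 0 on the last step is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CN muting state; names are illustrative only. */
struct cn_mute_state {
	unsigned lost_sid_count;  /* consecutive TAF positions with no SID */
	unsigned lost_sid_limit;  /* 2 = Nokia TRAU, 3 = author's spec reading */
	bool muting;
	uint8_t r0;		  /* R0 currently being emitted, 0..31 */
};

/* Call once per TAF position; sid_r0 is used only if a SID arrived. */
void cn_mute_on_taf(struct cn_mute_state *st, bool sid_received,
		    uint8_t sid_r0)
{
	assert(st->lost_sid_limit > 0);

	if (sid_received) {
		st->lost_sid_count = 0;
		st->muting = false;
		st->r0 = sid_r0;
		return;
	}
	if (++st->lost_sid_count >= st->lost_sid_limit)
		st->muting = true;
}

/* Call for each output frame after the TAF; returns the R0 to emit. */
uint8_t cn_mute_next_r0(struct cn_mute_state *st)
{
	if (st->muting)
		st->r0 = st->r0 >= 2 ? (uint8_t) (st->r0 - 2) : 0;
	return st->r0;	/* stays at 0 once it gets there */
}
```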
+
+Thoughts for Themyscira implementation
+======================================
+
+Prior to getting Nokia TCSM2 working in our lab and being able to experiment
+with this TRAU, when I was contemplating the idea of potentially implementing
+TFO transform for HRv1 in Themyscira libraries, my main trepidation was how to
+produce comfort noise in the form of "speech" parameter output.  For endpoint
+decoders GSM 06.22 prescribes a bit-exact algorithm with interpolation, but
+that smoothly interpolated CN cannot be readily expressed in terms of parameter
+bits that can be packed into a new HRv1 codec frame.  I thought about
+requantizing the interpolated LPC reflection coefficients on every CN output
+frame, using the same computationally intensive vector quantization algorithm
+as in speech encoding - but because I am not an expert in codec design, it is
+not obvious to me whether or not such an approach would produce good results.
+
+However, seeing that Nokia got away with simply passing R0 and LPC parameters
+along from incoming SID frames to CN output without any interpolation or other
+transformation gives us a huge confidence boost - if Nokia did it, so can we!
+This approach is of course simple, and lends itself readily to an elegant
+implementation.
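+
+A minimal sketch of this pass-through idea follows.  The struct layout and
+function names are hypothetical; the only point being illustrated is that R0
+and the three LPC indices of each emitted CN frame are copied verbatim from the
+most recent SID, with no interpolation.  Generation of the remaining CN frame
+parameters is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the parameters that pass through as-is. */
struct hr_cn_params {
	uint8_t  r0;		/* frame energy index */
	uint16_t lpc[3];	/* LPC1..LPC3 quantizer indices */
};

struct cn_passthru_state {
	struct hr_cn_params last_sid;
	int have_sid;
};

/* Remember R0 and LPC from a newly received SID frame. */
void cn_store_sid(struct cn_passthru_state *st,
		  const struct hr_cn_params *sid)
{
	st->last_sid = *sid;
	st->have_sid = 1;
}

/* Fill R0 and LPC of the next CN output frame: a plain copy. */
void cn_fill_output(const struct cn_passthru_state *st,
		    struct hr_cn_params *out)
{
	assert(st->have_sid);
	*out = st->last_sid;
}
```

+With this arrangement the emitted R0 and LPC change abruptly when a new SID
+update is stored, which matches the behaviour observed on the Nokia TRAU.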
+
+Seeing that Nokia got away with effectively discarding UFI in their TFO
+transform is also a confidence boost - once again if Nokia did it, so can we.
+I plan on keeping the logic that "upgrades" UFI to BFI under certain conditions
+(not sure why Nokia omitted it), but the effect of potentially muting speech in
+the guts of the decoder (past parameter-level manipulation) is not really
+feasible to implement in a TFO transform.
+
+Finally, regarding the logic that takes codevector parameters from errored
+(BFI) frames when the voicing mode matches between the last saved frame and the
+errored frame (logic that exists in the reference C code but not in Nokia's
+TFO transform): I plan on keeping this logic in our version, but Nokia's
+approach will come in handy for handling BFI-no-data frames, a condition that
+does not exist in TDM-based Abis transport or in TFO, but does unfortunately
+exist in IP-based GSM RAN.
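+
+The combined plan can be sketched as follows.  The frame struct is a
+deliberately simplified, hypothetical stand-in (it does not reproduce the real
+HRv1 parameter layout), and a NULL errored-frame pointer is my assumed way of
+signalling BFI-no-data.  Any further processing applied to repeated bad frames
+is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NCODE	4	/* illustrative sizes only */
#define SKETCH_NREST	4

/* Simplified, hypothetical frame representation. */
struct hr_frame_sketch {
	uint8_t  mode;			/* 0 = unvoiced, 1..3 = voiced */
	uint16_t codevec[SKETCH_NCODE];	/* stand-in: codevector parameters */
	uint16_t rest[SKETCH_NREST];	/* stand-in: everything else */
};

/*
 * Build the substitute output for a BFI frame.
 * errored == NULL means no payload bits are available (BFI-no-data),
 * in which case the saved frame is repeated unchanged, as Nokia's
 * TFO transform does for every BFI frame.
 */
void bfi_substitute(const struct hr_frame_sketch *saved,
		    const struct hr_frame_sketch *errored,
		    struct hr_frame_sketch *out)
{
	assert(saved != NULL && out != NULL && out != errored);

	*out = *saved;
	if (errored == NULL)
		return;
	/* same voiced-vs-unvoiced class: take codevectors from new frame */
	if ((saved->mode == 0) == (errored->mode == 0))
		memcpy(out->codevec, errored->codevec, sizeof out->codevec);
}
```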