diff doc/TFO-xform/FRv1 @ 34:35d38348c880

doc/TFO-xform/FRv1: article written
author Mychaela Falconia <falcon@freecalypso.org>
date Sun, 01 Sep 2024 06:28:35 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/TFO-xform/FRv1	Sun Sep 01 06:28:35 2024 +0000
@@ -0,0 +1,237 @@
+Rx DTX handler situation in FRv1
+================================
+
+Before we address the question of how one should implement TFO transform for
+FRv1, let's begin with a more basic question: how does the Rx DTX handler (the
+"front end" part of the speech decoder in an end-terminal implementation) work
+in FRv1?  In both HRv1 and EFR, error-free comfort noise generation functions
+of this Rx DTX handler are normative per the specs at bit-exact level, while
+error handling functions are specified only as a non-normative example - and
+the supplied reference C sources implement the full Rx DTX handler (both the
+normative part and the "example" part) as an inseparable part of the speech
+decoder.  But not so for FRv1: there is no reference C source and there are no
+bit-exact definitions for any part of Rx DTX handler logic.  All Rx DTX handler
+functions are defined only in English prose (no code), and even in the most
+normative parts the language used in the specs is quite loose.
+
+Based on what is specified (verbally, loosely) in GSM 06.11 and 06.12, there
+are two principal ways in which an Rx-ECU-capable, Rx-DTX-capable FRv1 speech
+decoder can be implemented:
+
+Fully modular approach: the basic GSM 06.10 decoder block (which is bit-exact,
+but cannot handle BFIs or SID frames) remains absolutely unmodified, while the
+Rx DTX handler (which includes both error concealment and CN generation) is
+implemented as a modular piece, with an "honest-to-god" 260-bit 06.10 frame
+interface between the two blocks.
+
+Non-modular approach: the Rx DTX handler and the 06.10-based speech decoder are
+integrated more tightly, and there is no possible stream of "pure" 06.10 codec
+frames that would produce the same bit-exact PCM output as the actually
+implemented "full decoder" with the built-in Rx DTX handler.
+
+A cursory reading of the GSM 06.11 and 06.12 specs strongly suggests that they
+call for the fully modular approach as defined above.  However, because neither
+spec includes any bit-exact definitions, there is no formal stipulation that
+the modular approach shall be used - it is entirely conceivable that someone
+could implement a non-modular approach, and they would still be spec-compliant.
+
+Why would anyone implement the non-modular approach when the fully modular one
+seems much simpler?  After all, the bit-exact basic 06.10 decoder already
+exists - surely it is easier to build a separate front-end to it than dig into
+the guts of that pre-existing box?  There is, however, one aspect that could
+sway implementors toward the non-modular approach: interpolation of CN
+parameter updates during prolonged DTX pauses.  GSM 06.12 (or rather its latest
+incarnation as 3GPP TS 46.012) says, at the very end of section 6.1:
+
+"When updating the comfort noise, the parameters above should preferably be
+ interpolated over a few frames to obtain smooth transitions."
+
+This kind of CN parameter interpolation is mandatory in the newer HRv1 and EFR
+codecs where the CN generator function is defined in bit-exact terms, hence it
+makes sense that some implementors may have chosen to back-port the same feature
+to FRv1.
+
+CN parameter interpolation: deeper analysis of the problem
+==========================================================
+
+How does this interpolation feature affect the choice of modular or non-modular
+design?  As a non-expert on the subject of codec design, I am not able to say
+authoritatively whether it is possible to implement CN parameter interpolation
+(and do it well) while staying with the fully modular design in which the basic
+06.10 decoder block remains absolutely unchanged, or whether a high-quality
+implementation of this feature would require forgoing the modularity and moving
+the CN-specific interpolation function somewhere inside that block, e.g.,
+between the output of GSM 06.10 section 4.2.8 and the input to section 4.2.9,
+as referenced from section 4.3.3 for the decoder.
+
+We can, however, look at how ETSI handled this problem in other codecs for
+which they did mandate CN parameter interpolation in bit-exact form.  HRv1 is
+the best point of comparison in this regard because of this detail: the Rx DTX
+handler front-end part of the official bit-exact HRv1 decoder (delivered as C
+source this time, not just verbiage) is _almost_ modular, i.e., one could
+_almost_ detach it into a modular piece whose output could be fed to the
+decoder as a new "cleaned up" stream of HRv1 codec frames.  Where is the
+"almost" part?  Answer: interpolation of CN parameters!  When the HRv1 decoder
+is in CN insertion state, it dequantizes R0 and LPC parameters from SID frames
+only when initial and update frames come in - but when it generates the actual
+CN between those updates, it performs smooth linear interpolation on the decoded
+parameters, *without* requantizing them into something that can be retransmitted
+as new HRv1 codec frames representing the CN.
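+
+To make the idea concrete, here is a minimal sketch of interpolating one
+dequantized parameter between two SID updates without requantizing.  This is
+not the HRv1 reference code: the function name cn_interp, the use of floating
+point and the choice of interpolation period are all assumptions made purely
+for illustration.
+
```c
#include <assert.h>

/*
 * Hypothetical helper: linearly interpolate one dequantized CN parameter.
 * old_val is the value from the previous SID, new_val the value from the
 * latest SID update, k the number of frames since that update arrived
 * (0..period), and period the number of frames over which to interpolate.
 */
double cn_interp(double old_val, double new_val, int k, int period)
{
    assert(period > 0 && k >= 0 && k <= period);
    return old_val + (new_val - old_val) * (double) k / (double) period;
}
```
+
+The intermediate values produced this way generally fall between quantizer
+steps, which is why they cannot be re-emitted as codec frames unless a
+requantization step is added.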
+
+Once again, as a non-expert on the subject of codec design, I am not able to say
+authoritatively if the same approach that was prescribed by ETSI for HRv1 would
+also work for FRv1, or if CN parameter interpolation for FRv1 can be done well
+by requantizing the interpolated parameters for each individual CN output frame
+and feeding them to a strictly unmodified 06.10 decoder block.  What we do know
+is that no existing implementation available to us performs CN parameter
+interpolation for FRv1 - the TFO transform in Nokia TCSM2 does _not_
+interpolate - hence without a reference to look at, this optional feature is a
+can of worms which we should stay away from.
+
+Front-end part of the speech decoder and TFO transform
+======================================================
+
+If the party who implemented the regular end-decoder for FRv1 chose the fully
+modular approach, either by disregarding the call for interpolation of CN
+parameters (the spec language is "should preferably", rather than "shall") or
+by requantizing the interpolated parameters on each CN output frame, then a
+corresponding implementation of TFO transform for non-DTXd operation becomes
+trivial: the modularized Rx DTX handler front-end can also serve unchanged as
+the TFO transform!
+
+This just-described situation holds for the current Themyscira Wireless
+implementation of FRv1 codec, named libgsmfr2.  (The 2 in the library name
+refers to the major version of library API and dependency structure; the codec
+it implements is still FRv1.)  Specifically:
+
+* The full decoder implementation in libgsmfr2 follows the modular approach:
+  the front-end Rx DTX handler preprocessor feeds "cleaned up" FRv1 codec frames
+  to an unmodified GSM 06.10 decoder.
+
+* No interpolation is done on CN parameters: as soon as each SID update comes
+  in, the new parameters are used immediately for all generated CN frames.
+
+The preprocessor part of libgsmfr2 is thus already suitable to serve as a TFO
+transform for FRv1.  However, before formally adopting it as such, I have had a
+long-standing desire to see how this function was implemented by other vendors;
+particularly, how it's been implemented in real historical TRAUs.
+
+Nokia TCSM2 TRAU implementation
+===============================
+
+As of 2024-08, we finally have a working bank-of-TRAUs apparatus in our lab:
+Nokia TCSM2.  This TRAU implements TFO for FRv1, HRv1 and EFR, hence we are now
+able to see how this vendor (Nokia) implemented the elusive TFO transform.
+
+Here are our findings:
+
+Error concealment function
+--------------------------
+
+The Themyscira implementation is based on the "example solution" of TS 46.011
+chapter 6; Nokia's implementation appears to be very similar, with only a few
+visible differences:
+
+* When the ECU enters the state of "speech muting" (after the first speech-state
+  BFI for which the last good speech frame is simply repeated), instead of
+  decrementing each of the 4 Xmaxcr numbers by 4, it decrements them by 11,
+  thereby producing noticeably faster muting than what the spec calls for.
+
+* The state of emitting fixed silence frames is entered not after the
+  algorithmically-muted frame in which the lowest Xmaxcr reached 0 (my reading
+  of the "example solution" in the spec), but after the state of algorithmic
+  muting (decrementing Xmaxcr's by 11 each time) persisted for exactly 5 frames.
+  If the original speech frame had its highest Xmaxcr equal to 63, the last
+  algorithmically muted frame before fixed silence frames will have 8 in that
+  Xmaxcr; if all starting Xmaxcr numbers were low, there will be 5 frames with
+  all zeros in Xmaxcr, random Mcr and other parameters unchanged before the
+  switch to fixed silence frames.
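+
+The muting arithmetic described in the two points above can be sketched as
+follows.  This is only a model of the observed Xmaxcr behaviour, not code from
+libgsmfr2 or from Nokia: the function names mute_step and mute_frames are
+hypothetical, and clamping at 0 is assumed from the "all zeros" observation.
+
```c
#include <assert.h>

#define NUM_SUBFRAMES   4   /* an FRv1 frame carries 4 Xmaxcr values */
#define MUTE_STEP_SPEC  4   /* decrement in the TS 46.011 example solution */
#define MUTE_STEP_NOKIA 11  /* decrement observed on Nokia TCSM2 */
#define NOKIA_MUTE_FRAMES 5 /* muted frames before fixed silence frames */

/*
 * Apply one muting step to all Xmaxcr values, clamping at 0.
 * Returns the highest Xmaxcr remaining after the step.
 */
int mute_step(int xmaxcr[NUM_SUBFRAMES], int step)
{
    int i, highest = 0;

    for (i = 0; i < NUM_SUBFRAMES; i++) {
        xmaxcr[i] = xmaxcr[i] > step ? xmaxcr[i] - step : 0;
        if (xmaxcr[i] > highest)
            highest = xmaxcr[i];
    }
    return highest;
}

/*
 * Apply nframes consecutive muting steps.  Returns the highest Xmaxcr
 * in the last algorithmically muted frame.
 */
int mute_frames(int xmaxcr[NUM_SUBFRAMES], int step, int nframes)
{
    int n, highest = 0;

    assert(nframes > 0);
    for (n = 0; n < nframes; n++)
        highest = mute_step(xmaxcr, step);
    return highest;
}
```
+
+Starting from a highest Xmaxcr of 63, mute_frames() with MUTE_STEP_NOKIA and
+NOKIA_MUTE_FRAMES leaves 63 - 5*11 = 8 in that Xmaxcr, matching the
+observation above.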
+
+Nokia's TFO transform exhibits additional logic whereby the first good speech
+frame after prolonged BFIs has its highest Xmaxcr reduced (but not messed with
+otherwise); if that good speech frame is again followed by BFIs, the ECU goes
+back to silence frame output right away - or at least that's what we saw in one
+experiment.  This aspect has not been studied in detail.
+
+Comfort noise generation (DTXd=0)
+---------------------------------
+
+The comfort noise output from Nokia's TFO transform generally agrees with my
+reading of GSM 06.12 spec section 6.1, the section that describes CN generation.
+However, the following parts were surprising/unexpected:
+
+1) The TRAU reacts to SID updates with a delay of 24 frames.  Suppose that frame
+   #20 in the input is the initial SID, frame #24 (TAF position) is the first
+   SID update, frame #48 is the next SID update and so forth.  In the output
+   from Nokia's TFO transform, the updated parameters from input frame #24 will
+   appear in output frame #48, those from input frame #48 will appear in output
+   frame #72 and so forth.  There is no sensible explanation for this extraneous
+   buffering delay; at first I thought it was an artifact of the CN parameter
+   interpolation mechanism, but:
+
+2) No interpolation is done!  I deliberately constructed input sequences in
+   which each subsequent SID update has wildly different parameters from the
+   previous, and when the changeover does happen in the DL output after the
+   strange delay of 24 frames, the change is immediate and abrupt.
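+
+The two observations above (delayed switchover, no interpolation) can be
+modeled with a small sketch.  The struct sid_update type, the cn_level_at
+function and the single "level" number standing in for the full set of CN
+parameters are all hypothetical; the timing of the initial SID is not modeled.
+
```c
#include <assert.h>
#include <stddef.h>

#define SID_UPDATE_DELAY 24  /* delay in frames observed on Nokia TCSM2 */

struct sid_update {
    int in_frame;   /* input frame number in which the SID update arrived */
    int level;      /* stand-in for the CN parameters it carries */
};

/*
 * Return the level in effect at output frame out_frame, or -1 if no
 * update has taken effect yet.  Updates must be sorted by in_frame.
 * The switch is abrupt: there is no interpolation between values.
 */
int cn_level_at(const struct sid_update *upd, size_t n, int out_frame)
{
    size_t i;
    int level = -1;

    assert(upd != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (upd[i].in_frame + SID_UPDATE_DELAY <= out_frame)
            level = upd[i].level;
    }
    return level;
}
```
+
+With updates arriving in input frames #24 and #48, this model switches to the
+first update's parameters at output frame #48 and to the second update's
+parameters at output frame #72.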
+
+CN muting after two missed SID updates (BFI received instead of SID in the TAF
+position twice in a row) is done the same way as speech muting: the TRAU emits
+exactly 5 frames with decreasing Xmaxcr (same decrement by 11), then switches
+to emitting fixed silence frames.
+
+SID forwarding (DTXd=1)
+-----------------------
+
+When DTXd is enabled on the destination call leg and the input frame stream to
+the TFO transform includes SID frames (considering only valid SID for now), the
+transform does not generate comfort noise - instead, received SID frames are
+passed through to the DL of the destination call leg (leg B), unless they are
+invalid SID or the muting mechanism has to kick in because of lost SID updates.
+
+Nokia's implementation does pass valid SID frames through (I haven't tested
+invalid SID yet), but it applies the same weird delay of 24 frames to the
+switchover point for each update as it does when generating CN for DTXd=0.
+
+However, the part where Nokia's TFO transform (at least for FRv1) is plain
+broken is CN muting in the case of lost SID updates.  Here is what it does: it
+decrements Xmaxcr by 4 (yes, by 4, not by 11) once every 24 frames (probably in
+each TAF position), such that if the level of CN was very high before channel
+breakdown, it will take up to 7.68 s before this CN is fully muted at the end
+receiver.
+
+GSM 06.12 section 5.4 says: "For the second lost SID frame, a muting technique
+shall be used on the comfort noise that will gradually decrease the output
+level, resulting in silencing of the output after a maximum of 320 ms."  The
+spec gives a maximum of 320 ms for total muting of CN, but with Nokia's TFO
+transform in DTXd=1 case, that maximum time is 7.68 s - spec requirement
+violated.
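+
+The 7.68 s figure follows from simple arithmetic, sketched below.  The function
+name worst_case_mute_ms is hypothetical; the only inputs are the 20 ms frame
+duration, the 6-bit range of Xmaxcr (0 to 63) and the decrement and period
+described above.
+
```c
#include <assert.h>

#define FRAME_MS    20  /* duration of one FRv1 speech frame */
#define XMAXCR_MAX  63  /* Xmaxcr is a 6-bit field */

/*
 * Worst-case time in ms until an Xmaxcr starting at XMAXCR_MAX reaches 0,
 * given a decrement of `step` applied once every `period` frames.
 */
int worst_case_mute_ms(int step, int period)
{
    int steps;

    assert(step > 0 && period > 0);
    steps = (XMAXCR_MAX + step - 1) / step;  /* ceiling division */
    return steps * period * FRAME_MS;
}
```
+
+A decrement of 4 once every 24 frames gives 16 * 24 * 20 ms = 7680 ms, whereas
+the same decrement applied on every frame gives 16 * 20 ms = 320 ms, which is
+consistent with the maximum stated in the spec.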
+
+Only TFO, or regular FRv1 decoder too?
+--------------------------------------
+
+How does the regular FRv1 speech decoder (the one that ultimately emits G.711)
+in Nokia TCSM2 TRAU implementation compare to what we've observed with their
+TFO transform?  Do they use a modular design where the regular decoder is a copy
+of the same TFO transform followed by a standard GSM 06.10 decoder block, or do
+they do something fancier?
+
+Unfortunately we have no realistic way to answer this question: Nokia chose not
+to implement the optional in-band homing mechanism for FRv1, thus we have no
+way to pass test sequences through the TRAU in the decoder direction and see if
+the output matches our hypothesis as to decoder logic.  Hence the TFO transform
+is the only part whose detailed behaviour we can realistically study in this
+TRAU.
+
+Take-away for Themyscira implementation
+=======================================
+
+My take-away points from the preceding examination of FRv1 TFO transform in
+Nokia TCSM2 are:
+
+* Our current Rx DTX handler front-end in libgsmfr2 is fine - Nokia's
+  implementation is no fancier, at least in the case of TFO.
+
+* Modularity is a good thing, and so is consistency.  There is nothing wrong
+  with using the same Rx DTX handler block both as our TFO transform and as the
+  front-end portion of the full decoder in end terminal operation.