view doc/TFO-xform/FRv1 @ 56:b32b644b7d96

d144/nokia-tcsm2-atrau.bin: captured A-TRAU output from Nokia TCSM2, fed with ul-input from Ater
author Mychaela Falconia <falcon@freecalypso.org>
date Wed, 25 Sep 2024 07:42:04 +0000

Rx DTX handler situation in FRv1
================================

Before we address the question of how one should implement TFO transform for
FRv1, let's begin with a more basic question: how does the Rx DTX handler (the
"front end" part of the speech decoder in an end-terminal implementation) work
in FRv1?  In both HRv1 and EFR, error-free comfort noise generation functions
of this Rx DTX handler are normative per the specs at bit-exact level, while
error handling functions are specified only as a non-normative example - and
the supplied reference C sources implement the full Rx DTX handler (both the
normative part and the "example" part) as an inseparable part of the speech
decoder.  But not so for FRv1: there is no reference C source and there are no
bit-exact definitions for any part of Rx DTX handler logic.  All Rx DTX handler
functions are defined only in English prose (no code), and even in the most
normative parts the language used in the specs is quite loose.

Based on what is specified (verbally, loosely) in GSM 06.11 and 06.12, there
are two principal ways in which an Rx-ECU-capable, Rx-DTX-capable FRv1 speech
decoder can be implemented:

Fully modular approach: the basic GSM 06.10 decoder block (which is bit-exact,
but cannot handle BFIs or SID frames) remains absolutely unmodified, while the
Rx DTX handler (which includes both error concealment and CN generation) is
implemented as a modular piece, with an "honest-to-god" 260-bit 06.10 frame
interface between the two blocks.

Non-modular approach: the Rx DTX handler and the 06.10-based speech decoder are
integrated more tightly, and there is no possible stream of "pure" 06.10 codec
frames that would produce the same bit-exact PCM output as the actually
implemented "full decoder" with the built-in Rx DTX handler.

A cursory reading of the GSM 06.11 and 06.12 specs strongly suggests that they
call for the fully modular approach as defined above.  However, because neither
spec includes any bit-exact definitions, there is no formal stipulation that
the modular approach shall be used - it is entirely conceivable that someone
could implement a non-modular approach, and they would still be spec-compliant.

Why would anyone implement the non-modular approach when the fully modular one
seems much simpler?  After all, the bit-exact basic 06.10 decoder already
exists - surely it is easier to build a separate front-end to it than dig into
the guts of that pre-existing box?  There is, however, one aspect that could
sway implementors toward the non-modular approach: interpolation of CN
parameter updates during prolonged DTX pauses.  GSM 06.12 (or rather its latest
incarnation as 3GPP TS 46.012) says, at the very end of section 6.1:

"When updating the comfort noise, the parameters above should preferably be
 interpolated over a few frames to obtain smooth transitions."

This kind of CN parameter interpolation is mandatory in the newer HRv1 and EFR
codecs where the CN generator function is defined in bit-exact terms, hence it
makes sense that some implementors may have chosen to back-port the same feature
to FRv1.

CN parameter interpolation: deeper analysis of the problem
==========================================================

How does this interpolation feature affect the choice of modular or non-modular
design?  As a non-expert on the subject of codec design, I am not able to say
authoritatively if it is possible to implement the feature of CN parameter
interpolation (and do it well) while staying with the fully modular design in
which the basic 06.10 decoder block remains absolutely unchanged, or if a
high-quality implementation of this feature would require forgoing modularity
and moving the CN-specific interpolation function somewhere inside that block,
e.g., between the output of GSM 06.10 section 4.2.8 and the input to section
4.2.9, as referenced from section 4.3.3 for the decoder.

We can, however, look at how ETSI handled this problem in other codecs for
which they did mandate CN parameter interpolation in bit-exact form.  HRv1 is
the best point of comparison in this regard because of this detail: the Rx DTX
handler front-end part of the official bit-exact HRv1 decoder (delivered as C
source this time, not just verbiage) is _almost_ modular, i.e., one could
_almost_ detach it into a modular piece whose output could be fed to the
decoder as a new "cleaned up" stream of HRv1 codec frames.  Where is the
"almost" part?  Answer: interpolation of CN parameters!  When the HRv1 decoder
is in CN insertion state, it dequantizes R0 and LPC parameters from SID frames
only when initial and update frames come in - but when it generates the actual
CN between those updates, it performs smooth linear interpolation on the decoded
parameters, *without* requantizing them into something that can be retransmitted
as new HRv1 codec frames representing the CN.
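
The effect of skipping requantization can be illustrated with a toy sketch in
Python.  This is not the HRv1 algorithm: the uniform quantizer with a step of
4, the function names and the sample values are all hypothetical, chosen only
to show that linearly interpolated values generally fall between quantizer
levels, so a strictly frame-based interface would have to requantize them.

```python
STEP = 4  # hypothetical uniform quantizer step, NOT taken from any GSM spec

def interpolate(old, new, n):
    """Linear interpolation from old to new over n frames (integer math)."""
    return [old + (new - old) * k // n for k in range(1, n + 1)]

def requantize(value, step=STEP):
    """Round to the nearest level of the hypothetical uniform quantizer."""
    return (value + step // 2) // step * step

smooth = interpolate(8, 20, 4)             # values a decoder could use directly
framed = [requantize(v) for v in smooth]   # values expressible in codec frames
print(smooth)   # [11, 14, 17, 20]
print(framed)   # [12, 16, 16, 20]
```

In this toy case the requantized ramp has a flat step in the middle, which is
the kind of deviation a non-modular decoder avoids by using the interpolated
values directly.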

Once again, as a non-expert on the subject of codec design, I am not able to say
authoritatively if the same approach that was prescribed by ETSI for HRv1 would
also work for FRv1, or if CN parameter interpolation for FRv1 can be done well
by requantizing the interpolated parameters for each individual CN output frame
and feeding them to a strictly unmodified 06.10 decoder block.  It is the case,
however, that there is no pre-existing implementation available to us which we
can look at that does CN parameter interpolation for FRv1 - the TFO transform
in Nokia TCSM2 does _not_ interpolate - hence without a reference to look at,
this optional feature is a can of worms which we should stay away from.

Front-end part of the speech decoder and TFO transform
======================================================

If the party who implemented the regular end-decoder for FRv1 chose the fully
modular approach, either by disregarding the call for interpolation of CN
parameters (the spec language is "should preferably", rather than "shall") or
by requantizing the interpolated parameters on each CN output frame, then a
corresponding implementation of TFO transform for non-DTXd operation becomes
trivial: the modularized Rx DTX handler front-end can also serve unchanged as
the TFO transform!
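
The structural point can be sketched in Python as follows.  Everything here is
a hypothetical placeholder rather than any real library API: the class and
function names are invented, the 33-byte packed frame size is an assumption,
the "front end" merely repeats the last good frame on BFI instead of doing real
error concealment and CN generation, and the "decoder" returns dummy PCM.

```python
FRAME_BYTES = 33         # assumed packed size of one 260-bit FRv1 frame
SAMPLES_PER_FRAME = 160  # 20 ms of speech at 8000 samples/s

class RxDtxFrontEnd:
    """Hypothetical stand-in for a modular Rx DTX handler.

    The real block performs error concealment and CN generation; this
    placeholder merely repeats the last good frame when BFI is set.
    """
    def __init__(self):
        self.last_good = bytes(FRAME_BYTES)

    def process(self, frame, bfi):
        if not bfi:
            self.last_good = frame
        return self.last_good

def basic_0610_decode(frame):
    """Placeholder for the unmodified GSM 06.10 decoder block."""
    assert len(frame) == FRAME_BYTES
    return [0] * SAMPLES_PER_FRAME   # dummy PCM, no real decoding here

def full_decoder_step(front_end, frame, bfi):
    """End-terminal decoder: front end followed by the basic decoder."""
    return basic_0610_decode(front_end.process(frame, bfi))

def tfo_transform_step(front_end, frame, bfi):
    """TFO transform: the very same front end, with no decoder after it."""
    return front_end.process(frame, bfi)
```

The only difference between the two entry points is whether the cleaned-up
frame is decoded to PCM or forwarded as a codec frame.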

This just-described situation holds for the current Themyscira Wireless
implementation of FRv1 codec, named libgsmfr2.  (The 2 in the library name
refers to the major version of library API and dependency structure; the codec
it implements is still FRv1.)  Specifically:

* The full decoder implementation in libgsmfr2 follows the modular approach:
  the front-end Rx DTX handler preprocessor feeds "cleaned up" FRv1 codec frames
  to an unmodified GSM 06.10 decoder.

* No interpolation is done on CN parameters: as soon as each SID update comes
  in, the new parameters are used immediately for all generated CN frames.

The preprocessor part of libgsmfr2 is thus already suitable to serve as a TFO
transform for FRv1.  However, before formally adopting it as such, I have had a
long-standing desire to see how this function was implemented by other vendors;
particularly, how it's been implemented in real historical TRAUs.

Nokia TCSM2 TRAU implementation
===============================

As of 2024-08, we finally have a working bank-of-TRAUs apparatus in our lab:
Nokia TCSM2.  This TRAU implements TFO for FRv1, HRv1 and EFR, hence we are now
able to see how this vendor (Nokia) implemented the elusive TFO transform.

Here are our findings:

Error concealment function
--------------------------

The Themyscira implementation is based on the "example solution" of TS 46.011
chapter 6; Nokia's implementation appears to be very similar, with only a few
visible differences:

* When the ECU enters the state of "speech muting" (after the first speech-state
  BFI for which the last good speech frame is simply repeated), instead of
  decrementing each of the 4 Xmaxcr numbers by 4, it decrements them by 11,
  thereby producing noticeably faster muting than what the spec calls for.

* The state of emitting fixed silence frames is entered not after the
  algorithmically-muted frame in which the lowest Xmaxcr reached 0 (my reading
  of the "example solution" in the spec), but after the state of algorithmic
  muting (decrementing Xmaxcr's by 11 each time) persisted for exactly 5 frames.
  If the original speech frame had its highest Xmaxcr equal to 63, the last
  algorithmically muted frame before fixed silence frames will have 8 in that
  Xmaxcr; if all starting Xmaxcr numbers were low, there will be 5 frames with
  all zeros in Xmaxcr, random Mcr and other parameters unchanged before the
  switch to fixed silence frames.
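
The observed Xmaxcr progression can be reproduced with a short Python sketch.
It models only the behaviour described in the two points above (decrement by
11, floor at 0, exactly 5 muted frames); the function name is made up for
illustration, and Mcr randomization and the subsequent fixed silence frames are
not modelled.

```python
def nokia_speech_muting(xmaxcr, step=11, frames=5):
    """Xmaxcr values of each algorithmically muted frame, in output order.

    xmaxcr: the 4 Xmaxcr values (0..63) of the last good speech frame.
    """
    muted = []
    current = list(xmaxcr)
    for _ in range(frames):
        # every Xmaxcr drops by `step`, but never below zero
        current = [max(x - step, 0) for x in current]
        muted.append(current)
    return muted

for frame in nokia_speech_muting([63, 40, 10, 0]):
    print(frame)
# [52, 29, 0, 0]
# [41, 18, 0, 0]
# [30, 7, 0, 0]
# [19, 0, 0, 0]
# [8, 0, 0, 0]
```

The last line shows the case noted above: a starting Xmaxcr of 63 is still at 8
in the final algorithmically muted frame.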

Nokia's TFO transform exhibits additional logic whereby the first good speech
frame after prolonged BFIs has its highest Xmaxcr reduced (but not messed with
otherwise); if that good speech frame is again followed by BFIs, the ECU goes
back to silence frame output right away - or at least that's what we saw in one
experiment.  This aspect has not been studied in detail.

Comfort noise generation (DTXd=0)
---------------------------------

The comfort noise output from Nokia's TFO transform generally agrees with my
reading of GSM 06.12 spec section 6.1, the section that describes CN generation.
However, the following parts were surprising/unexpected:

1) The TRAU reacts to SID updates with a delay of 24 frames.  Suppose that frame
   #20 in the input is the initial SID, frame #24 (TAF position) is the first
   SID update, frame #48 is the next SID update and so forth.  In the output
   from Nokia's TFO transform, the updated parameters from input frame #24 will
   appear in output frame #48, those from input frame #48 will appear in output
   frame #72 and so forth.  There is no sensible explanation for this extraneous
   buffering delay; at first I thought it was an artifact of the CN parameter
   interpolation mechanism, but:

2) No interpolation is done!  I deliberately constructed input sequences in
   which each subsequent SID update has wildly different parameters from the
   previous, and when the changeover does happen in the DL output after the
   strange delay of 24 frames, the change is immediate and abrupt.
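
   A minimal Python model of this observed timing, under stated assumptions:
   the function name and the dict-based representation of SID updates are
   invented for illustration, and the timing of the initial SID is not
   modelled (the `initial` argument simply stands for whatever was in effect
   before the first delayed update).

```python
SID_UPDATE_DELAY = 24   # frames, as observed on the TCSM2 output

def cn_params_in_effect(initial, updates, out_frame, delay=SID_UPDATE_DELAY):
    """Return the CN parameter set used for output frame number out_frame.

    initial: whatever was in effect before any update (timing not modelled)
    updates: dict mapping input frame number of a SID update -> parameter set
    """
    # an update received in input frame n is applied from output frame n+delay
    arrived = [n for n in updates if n + delay <= out_frame]
    return updates[max(arrived)] if arrived else initial

updates = {24: "A", 48: "B"}
for n in (47, 48, 71, 72):
    print(n, cn_params_in_effect("initial", updates, n))
# 47 initial
# 48 A
# 71 A
# 72 B
```

   The switch from one parameter set to the next is a step change, with no
   interpolation in between.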

CN muting after two missed SID updates (BFI received instead of SID in the TAF
position twice in a row) is done the same way as speech muting: the TRAU emits
exactly 5 frames with decreasing Xmaxcr (same decrement by 11), then switches
to emitting fixed silence frames.

SID forwarding (DTXd=1)
-----------------------

When DTXd is enabled on the destination call leg (call leg B) and the input
frame stream to the TFO transform includes SID frames (considering only valid
SID for now), the transform does not generate comfort noise - instead, received
SID frames are passed through to call leg B DL, unless they are invalid SID or
the muting mechanism has to kick in because of lost SID updates.

Nokia's implementation does pass valid SID frames through (I haven't tested
invalid SID yet), but it applies the same weird delay of 24 frames to the
switchover point for each update as it does when generating CN for DTXd=0.

However, the part where Nokia's TFO transform (at least for FRv1) is plain
broken is CN muting in the case of lost SID updates.  Here is what it does: it
decrements Xmaxcr by 4 (yes, by 4, not by 11) once every 24 frames (probably in
each TAF position), such that if the level of CN was very high before channel
breakdown, it will take up to 7.68 s before this CN is fully muted at the end
receiver.

GSM 06.12 section 5.4 says: "For the second lost SID frame, a muting technique
shall be used on the comfort noise that will gradually decrease the output
level, resulting in silencing of the output after a maximum of 320 ms."  The
spec gives a maximum of 320 ms for total muting of CN, but with Nokia's TFO
transform in the DTXd=1 case, that maximum time is 7.68 s, 24 times the limit -
the spec requirement is violated.
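
The 7.68 s figure follows from simple arithmetic, worked through here in
Python.  The sketch assumes the worst case of Xmaxcr starting at its maximum of
63 and a frame duration of 20 ms; the function name is invented for
illustration.

```python
import math

FRAME_MS = 20      # duration of one FRv1 speech frame
XMAXCR_MAX = 63    # largest value of the 6-bit Xmaxcr field

def worst_case_mute_ms(step, frames_per_step):
    """Time for Xmaxcr to fall from its maximum to 0, in milliseconds."""
    steps = math.ceil(XMAXCR_MAX / step)
    return steps * frames_per_step * FRAME_MS

print(worst_case_mute_ms(4, 24))   # 7680: decrement by 4 once per 24 frames
print(worst_case_mute_ms(4, 1))    # 320: decrement by 4 on every frame
```

For comparison, the same decrement of 4 applied on every frame would reach
zero within the 320 ms limit.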

Only TFO, or regular FRv1 decoder too?
--------------------------------------

How does the regular FRv1 speech decoder (the one that ultimately emits G.711)
in Nokia TCSM2 TRAU implementation compare to what we've observed with their
TFO transform?  Do they use a modular design where the regular decoder is a copy
of the same TFO transform followed by a standard GSM 06.10 decoder block, or do
they do something fancier?

Unfortunately we have no realistic way to answer this question: Nokia chose to
not implement the optional in-band homing mechanism for FRv1, thus we have no
way to pass test sequences through the TRAU in the decoder direction and see if
the output matches our hypothesis as to decoder logic.  Hence the TFO transform
is the only part whose detailed behaviour we can realistically study in this
TRAU.

Take-away for Themyscira implementation
=======================================

My take-away points from the preceding examination of FRv1 TFO transform in
Nokia TCSM2 are:

* Our current Rx DTX handler front-end in libgsmfr2 is fine - Nokia's
  implementation is not any fancier, at least in the case of TFO.

* Modularity is a good thing, and so is consistency.  There is nothing wrong
  with using the same Rx DTX handler block both as our TFO transform and as the
  front-end portion of the full decoder in end terminal operation.