view doc/TFO-xform/Theory @ 33:e828468b0afd

doc/TFO-xform/Theory: article written
author Mychaela Falconia <falcon@freecalypso.org>
date Sat, 31 Aug 2024 20:45:25 +0000

TFO transform from uplink to downlink
=====================================

With all 3 classic GSM codecs (FRv1, HRv1, EFR) the original architecture calls
for a network-side transcoder (TRAU) on each individual call leg.  The
implications are:

* The uplink runs from the MS to the speech decoder in the TRAU that turns the
  mobile-generated speech into 64 kbit/s G.711.  The Rx DTX handler, a subblock
  of that speech decoder in the TRAU, handles error concealment (substitution
  and muting of lost frames) and comfort noise insertion during DTXu pauses,
  and once this speech stream has been transcoded to G.711, all trace of these
  GSM-specific effects disappears.

* The downlink runs from the speech encoder in the TRAU to TCH DL radio output
  from the BTS.  Because the DL frame stream comes from a free-running speech
  encoder, it never contains errored frames or invalid SID or any other
  aberrations: without DTXd, this frame stream is 100% good speech frames, and
  with DTXd, it is a mixture of good speech and valid SID frames.

But suppose you have two mobile call legs (mobile user Alice calls mobile user
Bob), and you wish to eliminate the quality-degrading effect of double or tandem
transcoding by passing compressed speech frames directly from Alice to Bob and
vice-versa - what happens now?  The UL frame stream from each call leg will
contain BFI frame gaps that are never allowed in DL, and if the network deploys
DTX only in the UL direction (DTXu without DTXd, a very sensible choice for
small-capacity single-carrier cells), the representation of DTXu pauses coming
from each call leg (SID frames followed by prolonged BFI gaps) is also not
suitable for direct passing to the DL of the opposite call leg.
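The mismatch between what UL delivers and what DL may carry can be made concrete
with a small sketch.  The frame categories and the function name below are
illustrative assumptions, not taken from any spec or library; the only rule
encoded is the one stated above: DL carries good speech frames only, plus valid
SID frames when DTXd is in use.

```python
from enum import Enum


class FrameType(Enum):
    """Hypothetical frame categories, named for illustration only."""
    SPEECH = "good speech"
    SID = "valid SID"
    INVALID_SID = "invalid SID"
    BFI = "bad or missing frame"


def dl_violations(stream, dtxd):
    """Return indices of frames that a downlink is not allowed to carry."""
    allowed = {FrameType.SPEECH}
    if dtxd:
        allowed.add(FrameType.SID)
    return [i for i, frame in enumerate(stream) if frame not in allowed]


# A DTXu pause as seen on Alice's uplink: one SID, then a BFI gap.
ul = [FrameType.SPEECH, FrameType.SID, FrameType.BFI, FrameType.BFI,
      FrameType.SPEECH]
print(dl_violations(ul, dtxd=False))  # [1, 2, 3]
print(dl_violations(ul, dtxd=True))   # [2, 3]
```

Even with DTXd on Bob's leg the BFI gap remains unacceptable, so some transform
is needed in both configurations.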

The solution offered in the TFO spec (GSM 08.62) is a special transform from
call leg A UL to call leg B DL.  This transform has no official name that I
could find, so I call it the "TFO transform".  In the original GSM 08.62 spec
(up to R99) this TFO transform is described in sections 8.2.1 and 8.2.2; when
the spec changed to 28.062 with 3GPP Release 4 (adding AMR in GSM and AMR-only
UMTS), the description of the TFO transform for classic GSM codecs moved to
section C.3.2.1.1.

However, both spec versions only say what "shall" be done without any guidance
on how to do it algorithmically: the spec language is "subject to manufacturer
dependent future improvements and is not part of this recommendation."
Distilling the problem to its essence, the addition of TFO introduces a new type
of logical transform on codec frames (and a stateful one at that!) that never
appeared previously anywhere in classic GSM architecture, is not mentioned in
any other spec, and is not addressed at all by any of the reference codec
sources.  This new transform is implemented only in the TFO block in TRAUs and
nowhere else (in classic GSM architecture), and can be exercised only by
establishing a TFO call between two interworking TRAUs.

There are 3 main parts to this TFO transform, 3 main areas where anyone who
seeks to implement this transform has to think hard and come up with an
innovative solution:

1) Error concealment in non-DTX speech: if an errored frame (BFI) appears after
   non-SID speech frames (meaning non-DTX speech), the transform has to fill in
   substitution/muting "speech" frames (meaning codec frames that look like
   valid speech frames) in the stream going to call leg B DL.

2) Comfort noise insertion: if the incoming frame stream from call leg A UL
   contains SID frames (DTXu) but the same are not allowed on call leg B DL
   (no DTXd), the transform has to insert "speech" frames (in the same
   parenthetical meaning) that represent comfort noise, as intended by Alice's
   phone that transmitted SID with certain CN parameters.

3) Comfort noise muting: handling the case where the incoming UL frame stream
   goes into CN insertion state (via one or more SID frames), but then goes
   total BFI, with no more SID update frames appearing in TAF positions.  In
   the case of a single codec leg from a source encoder to an end decoder,
   standard decoders are required by their respective DTX specs to gradually
   mute their CN output, to indicate channel breakdown to the user - the TFO
   transform has to produce the same effect.

All 3 of the just-listed functions are explicitly called out in the TFO spec, in
each case with the same language of "shall" followed by "subject to manufacturer
dependent future improvements and is not part of this recommendation."
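Because the transform is stateful, it may help to see how the three functions
are selected frame by frame.  The following is only a sketch under stated
assumptions: the action names, the (kind, taf) input representation and the
mute_after_lost_sid threshold are all hypothetical, invalid SID frames are not
handled, and the actual rules for when CN muting begins have to come from each
codec's DTX spec.

```python
def classify(frames, mute_after_lost_sid=2):
    """Map each input frame to the transform function it calls for.

    frames is a sequence of (kind, taf) pairs, where kind is "speech",
    "sid" or "bfi" and taf flags a position where a SID update was
    expected.  The threshold is an assumed parameter, not a spec value.
    """
    in_cn = False   # True while in comfort noise insertion state
    lost_sid = 0    # SID updates missed in TAF positions since the last SID
    actions = []
    for kind, taf in frames:
        if kind == "speech":
            in_cn, lost_sid = False, 0
            actions.append("pass")
        elif kind == "sid":
            in_cn, lost_sid = True, 0
            actions.append("cn_insert")
        elif kind == "bfi":
            if not in_cn:
                # BFI after non-DTX speech: function 1, error concealment
                actions.append("conceal")
            else:
                if taf:
                    lost_sid += 1
                muting = lost_sid >= mute_after_lost_sid
                actions.append("cn_mute" if muting else "cn_insert")
        else:
            raise ValueError(f"unknown frame kind: {kind!r}")
    return actions


example = [("speech", False), ("bfi", False), ("speech", False),
           ("sid", False), ("bfi", False), ("bfi", True),
           ("bfi", False), ("bfi", True), ("bfi", False)]
print(classify(example))
```

The same BFI input maps to three different actions depending on what came
before it, which is exactly why none of this can be done with a stateless
per-frame filter.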

DTXd or no DTXd
===============

When the destination call leg operates without DTXd, the TFO transform can only
emit frames that are well-formed speech frames for the respective codec, no SID
frames.  In this case the transform has to do "everything", all 3 of the listed
functions, although the last function of CN muting may be either separate or
absorbed into the CN generation function, depending on the codec.
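To illustrate what "absorbed" could mean, here is a minimal sketch in which the
CN generator itself carries a noise level, and muting is nothing more than that
level being stepped down on each generated frame.  The abstract integer level,
the step size and the function name are assumptions made for illustration; they
do not correspond to the actual parameters of any of the three codecs.

```python
def cn_levels(sid_level, n_frames, mute_from=None, step=4):
    """Return the noise level used for each generated CN "speech" frame.

    sid_level stands in for whatever level the last SID conveyed;
    mute_from is the frame index at which muting starts (None = never).
    """
    level = sid_level
    levels = []
    for i in range(n_frames):
        if mute_from is not None and i >= mute_from:
            level = max(0, level - step)  # decay toward silence
        levels.append(level)
    return levels


print(cn_levels(10, 6, mute_from=2))  # [10, 10, 6, 2, 0, 0]
```

In a codec where no such convenient level parameter exists, muting would have
to be a separate step applied after CN frame generation.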

OTOH, when call leg B has DTXd enabled/allowed, there is more room for
additional complexity.  The simplest solution would be to not make use of DTXd
capability and always emit speech frames - but the problem with this simple
approach is teleological.  If a GSM network operator runs with DTXd enabled,
presumably that operator seeks to reap the benefits of DTXd as in reduction of
radio interference, in which case a TFO transform that fails to make use of DTXd
capability would defeat the purpose.  Hence if someone sets out to implement a
TFO transform that supports full utilization of DTXd, they would have to do
additional work:

* The function of CN insertion in the transform _mostly_ goes away: if a valid
  SID frame comes, the TRAU caches it and repeats it continuously until the
  next SID update, allowing the BTS to select which SID frames it will actually
  transmit based on its SACCH alignment.  But more complex handling is still
  needed if the first SID frame (the one that begins the CN insertion period)
  arrives as an invalid SID, and the function of CN muting takes on new
  significance.

* CN muting: when the cached SID expires and no new SID updates arrive in TAF
  positions, the TFO transform has to indicate somehow to Bob that Alice's call
  leg is having trouble, which will be easy or difficult depending on what rules
  are specified in the codec specs for SID interpolation in the final receiver.

* Error concealment in non-DTX speech: at first glance this function appears to
  be exactly the same whether DTXd is used or not.  But consider the case of
  total channel breakdown, such that the incoming frame stream becomes all BFI:
  how should this case be handled?  In the absence of DTXd, the output of the
  TFO transform becomes a stream of silence frames, meaning some kind of
  "speech" frames that produce total silence at the end decoder.  But if the
  network operates with DTXd with the aim of reducing radio interference, these
  silence "speech" frames should be replaced with SIDs whose parameters are
  chosen to produce silent output.
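The cache-and-repeat behaviour from the first bullet, together with the expiry
condition from the second, can be sketched as follows.  The class name, its
methods and the expiry_frames lifetime are hypothetical; in particular the
lifetime is an assumed parameter, and what to emit after expiry is precisely
the open CN muting question described above.

```python
class SidCache:
    """Cache the last valid SID and repeat it until it expires."""

    def __init__(self, expiry_frames=48):
        self.expiry_frames = expiry_frames  # assumed lifetime, in frames
        self.sid = None
        self.age = 0

    def update(self, sid_payload):
        """Store a newly received valid SID and restart its lifetime."""
        self.sid = sid_payload
        self.age = 0

    def next_output(self):
        """Return the SID to repeat in this frame, or None once expired."""
        if self.sid is None or self.age >= self.expiry_frames:
            return None
        self.age += 1
        return self.sid


cache = SidCache(expiry_frames=3)
cache.update(b"\x01")
outputs = [cache.next_output() for _ in range(5)]
print(outputs)  # [b'\x01', b'\x01', b'\x01', None, None]
```

A None result marks the point where the transform can no longer simply repeat
the cached SID and has to signal trouble to Bob in some codec-dependent way.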

Current approach in Themyscira libraries
========================================

There is a desire to implement the TFO transform for all 3 classic GSM codecs in
the Themyscira Wireless suite of GSM codec libraries, and the first question to
be decided is the policy with regard to DTXd.

The current approach is to not implement any DTXd support, i.e., implement the
TFO transform only in its basic no-DTXd form.  This decision is based on the
reality of small-capacity single-carrier cells: given that the
total number of humans who actually _want_ to use GSM (as opposed to whatever
latest 4G/5G/etc is peddled by Big Tech mafia) is vanishingly small, there is
currently no justification for building higher-capacity GSM cells that use more
than a single 200 kHz radio carrier.  And if each GSM cell consists of only one
radio carrier (the BCCH carrier, also called C0 in the specs), then physical
DTXd (as in actually turning off radio Tx, as opposed to "logical" DTXd where
that effect is merely faked for the MS by transmitting dummy bursts or
induced-BFI frames) is simply impossible.  Therefore, in the present state of
the human condition, there is no justification for expending the effort to
implement the additional complexity of proper DTXd support.