view doc/TFO-xform/EFR @ 56:b32b644b7d96

d144/nokia-tcsm2-atrau.bin: captured A-TRAU output from Nokia TCSM2, fed with ul-input from Ater
author Mychaela Falconia <falcon@freecalypso.org>
date Wed, 25 Sep 2024 07:42:04 +0000
parents 4ab7cc414ed2

TFO transform for EFR
=====================

Unlike the situation with FRv1 and HRv1, the standard endpoint decoder for EFR
provides no help for implementing a TFO transform.  The reference EFR decoder
source from ETSI includes bad frame handling and Rx DTX functions, but the logic
that implements these functions is interwoven throughout the body of the decoder
and does not form a separable front-end.  Most saliently, this Rx DTX and ECU
logic in the reference decoder does not operate on coded parameters as would be
needed for a TFO transform; instead it operates on linear values deeper in the
decoder after parameter dequantization.
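
To make "coded parameters" concrete, the following is a minimal Python sketch
of the fields a TFO transform for EFR would have to read and rewrite.  The bit
widths are assumed from the GSM 06.60 bit allocation and should be checked
against the spec; the class and field names are hypothetical and do not come
from libgsmefr or from the ETSI source.

```python
from dataclasses import dataclass
from typing import List

# Assumed bit widths per GSM 06.60 (verify against the spec before relying on them)
LPC_BITS = (7, 8, 9, 8, 6)   # 5 LPC (LSF) quantizer indices
LAG_BITS = (9, 6, 9, 6)      # LTP lag: absolute in subframes 1/3, relative in 2/4
LTP_GAIN_BITS = 4
FCB_BITS = 35                # fixed codebook pulse positions and signs
FCB_GAIN_BITS = 5


@dataclass
class Subframe:              # hypothetical container, one per 5 ms subframe
    ltp_lag: int
    ltp_gain: int
    fcb_pulses: int          # the 35-bit fixed codebook part as one integer
    fcb_gain: int


@dataclass
class EfrParams:             # hypothetical container for one 20 ms frame
    lpc: List[int]
    sub: List[Subframe]


def frame_bits() -> int:
    """Total number of coded bits implied by the widths above."""
    per_frame_sub = sum(LAG_BITS) + 4 * (LTP_GAIN_BITS + FCB_BITS + FCB_GAIN_BITS)
    return sum(LPC_BITS) + per_frame_sub
```

A transform working at this level never touches PCM: it takes one EfrParams
in and emits one EfrParams out for every frame position.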

Given that Abis is a de facto proprietary interface that is not interoperable
between different vendors (and the same holds for Ater in those BSS designs
that separate the TRAU from the BSC), and given how daunting it seems to
implement a true TFO transform for EFR, prior to getting our Nokia TCSM2 lab
setup I was wondering if historical TRAU vendors really did implement this
TFO transform, or if perhaps they used some kind of "cheating" trick on their
Abis similar to what we did in OsmoBTS in mid-2023.  However, once I got our
Nokia TCSM2 gear working, set up a TFO connection between two active TRAU
channels in EFR mode and passed some test sequences through it, it became clear
that Nokia did implement a real "honest-to-god" TFO transform for EFR: the
TRAU-DL frame stream is 100% valid "speech" frames (no idle frames or other
aberrations inserted) even when the TRAU-UL stream fed via TFO contains BFI
speech frames and DTXu pauses - the TRAU really does apply bad frame handling
and comfort noise insertion at the parameter level.

Seeing that at least one major historical vendor did implement TFO transform
for EFR, and seeing the output from that transform, has set up a sportive
challenge for me: I no longer have a valid excuse to not do it.  I now have a
desire to produce a FOSS implementation of TFO transform for EFR in Themyscira
libraries (probably in libgsmefr), and make it no worse than Nokia's
implementation in TCSM2.

Bad frame handling in speech mode
=================================

Looking at the DL speech frames that were synthesized by the TRAU in those
frame positions where the incoming UL stream via TFO had BFIs, we can make the
following observations:

* The 5 LPC parameters are different in each generated substitution/muting
  frame, hence it looks like the TFO transform is running the quantization
  algorithm for each output frame to produce LPC parameters that aim for the
  substitution/muting LSFs of the official "example solution".

  If the series of BFI inputs continues for a while, the emitted LPC parameters
  settle into an oscillating pattern that alternates between two sets of
  numbers.

* LTP lag parameters remain constant for each run of BFIs between good speech
  frames; the lag value encoded therein matches the LTP lag (integer part only)
  from the 4th subframe of the last good speech frame, just like in the official
  endpoint decoder.

* Surprising bit: the 4 LTP gain values from the last good speech frame are
  endlessly regurgitated verbatim in each substitution/muting frame, without
  any signs of the attenuation I expected to see based on the official "example
  solution".

* Another surprising bit: the 35-bit fixed codebook sequence in each subframe
  is taken from the corresponding subframe of the last good speech frame,
  contrary to the official "example solution" that takes these bits from the
  errored frames.

* The four fixed codebook gain parameters in the emitted substitution/muting
  frames differ from one frame to the next in the case of multiple BFI frames
  in a row, and they also differ between subframes in the same frame - hence
  these parameters are clearly being regenerated as output progresses.  However,
  the quantization algorithm for this parameter is so complex that I haven't
  been able to make a more intelligent analysis yet.

  If the series of BFI inputs continues for a while, the emitted fixed codebook
  gain parameters slowly go down and eventually become all zeros - although the
  exact meaning is still unclear given the highly non-intuitive quantization
  algorithm.
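
The directly observable part of this behaviour (LTP lag, LTP gain and fixed
codebook bits) can be sketched as follows.  This is only an illustration under
stated assumptions, not Nokia's algorithm: the lag index arithmetic is my
reading of the lag coding in the reference codec and should be verified against
the ETSI source, the dictionary layout and function names are hypothetical,
clamping the lag to 18..143 is an added safety assumption, and the regenerated
LPC and fixed codebook gain parameters are left out because their derivation is
not yet understood.

```python
PIT_MIN, PIT_MAX = 18, 143   # assumed integer lag range


def dec_lag_abs(index):
    """Decode a 9-bit lag index (subframes 1 and 3) into (integer lag, 1/6 fraction)."""
    if index < 463:
        t0 = (index + 5) // 6 + 17
        return t0, index - t0 * 6 + 105
    return index - 368, 0


def rel_window_min(prev_t0):
    """Lower bound of the 10-lag search window used for subframes 2 and 4."""
    t0_min = max(prev_t0 - 5, PIT_MIN)
    if t0_min + 9 > PIT_MAX:
        t0_min = PIT_MAX - 9
    return t0_min


def dec_lag_rel(index, prev_t0):
    """Decode a 6-bit lag index (subframes 2 and 4) relative to the previous lag."""
    i = (index + 5) // 6 - 1
    return i + rel_window_min(prev_t0), index - 3 - 6 * i


def enc_lag_abs(t0, frac=0):
    return t0 * 6 - 105 + frac if t0 <= 94 else t0 + 368


def enc_lag_rel(t0, prev_t0, frac=0):
    return (t0 - rel_window_min(prev_t0)) * 6 + 3 + frac


def substitute_frame(last_good):
    """Build the observable fields of one substitution/muting frame.

    last_good is a dict with 4-element lists "ltp_lag", "ltp_gain" and
    "fcb_pulses" holding the coded parameters of the last good speech frame.
    """
    t3, _ = dec_lag_abs(last_good["ltp_lag"][2])
    t4, _ = dec_lag_rel(last_good["ltp_lag"][3], t3)
    t = min(max(t4, PIT_MIN), PIT_MAX)   # integer part only, fraction dropped
    return {
        "ltp_lag": [enc_lag_abs(t), enc_lag_rel(t, t),
                    enc_lag_abs(t), enc_lag_rel(t, t)],
        "ltp_gain": list(last_good["ltp_gain"]),      # repeated verbatim
        "fcb_pulses": list(last_good["fcb_pulses"]),  # taken from last good frame
    }
```

Calling substitute_frame() once per BFI with the same last_good argument
reproduces the constant lag, gain and codebook fields seen across a run of BFIs.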

Looking at the first good speech frame that follows each BFI substitution/muting
insert, we see that it is mostly unaltered: no alterations were seen to LPC or
LTP parameters, in particular.  However, in the case of the fixed codebook gain
parameter we see a different behavioral pattern: most of the time it is also
unaltered, but sometimes we see a reduction in this parameter, and even then
it is only in certain subframes.  Are we perhaps seeing a capping of the fixed
codebook gain in the first good frame following BFI, similar to that implemented
in the reference endpoint decoder?  A better understanding of the quantization
mechanism for this parameter will be needed.

CN insertion by TFO transform
=============================

Looking at the DL speech frames that were synthesized by the TRAU in those
frame positions where the incoming UL stream via TFO had DTXu pauses (valid SID
frames followed by BFIs), we can make the following observations:

* The 5 LPC parameters appear to be generated anew on each output frame just
  like in the substitution/muting case, and it likewise appears that the TFO
  transform is running the regular LSF quantization algorithm taken from the
  encoder.

* The 4 LTP lag parameters are set to {135, 33, 135, 33} in each generated CN
  frame, in agreement with how the official endpoint decoder sets the pitch
  delay to a constant value of 40.

* The 4 LTP gain parameters are all set to 0, also in agreement with CN
  generation in the official endpoint decoder.

* The 35-bit fixed codebook part of each subframe appears to be set to a
  pseudorandom sequence, different in each emitted frame and subframe.  My
  analysis tells me it should be possible to construct fixed codebook sequences
  in "speech" output frames that would produce the same excitation as the
  official bit-exact CN - although the final PCM output probably won't match
  the official bit-exact CN because of LSF and fixed codebook gain
  requantization.  However, we won't know whether or not the output from
  Nokia's TFO transform matches our idea of official-CN-matching fixed codebook
  excitation until we have our own implementation of this idea and compare
  the two.

* The four fixed codebook gain parameters in the emitted CN frames are once
  again too difficult to understand for now - but they are definitely being
  computed anew for each emitted CN frame and subframe.
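
The link between pitch delay 40 and the lag codes {135, 33, 135, 33} can be
checked with a short sketch.  The index formulas are my reading of the lag
coding in the reference encoder and should be verified against the ETSI source;
the function names are hypothetical.  The fixed codebook bits below come from a
generic PRNG purely as a placeholder: this is not the official CN excitation
generator, and the LPC and fixed codebook gain parameters are omitted because
their derivation is not yet understood.

```python
import random

PIT_MIN, PIT_MAX = 18, 143   # assumed integer lag range


def enc_lag_abs(t0, frac=0):
    """9-bit lag index for subframes 1 and 3 (1/6 resolution up to lag 94)."""
    if t0 <= 94:
        return t0 * 6 - 105 + frac
    return t0 + 368


def enc_lag_rel(t0, prev_t0, frac=0):
    """6-bit lag index for subframes 2 and 4, relative to the previous lag."""
    t0_min = max(prev_t0 - 5, PIT_MIN)
    if t0_min + 9 > PIT_MAX:
        t0_min = PIT_MAX - 9
    return (t0 - t0_min) * 6 + 3 + frac


def cn_frame_skeleton(rng, pitch_delay=40):
    """Observable fields of one CN frame re-expressed as a "speech" frame."""
    lag_abs = enc_lag_abs(pitch_delay)
    lag_rel = enc_lag_rel(pitch_delay, pitch_delay)
    return {
        "ltp_lag": [lag_abs, lag_rel, lag_abs, lag_rel],
        "ltp_gain": [0, 0, 0, 0],
        # Placeholder pseudorandom bits, different per frame and subframe
        "fcb_pulses": [rng.getrandbits(35) for _ in range(4)],
    }
```

With pitch_delay=40 the absolute code is 40 * 6 - 105 = 135 and the relative
code is (40 - 35) * 6 + 3 = 33, matching the values seen in the captured frames.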

If CN muting kicks in on the second lost SID (BFI instead of SID received in
TAF position), we see the following additional behaviour:

* On the TAF-position frame that initiates CN muting, the emitted LPC parameters
  break out of the alternating pattern they previously settled into.  They go
  through a few unique number sets, then settle into a two-state oscillating
  pattern once again.  Is the TFO transform perhaps making a switch from
  last-SID LSF numbers to the static "mean" ones when it goes into CN muting?

* The emitted fixed codebook gain parameters start going down and eventually
  become all zeros.
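
The frame-by-frame mode selection implied by these observations can be sketched
as a small state machine.  This is a simplified model under stated assumptions:
the input is reduced to three frame kinds plus the TAF flag, invalid SID frames
are ignored, the fix-ups applied to the first good speech frame after an insert
are not modelled, and all names are hypothetical.

```python
def classify(frames):
    """Map input frames to output actions.

    frames is an iterable of (kind, taf) pairs, with kind one of
    "speech", "sid" or "bfi".  Returns a list with one action per frame:
    "pass" (forward speech), "subst" (BFI substitution/muting),
    "cn" (comfort noise) or "cn_mute" (comfort noise muting).
    """
    actions = []
    in_dtx = False    # True once a valid SID has started a DTX pause
    lost_sids = 0     # BFIs seen in TAF position since the last valid SID
    for kind, taf in frames:
        if kind == "speech":
            in_dtx, lost_sids = False, 0
            actions.append("pass")
        elif kind == "sid":
            in_dtx, lost_sids = True, 0
            actions.append("cn")
        elif not in_dtx:
            actions.append("subst")
        else:
            if taf:
                lost_sids += 1
            # Muting starts on the second lost SID, per the observation above
            actions.append("cn_mute" if lost_sids >= 2 else "cn")
    return actions
```

In this model a BFI outside a DTX pause always leads to substitution, while a
BFI inside a pause keeps CN running until the second TAF position passes
without a SID.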

Looking at the first good speech frame that follows each CN insertion period,
we see only two alterations made by the TFO transform: the 5 LPC parameters and
the first subframe fixed codebook gain parameter are modified, presumably to
compensate for the lack of quantizer state reset that happens when the end
decoder has seen a CN insert.  No more speech parameter alterations are seen
past the first subframe of the first frame following the DTXu pause.