# HG changeset patch
# User Mychaela Falconia
# Date 1725262329 0
# Node ID 0979407719f0ab7332058fa5883fd7a129d8737a
# Parent 35d38348c88094e57c5ea84b6b4f83134412d82f
doc/TFO-xform/HRv1: article written

diff -r 35d38348c880 -r 0979407719f0 doc/TFO-xform/HRv1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/TFO-xform/HRv1 Mon Sep 02 07:32:09 2024 +0000
@@ -0,0 +1,147 @@
+HRv1: relation between regular end decoder and TFO transform
+============================================================
+
+The reference decoder source published by ETSI in GSM 06.06 exhibits an almost
+modular design: the Rx DTX handler front-end is almost a separable piece.
+Breaking it down more precisely, we can make these observations:
+
+0) Most aspects of bad frame handling and comfort noise generation are done by
+   generating new coded speech parameters, such that the output of those
+   algorithms can be packaged into new HRv1 codec frames to be sent to a distant
+   decoder. There are only two exceptions to this modularity:
+
+1) Handling of unreliable speech frames (BFI=0 UFI=1 in speech rather than CN
+   state) has a modular and a non-modular aspect:
+
+   1a) Modular aspect: if the R0 increment from the last good frame to the
+       unreliable frame exceeds a certain threshold, UFI is turned into BFI,
+       which is then handled in a fully modular fashion.
+
+   1b) Non-modular aspect: if the R0 increment does not meet the threshold for
+       turning UFI into BFI but meets another slightly lower threshold, a flag
+       is set that is passed into the guts of the speech decoder. That flag
+       effects speech muting on the decoder output level.
+
+2) GSM 06.22 section 6.2 (Comfort noise generation and updating) says in the
+   very last sentence:
+
+   "When updating the comfort noise parameters (frame energy and LPC
+   coefficients), these parameters shall be interpolated over the SID update
+   period to obtain smooth transitions."
+
+   Note the change in language: the corresponding spec for FRv1 says "should
+   preferably", but the HRv1 spec says "shall". Furthermore, the bit-exact
+   implementation in the reference C code is considered normative in this
+   aspect, and is exercised by the test sequences of GSM 06.07.
+
+   This CN interpolation aspect is non-modular: R0 and the set of LPC
+   coefficients are decoded from bit parameters into linear form when CN frames
+   (initial and updates) are received, interpolation is done on this linear
+   form, and the interpolated values are passed to the main body of the speech
+   decoder.
+
+Based on these observations, we can conclude that if we wish to detach this
+reference Rx DTX handler for HRv1 from the reference decoder and make it into
+an implementation of TFO transform for this codec, we have to solve two
+problems:
+
+1) Decide how to handle those UFI frames that aren't being turned into BFI;
+
+2) Decide how to handle R0 and LPC parameters during CN insertion.
+
+Nokia TCSM2 TRAU implementation
+===============================
+
+Now that we have a working historical bank-of-TRAUs apparatus in our lab, let's
+take a look at how this vendor (Nokia) implemented the TFO transform for HRv1
+in their TRAU. Here are our findings:
+
+* Handling of BFI=1 frames in speech state (not in DTX) exhibits a
+  simplification relative to GSM 06.06 reference code. The reference code
+  checks to see if the last saved frame and the received errored frame have the
+  same voiced vs unvoiced mode: if this mode matches, codevector parameters are
+  taken from the errored frame, otherwise the last saved frame is regurgitated
+  without taking any bits from the errored frame. Nokia's TFO transform always
+  does the latter (no bits are taken from the errored frame) irrespective of
+  voiced vs unvoiced mode matching or not.
+
+* Aside from this just-described simplification, all other aspects of BFI=1
+  handling for speech frames appear to match the reference code.
+
+* UFI handling appears to have been taken out altogether: even the part that
+  "upgrades" UFI to BFI when the R0 increment is huge appears to have been
+  omitted. I fed a test sequence from the TFO side that has a good speech
+  frame with R0=2 followed by a UFI frame with R0=31, and the TRAU happily
+  passed the latter frame (now treated as perfectly good) to the DL output.
+
+* Comfort noise generation (DTXd=0) is done exactly as the reference code would
+  do it, except that neither R0 nor LPC parameters are interpolated. During
+  each CN output interval between SID updates, R0 and LPC parameters in every
+  emitted CN frame are exactly equal to those received in the most recent SID
+  frame, as simple as that. When a new SID update comes in, the change in
+  emitted R0 and LPC is abrupt.
+
+* The lost SID criterion for CN muting appears to be slightly different between
+  Nokia's TFO implementation and my reading of the spec and the reference C
+  code. My interpretation of GSM 06.22 spec sections 5.2.3 and 5.2.4 is that
+  unlike FR and EFR, in the case of HR codec the second lost SID (second
+  occurrence of BFI instead of SID update in TAF position) does _not_ trigger
+  CN muting; instead this muting is supposed to kick in on the _third_ lost SID
+  occurrence. (The difference in the spec was likely motivated by TAF positions
+  occurring every 240 ms with HR instead of every 480 ms with FR & EFR.) My
+  reading of the reference C code agrees with my reading of the spec - yet
+  Nokia's TFO implementation initiates CN muting in the frame following the
+  second lost SID, not the third.
+
+* Aside from the criterion for its initiation, the actual CN muting logic
+  behaves exactly like the reference C code: R0 is decremented by 2 on each
+  output frame following the TAF that initiates this sequence, and once R0
+  reaches 0, it stays there while this zero-magnitude CN output continues
+  indefinitely.
+
+* With DTXd=1 CN output is replaced with repeated retransmission of the same
+  SID whose parameters would have been used for non-interpolated CN with DTXd=0,
+  which also agrees with the rules of GSM 08.62 section 8.2.2 paragraph 2.
+
+* CN muting with DTXd=1 is implemented poorly. The TRAU emits SID frames with
+  R0 decrementing by 2 on each frame just like how it does for generated CN
+  output that's in the process of being slowly muted, but this design is a poor
+  choice: because the BTS will only transmit one of every 12 SID update frames
+  and the TRAU has no way of knowing which SID will be transmitted, slow
+  decrement cadence on SID frames themselves (not on CN output) makes no sense.
+
+Thoughts for Themyscira implementation
+======================================
+
+Prior to getting Nokia TCSM2 working in our lab and being able to experiment
+with this TRAU, when I was contemplating the idea of potentially implementing
+TFO transform for HRv1 in Themyscira libraries, my main trepidation was how to
+produce comfort noise in the form of "speech" parameter output. For endpoint
+decoders GSM 06.22 prescribes a bit-exact algorithm with interpolation, but
+that smoothly interpolated CN cannot be readily expressed in terms of parameter
+bits that can be packed into a new HRv1 codec frame. I thought about
+requantizing the interpolated LPC reflection coefficients on every CN output
+frame, using the same computationally intensive vector quantization algorithm
+as in speech encoding - but because I am not an expert in codec design, it is
+not obvious to me whether or not such an approach would produce good results.
+
+However, seeing that Nokia got away with simply passing R0 and LPC parameters
+along from incoming SID frames to CN output without any interpolation or other
+transformation gives us a huge confidence boost - if Nokia did it, so can we!
+This approach is of course simple, and lends itself readily to elegant
+implementation.
+
+Seeing that Nokia got away with effectively discarding UFI in their TFO
+transform is also a confidence boost - once again if Nokia did it, so can we.
+I plan on keeping the logic that "upgrades" UFI to BFI under certain conditions
+(not sure why Nokia omitted it), but the effect of potentially muting speech in
+the guts of the decoder (past parameter-level manipulation) is not really
+feasible to implement in a TFO transform.
+
+Finally, regarding the logic that takes codevector parameters from errored
+(BFI) frames when the voicing mode matches between the last saved frame and the
+errored frame, the logic that exists in the reference C code but not in Nokia's
+TFO transform: I plan on keeping this logic in our version, but Nokia's approach
+will come in handy for handling BFI-no-data frames, a condition that does not
+exist in TDM-based Abis transport or in TFO, but does unfortunately exist in
+IP-based GSM RAN.
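
The non-interpolated CN output and the muting cadence described in the article
can be sketched at parameter level as follows. This is only a minimal
illustration under stated assumptions, not code from the GSM 06.06 reference
source or from any Themyscira library: the struct, the function names and the
three-index LPC layout are all hypothetical, and the lost-SID threshold is left
as a caller-supplied argument because the article reports two different values
(third lost SID per the author's reading of the spec, second in Nokia's TRAU).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for parameter-level CN output (names are assumptions). */
struct cn_state {
    uint8_t  r0;        /* R0 value to place in the next emitted CN frame */
    uint16_t lpc[3];    /* LPC indices copied verbatim from the last SID
                           (three-index layout is an assumption) */
    unsigned lost_sid;  /* consecutive TAF positions with BFI instead of SID */
    bool     muting;    /* true once the lost-SID criterion has been met */
};

/* A valid SID (initial or update) arrived: take its parameters as they are,
 * with no interpolation, and cancel any muting in progress. */
void cn_sid_update(struct cn_state *st, uint8_t r0, const uint16_t lpc[3])
{
    st->r0 = r0;
    for (int i = 0; i < 3; i++)
        st->lpc[i] = lpc[i];
    st->lost_sid = 0;
    st->muting = false;
}

/* A TAF position carried BFI instead of a SID update.  Call this after the
 * frame for the TAF position itself has been emitted, so that muting starts
 * on the following frame.  mute_after is the number of lost SIDs that
 * triggers muting (caller's choice, e.g. 2 or 3). */
void cn_lost_sid(struct cn_state *st, unsigned mute_after)
{
    st->lost_sid++;
    if (st->lost_sid >= mute_after)
        st->muting = true;
}

/* Return the R0 for the next emitted CN frame.  Outside of muting this is
 * simply the last received SID value; during muting R0 drops by 2 per frame
 * and then stays at 0. */
uint8_t cn_next_r0(struct cn_state *st)
{
    if (st->muting)
        st->r0 = (st->r0 > 2) ? (uint8_t)(st->r0 - 2) : 0;
    return st->r0;
}
```

In this sketch the LPC indices are never modified between SID updates, which
matches the abrupt-change behaviour observed on the Nokia TRAU; only R0 is
touched, and only while muting is active.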