view doc/TFO-xform/FRv1 @ 36:d9553c7ac6ea
author | Mychaela Falconia <falcon@freecalypso.org> |
---|---|
date | Tue, 03 Sep 2024 07:08:24 +0000 |
Rx DTX handler situation in FRv1
================================

Before we address the question of how one should implement TFO transform for
FRv1, let's begin with a more basic question: how does the Rx DTX handler (the
"front end" part of the speech decoder in an end-terminal implementation) work
in FRv1?

In both HRv1 and EFR, error-free comfort noise generation functions of this
Rx DTX handler are normative per the specs at bit-exact level, while error
handling functions are specified only as a non-normative example - and the
supplied reference C sources implement the full Rx DTX handler (both the
normative part and the "example" part) as an inseparable part of the speech
decoder. But not so for FRv1: there is no reference C source and there are no
bit-exact definitions for any part of Rx DTX handler logic. All Rx DTX handler
functions are defined only in English prose (no code), and even in the most
normative parts the language used in the specs is quite loose.

Based on what is specified (verbally, loosely) in GSM 06.11 and 06.12, there
are two principal ways in which an Rx-ECU-capable, Rx-DTX-capable FRv1 speech
decoder can be implemented:

Fully modular approach: the basic GSM 06.10 decoder block (which is bit-exact,
but cannot handle BFIs or SID frames) remains absolutely unmodified, while the
Rx DTX handler (which includes both error concealment and CN generation) is
implemented as a modular piece, with an "honest-to-god" 260-bit 06.10 frame
interface between the two blocks.

Non-modular approach: the Rx DTX handler and the 06.10-based speech decoder
are integrated more tightly, and there is no possible stream of "pure" 06.10
codec frames that would produce the same bit-exact PCM output as the actually
implemented "full decoder" with the built-in Rx DTX handler.

Cursory reading of GSM 06.11 and 06.12 specs strongly suggests that they call
for the fully modular approach as defined above.
However, because neither spec includes any bit-exact definitions, there is no
formal stipulation that the modular approach shall be used - it is entirely
conceivable that someone could implement a non-modular approach, and they
would still be spec-compliant.

Why would anyone implement the non-modular approach when the fully modular one
seems much simpler? After all, the bit-exact basic 06.10 decoder already
exists - surely it is easier to build a separate front-end to it than dig into
the guts of that pre-existing box? There is, however, one aspect that could
sway implementors toward the non-modular approach: interpolation of CN
parameter updates during prolonged DTX pauses. GSM 06.12 (or rather its
latest incarnation as 3GPP TS 46.012) says, at the very end of section 6.1:

"When updating the comfort noise, the parameters above should preferably be
interpolated over a few frames to obtain smooth transitions."

This kind of CN parameter interpolation is mandatory in the newer HRv1 and EFR
codecs where the CN generator function is defined in bit-exact terms, hence it
makes sense that some implementors may have chosen to back-port the same
feature to FRv1.

CN parameter interpolation: deeper analysis of the problem
==========================================================

How does this interpolation feature affect the choice of modular or
non-modular design? As a non-expert on the subject of codec design, I am not
able to say authoritatively if it is possible to implement the feature of CN
parameter interpolation (and do it well) while staying with the fully modular
design in which the basic 06.10 decoder block remains absolutely unchanged, or
if high-quality implementation of this feature would require foregoing the
modularity and moving the CN-specific interpolation function somewhere inside
that block, e.g., between the output of GSM 06.10 section 4.2.8 and the input
to section 4.2.9, as referenced from section 4.3.3 for the decoder.

We can, however, look at how ETSI handled this problem in other codecs for
which they did mandate CN parameter interpolation in bit-exact form. HRv1 is
the best point of comparison in this regard because of this detail: the Rx DTX
handler front-end part of the official bit-exact HRv1 decoder (delivered as C
source this time, not just verbiage) is _almost_ modular, i.e., one could
_almost_ detach it into a modular piece whose output could be fed to the
decoder as a new "cleaned up" stream of HRv1 codec frames. Where is the
"almost" part? Answer: interpolation of CN parameters! When HRv1 decoder is
in CN insertion state, it dequantizes R0 and LPC parameters from SID frames
only when initial and update frames come in - but when it generates the actual
CN between those updates, it performs smooth linear interpolation on the
decoded parameters, *without* requantizing them into something that can be
retransmitted as new HRv1 codec frames representing the CN.

Once again, as a non-expert on the subject of codec design, I am not able to
say authoritatively if the same approach that was prescribed by ETSI for HRv1
would also work for FRv1, or if CN parameter interpolation for FRv1 can be
done well by requantizing the interpolated parameters for each individual CN
output frame and feeding them to a strictly unmodified 06.10 decoder block.
It is the case, however, that there is no pre-existing implementation
available to us which we can look at that does CN parameter interpolation for
FRv1 - the TFO transform in Nokia TCSM2 does _not_ interpolate - hence without
a reference to look at, this optional feature is a can of worms which we
should stay away from.
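
To make the two options discussed above more concrete, here is a minimal
Python sketch, not taken from any spec or from libgsmfr2. It assumes a single
scalar CN parameter whose quantizer codes sit on a linear integer scale, which
is a simplification: the real FRv1 parameters have their own quantizers. The
function names and the 4-frame ramp length are hypothetical.
`interpolate_exact` keeps fractional values between updates, in the spirit of
the HRv1 approach, while `interpolate_requantized` rounds every step back to an
integer code that could be placed in a codec frame for an unmodified decoder
block.

```python
import math
from fractions import Fraction


def interpolate_exact(old_code, new_code, n_frames):
    """Linear ramp from old_code to new_code, kept at full precision.

    The intermediate values are generally fractional, so they cannot be
    carried in a codec frame field (assumption: codes are plain integers).
    """
    step = Fraction(new_code - old_code, n_frames)
    return [old_code + step * k for k in range(1, n_frames + 1)]


def interpolate_requantized(old_code, new_code, n_frames):
    """Same ramp, but each step is rounded half-up to an integer code."""
    half = Fraction(1, 2)
    return [math.floor(x + half)
            for x in interpolate_exact(old_code, new_code, n_frames)]


# Hypothetical SID update: parameter code jumps from 0 to 10,
# smoothed over a 4-frame ramp.
print([float(x) for x in interpolate_exact(0, 10, 4)])  # [2.5, 5.0, 7.5, 10.0]
print(interpolate_requantized(0, 10, 4))                # [3, 5, 8, 10]
```

In this toy model the requantized ramp never strays more than half a code step
from the exact one; whether that is good enough for real FRv1 comfort noise is
exactly the open question raised above.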

Front-end part of the speech decoder and TFO transform
======================================================

If the party who implemented the regular end-decoder for FRv1 chose the fully
modular approach, either by disregarding the call for interpolation of CN
parameters (the spec language is "should preferably", rather than "shall") or
by requantizing the interpolated parameters on each CN output frame, then a
corresponding implementation of TFO transform for non-DTXd operation becomes
trivial: the modularized Rx DTX handler front-end can also serve unchanged as
the TFO transform!

This just-described situation holds for the current Themyscira Wireless
implementation of FRv1 codec, named libgsmfr2. (The 2 in the library name
refers to the major version of library API and dependency structure; the codec
it implements is still FRv1.) Specifically:

* The full decoder implementation in libgsmfr2 follows the modular approach:
  the front-end Rx DTX handler preprocessor feeds "cleaned up" FRv1 codec
  frames to an unmodified GSM 06.10 decoder.

* No interpolation is done on CN parameters: as soon as each SID update comes
  in, the new parameters are used immediately for all generated CN frames.

The preprocessor part of libgsmfr2 is thus already suitable to serve as a TFO
transform for FRv1. However, before formally adopting it as such, I have had
a long-standing desire to see how this function was implemented by other
vendors; particularly, how it's been implemented in real historical TRAUs.

Nokia TCSM2 TRAU implementation
===============================

As of 2024-08, we finally have a working bank-of-TRAUs apparatus in our lab:
Nokia TCSM2. This TRAU implements TFO for FRv1, HRv1 and EFR, hence we
finally got the ability to see how this vendor (Nokia) implemented the elusive
TFO transform.
Here are our findings:

Error concealment function
--------------------------

Themyscira implementation is based on the "example solution" of TS 46.011
chapter 6; Nokia's implementation appears to be very similar, with only a few
visible differences:

* When the ECU enters the state of "speech muting" (after the first
  speech-state BFI for which the last good speech frame is simply repeated),
  instead of decrementing each of the 4 Xmaxcr numbers by 4, it decrements
  them by 11, thereby producing noticeably faster muting than what the spec
  calls for.

* The state of emitting fixed silence frames is entered not after the
  algorithmically-muted frame in which the lowest Xmaxcr reached 0 (my reading
  of the "example solution" in the spec), but after the state of algorithmic
  muting (decrementing Xmaxcr's by 11 each time) persisted for exactly 5
  frames. If the original speech frame had its highest Xmaxcr equal to 63,
  the last algorithmically muted frame before fixed silence frames will have 8
  in that Xmaxcr; if all starting Xmaxcr numbers were low, there will be 5
  frames with all zeros in Xmaxcr, random Mcr and other parameters unchanged
  before the switch to fixed silence frames.

Nokia's TFO transform exhibits additional logic whereby the first good speech
frame after prolonged BFIs has its highest Xmaxcr reduced (but not messed with
otherwise); if that good speech frame is again followed by BFIs, the ECU goes
back to silence frame output right away - or at least that's what we saw in
one experiment. This aspect has not been studied in detail.

Comfort noise generation (DTXd=0)
---------------------------------

The comfort noise output from Nokia's TFO transform generally agrees with my
reading of GSM 06.12 spec section 6.1, the section that describes CN
generation. However, the following parts were surprising/unexpected:

1) The TRAU reacts to SID updates with a delay of 24 frames.
   Suppose that frame #20 in the input is the initial SID, frame #24 (TAF
   position) is the first SID update, frame #48 is the next SID update and so
   forth. In the output from Nokia's TFO transform, the updated parameters
   from input frame #24 will appear in output frame #48, those from input
   frame #48 will appear in output frame #72 and so forth. There is no
   sensible explanation for this extraneous buffering delay; at first I
   thought it was an artifact of the CN parameter interpolation mechanism,
   but:

2) No interpolation is done! I deliberately constructed input sequences in
   which each subsequent SID update has wildly different parameters from the
   previous, and when the changeover does happen in the DL output after the
   strange delay of 24 frames, the change is immediate and abrupt.

CN muting after two missed SID updates (BFI received instead of SID in the TAF
position twice in a row) is done the same way as speech muting: the TRAU emits
exactly 5 frames with decreasing Xmaxcr (same decrement by 11), then switches
to emitting fixed silence frames.

SID forwarding (DTXd=1)
-----------------------

When DTXd is enabled on the destination call leg and the input frame stream to
the TFO transform includes SID frames (considering only valid SID for now),
the transform does not generate comfort noise - instead received SID frames
are passed through to call leg B DL, unless they are invalid SID or the muting
mechanism has to kick in because of lost SID updates.

Nokia's implementation does pass valid SID frames through (I haven't tested
invalid SID yet), but it applies the same weird delay of 24 frames to the
switchover point for each update as it does when generating CN for DTXd=0.

However, the part where Nokia's TFO transform (at least for FRv1) is plain
broken is CN muting in the case of lost SID updates.
Here is what it does: it decrements Xmaxcr by 4 (yes, by 4, not by 11) once
every 24 frames (probably in each TAF position), such that if the level of CN
was very high before channel breakdown, it will take up to 7.68 s before this
CN is fully muted at the end receiver. GSM 06.12 section 5.4 says:

"For the second lost SID frame, a muting technique shall be used on the
comfort noise that will gradually decrease the output level, resulting in
silencing of the output after a maximum of 320 ms."

The spec gives a maximum of 320 ms for total muting of CN, but with Nokia's
TFO transform in DTXd=1 case, that maximum time is 7.68 s - spec requirement
violated.

Only TFO, or regular FRv1 decoder too?
--------------------------------------

How does the regular FRv1 speech decoder (the one that ultimately emits G.711)
in Nokia TCSM2 TRAU implementation compare to what we've observed with their
TFO transform? Do they use a modular design where the regular decoder is a
copy of the same TFO transform followed by a standard GSM 06.10 decoder block,
or do they do something fancier?

Unfortunately we have no realistic way to answer this question: Nokia chose to
not implement the optional in-band homing mechanism for FRv1, thus we have no
way to pass test sequences through the TRAU in the decoder direction and see
if the output matches our hypothesis as to decoder logic. Hence the TFO
transform is the only part whose detailed behaviour we can realistically study
in this TRAU.

Take-away for Themyscira implementation
=======================================

My take-away points from the preceding examination of FRv1 TFO transform in
Nokia TCSM2 are:

* Our current Rx DTX handler front-end in libgsmfr2 is fine - Nokia's
  implementation is not any fancier at least in the case of TFO.

* Modularity is a good thing, and so is consistency.
  There is nothing wrong with using the same Rx DTX handler block both as our
  TFO transform and as the front-end portion of the full decoder in end
  terminal operation.
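
The muting figures quoted in the Nokia TCSM2 findings above can be re-derived
with a few lines of arithmetic. The following Python sketch is only a checking
aid: the helper names are hypothetical, and it assumes a 6-bit Xmaxc field
(maximum 63), 20 ms per speech frame, one decrement per frame for speech-style
muting, and one decrement per 24 frames for the DTXd=1 case, as described in
the text.

```python
FRAME_MS = 20        # duration of one GSM speech frame in milliseconds
XMAXC_MAX = 63       # assumed maximum of the 6-bit Xmaxc field
SID_PERIOD = 24      # frames between TAF positions, per the text above


def muted_levels(start, decrement, n_frames):
    """Xmaxc value in each of n_frames muted frames, clamped at zero."""
    return [max(start - decrement * k, 0) for k in range(1, n_frames + 1)]


def frames_until_zero(start, decrement):
    """Number of decrement steps needed to bring start down to zero."""
    return -(-start // decrement)   # ceiling division on integers


# Nokia speech muting: decrement by 11 for exactly 5 frames.
print(muted_levels(XMAXC_MAX, 11, 5))    # [52, 41, 30, 19, 8]
print(muted_levels(10, 11, 5))           # [0, 0, 0, 0, 0]

# Decrement by 4 on every frame, starting from the maximum level.
steps = frames_until_zero(XMAXC_MAX, 4)
print(steps, steps * FRAME_MS)           # 16 320

# Nokia DTXd=1 CN muting: same decrement of 4, but once per 24 frames.
print(steps * SID_PERIOD * FRAME_MS / 1000)   # 7.68
```

With a decrement of 4 per frame, a starting level of 63 reaches zero in 16
frames, i.e. 320 ms, which is the same figure as the limit quoted from GSM
06.12 section 5.4; stretching each step to 24 frames gives the 7.68 s worst
case stated above.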