comparison doc/TFO-xform/FRv1 @ 34:35d38348c880

doc/TFO-xform/FRv1: article written
author Mychaela Falconia <falcon@freecalypso.org>
date Sun, 01 Sep 2024 06:28:35 +0000

Rx DTX handler situation in FRv1
================================

Before we address the question of how one should implement TFO transform for
FRv1, let's begin with a more basic question: how does the Rx DTX handler (the
"front end" part of the speech decoder in an end-terminal implementation) work
in FRv1? In both HRv1 and EFR, error-free comfort noise generation functions
of this Rx DTX handler are normative per the specs at bit-exact level, while
error handling functions are specified only as a non-normative example - and
the supplied reference C sources implement the full Rx DTX handler (both the
normative part and the "example" part) as an inseparable part of the speech
decoder. But not so for FRv1: there is no reference C source and there are no
bit-exact definitions for any part of Rx DTX handler logic. All Rx DTX handler
functions are defined only in English prose (no code), and even in the most
normative parts the language used in the specs is quite loose.

Based on what is specified (verbally, loosely) in GSM 06.11 and 06.12, there
are two principal ways in which an Rx-ECU-capable, Rx-DTX-capable FRv1 speech
decoder can be implemented:

Fully modular approach: the basic GSM 06.10 decoder block (which is bit-exact,
but cannot handle BFIs or SID frames) remains absolutely unmodified, while the
Rx DTX handler (which includes both error concealment and CN generation) is
implemented as a modular piece, with an "honest-to-god" 260-bit 06.10 frame
interface between the two blocks.

Non-modular approach: the Rx DTX handler and the 06.10-based speech decoder are
integrated more tightly, and there is no possible stream of "pure" 06.10 codec
frames that would produce the same bit-exact PCM output as the actually
implemented "full decoder" with the built-in Rx DTX handler.

A cursory reading of the GSM 06.11 and 06.12 specs strongly suggests that they
call for the fully modular approach as defined above. However, because neither
spec includes any bit-exact definitions, there is no formal stipulation that the
modular approach shall be used - it is entirely conceivable that someone could
implement a non-modular approach, and they would still be spec-compliant.

Why would anyone implement the non-modular approach when the fully modular one
seems much simpler? After all, the bit-exact basic 06.10 decoder already
exists - surely it is easier to build a separate front-end to it than to dig
into the guts of that pre-existing box? There is, however, one aspect that
could sway implementors toward the non-modular approach: interpolation of CN
parameter updates during prolonged DTX pauses. GSM 06.12 (or rather its latest
incarnation as 3GPP TS 46.012) says, at the very end of section 6.1:

"When updating the comfort noise, the parameters above should preferably be
interpolated over a few frames to obtain smooth transitions."

This kind of CN parameter interpolation is mandatory in the newer HRv1 and EFR
codecs where the CN generator function is defined in bit-exact terms, hence it
makes sense that some implementors may have chosen to back-port the same feature
to FRv1.

CN parameter interpolation: deeper analysis of the problem
==========================================================

How does this interpolation feature affect the choice of modular or non-modular
design? As a non-expert on the subject of codec design, I am not able to say
authoritatively if it is possible to implement the feature of CN parameter
interpolation (and do it well) while staying with the fully modular design in
which the basic 06.10 decoder block remains absolutely unchanged, or if a
high-quality implementation of this feature would require forgoing the
modularity and moving the CN-specific interpolation function somewhere inside
that block, e.g., between the output of GSM 06.10 section 4.2.8 and the input
to section 4.2.9, as referenced from section 4.3.3 for the decoder.

We can, however, look at how ETSI handled this problem in other codecs for
which they did mandate CN parameter interpolation in bit-exact form. HRv1 is
the best point of comparison in this regard because of this detail: the Rx DTX
handler front-end part of the official bit-exact HRv1 decoder (delivered as C
source this time, not just verbiage) is _almost_ modular, i.e., one could
_almost_ detach it into a modular piece whose output could be fed to the
decoder as a new "cleaned up" stream of HRv1 codec frames. Where is the
"almost" part? Answer: interpolation of CN parameters! When the HRv1 decoder
is in CN insertion state, it dequantizes R0 and LPC parameters from SID frames
only when initial and update frames come in - but when it generates the actual
CN between those updates, it performs smooth linear interpolation on the decoded
parameters, *without* requantizing them into something that can be retransmitted
as new HRv1 codec frames representing the CN.
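
To make the distinction concrete, here is a minimal Python sketch that
contrasts an abrupt parameter switch with linear interpolation of one decoded
CN parameter. It is not derived from the HRv1 reference C code: the function
names, the 4-frame ramp length and the sample values are all assumptions made
up for illustration, and the real codecs define their own ramp lengths and
parameter domains.

```python
def cn_param_abrupt(old, new, frames_since_update, ramp_frames):
    """No interpolation: the new value applies from the first frame on."""
    return new


def cn_param_interpolated(old, new, frames_since_update, ramp_frames):
    """Linear ramp from old to new over ramp_frames frames, then hold new."""
    if frames_since_update >= ramp_frames:
        return new
    return old + (new - old) * frames_since_update / ramp_frames


# Hypothetical decoded parameter jumping from 10.0 to 20.0 at a SID update,
# with an assumed ramp of 4 frames.
ramp = [cn_param_interpolated(10.0, 20.0, n, 4) for n in range(6)]
print(ramp)  # [10.0, 12.5, 15.0, 17.5, 20.0, 20.0]
```

Intermediate values such as 12.5 will generally fall between the levels that
a parameter quantizer can represent, which is why a decoder that interpolates
in the decoded domain cannot simply re-emit its per-frame CN parameters as
codec frames without a requantization step.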

Once again, as a non-expert on the subject of codec design, I am not able to say
authoritatively if the same approach that was prescribed by ETSI for HRv1 would
also work for FRv1, or if CN parameter interpolation for FRv1 can be done well
by requantizing the interpolated parameters for each individual CN output frame
and feeding them to a strictly unmodified 06.10 decoder block. What is certain,
however, is that no pre-existing implementation of CN parameter interpolation
for FRv1 is available for us to examine - the TFO transform in Nokia TCSM2 does
_not_ interpolate - hence without a reference to look at, this optional feature
is a can of worms which we should stay away from.

Front-end part of the speech decoder and TFO transform
======================================================

If the party who implemented the regular end-decoder for FRv1 chose the fully
modular approach, either by disregarding the call for interpolation of CN
parameters (the spec language is "should preferably", rather than "shall") or
by requantizing the interpolated parameters on each CN output frame, then a
corresponding implementation of TFO transform for non-DTXd operation becomes
trivial: the modularized Rx DTX handler front-end can also serve unchanged as
the TFO transform!
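
The structural point can be sketched in a few lines of Python. Everything
here is a toy stand-in rather than real codec logic: ToyRxFrontEnd,
tfo_transform and full_decoder are hypothetical names, the only "error
concealment" modeled is repetition of the last good frame, None is used to
mark a BFI, and core_decode represents an unmodified GSM 06.10 decoder that
the caller passes in.

```python
class ToyRxFrontEnd:
    """Hypothetical stand-in for a modular Rx DTX handler front-end."""

    def __init__(self, silence_frame):
        # Frame to emit if a BFI arrives before any good frame was seen.
        self.last_good = silence_frame

    def process(self, frame):
        """Take one received frame (None = BFI), return a clean codec frame."""
        if frame is not None:
            self.last_good = frame
        return self.last_good


def tfo_transform(rx_frames, silence_frame):
    """The front-end alone: codec frames in, 'cleaned up' codec frames out."""
    front_end = ToyRxFrontEnd(silence_frame)
    return [front_end.process(f) for f in rx_frames]


def full_decoder(rx_frames, silence_frame, core_decode):
    """Modular full decoder: the same front-end, then the unmodified core."""
    return [core_decode(f) for f in tfo_transform(rx_frames, silence_frame)]


rx = [b"a", None, b"b", None]
print(tfo_transform(rx, b"s"))
```

In this arrangement the full decoder is by construction the TFO transform
followed by the core decoder, which is exactly the property that a non-modular
design gives up.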

This just-described situation holds for the current Themyscira Wireless
implementation of the FRv1 codec, named libgsmfr2. (The 2 in the library name
refers to the major version of the library API and dependency structure; the
codec it implements is still FRv1.) Specifically:

* The full decoder implementation in libgsmfr2 follows the modular approach:
  the front-end Rx DTX handler preprocessor feeds "cleaned up" FRv1 codec frames
  to an unmodified GSM 06.10 decoder.

* No interpolation is done on CN parameters: as soon as each SID update comes
  in, the new parameters are used immediately for all generated CN frames.

The preprocessor part of libgsmfr2 is thus already suitable to serve as a TFO
transform for FRv1. However, before formally adopting it as such, I have had a
long-standing desire to see how this function was implemented by other vendors;
particularly, how it's been implemented in real historical TRAUs.

Nokia TCSM2 TRAU implementation
===============================

As of 2024-08, we finally have a working bank-of-TRAUs apparatus in our lab:
Nokia TCSM2. This TRAU implements TFO for FRv1, HRv1 and EFR, hence we finally
got the ability to see how this vendor (Nokia) implemented the elusive TFO
transform.

Here are our findings:

Error concealment function
--------------------------

The Themyscira implementation is based on the "example solution" of TS 46.011
chapter 6; Nokia's implementation appears to be very similar, with only a few
visible differences:

* When the ECU enters the state of "speech muting" (after the first speech-state
  BFI for which the last good speech frame is simply repeated), instead of
  decrementing each of the 4 Xmaxcr numbers by 4, it decrements them by 11,
  thereby producing noticeably faster muting than what the spec calls for.

* The state of emitting fixed silence frames is entered not after the
  algorithmically-muted frame in which the lowest Xmaxcr reached 0 (my reading
  of the "example solution" in the spec), but after the state of algorithmic
  muting (decrementing Xmaxcr's by 11 each time) has persisted for exactly
  5 frames. If the original speech frame had its highest Xmaxcr equal to 63,
  the last algorithmically muted frame before fixed silence frames will have 8
  in that Xmaxcr; if all starting Xmaxcr numbers were low, there will be
  5 frames with all zeros in Xmaxcr, random Mcr and other parameters unchanged
  before the switch to fixed silence frames.
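
The arithmetic behind the second bullet can be checked with a short Python
sketch. The function name is hypothetical, and clamping at zero is an
assumption consistent with the all-zeros case described above; only the four
Xmaxcr values are modeled, not the Mcr randomization or any other frame
parameters.

```python
def muting_sequence(xmaxcr, decrement, n_frames):
    """Return the Xmaxcr sets of n_frames successive muted frames.

    xmaxcr:    the 4 Xmaxcr values of the last good speech frame (0..63)
    decrement: amount subtracted from each Xmaxcr per muted frame
    """
    frames = []
    current = list(xmaxcr)
    for _ in range(n_frames):
        # Assumed behaviour: values are clamped at 0, never negative.
        current = [max(x - decrement, 0) for x in current]
        frames.append(current)
    return frames


# Observed Nokia behaviour: decrement of 11, exactly 5 muted frames.
for frame in muting_sequence([63, 40, 20, 5], 11, 5):
    print(frame)
```

With a starting value of 63 the highest Xmaxcr steps through 52, 41, 30, 19
and 8, which agrees with the figure of 8 quoted above for the last
algorithmically muted frame.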

Nokia's TFO transform exhibits additional logic whereby the first good speech
frame after prolonged BFIs has its highest Xmaxcr reduced (but not messed with
otherwise); if that good speech frame is again followed by BFIs, the ECU goes
back to silence frame output right away - or at least that's what we saw in one
experiment. This aspect has not been studied in detail.

Comfort noise generation (DTXd=0)
---------------------------------

The comfort noise output from Nokia's TFO transform generally agrees with my
reading of GSM 06.12 spec section 6.1, the section that describes CN generation.
However, the following parts were surprising/unexpected:

1) The TRAU reacts to SID updates with a delay of 24 frames. Suppose that frame
   #20 in the input is the initial SID, frame #24 (TAF position) is the first
   SID update, frame #48 is the next SID update and so forth. In the output
   from Nokia's TFO transform, the updated parameters from input frame #24 will
   appear in output frame #48, those from input frame #48 will appear in output
   frame #72 and so forth. There is no sensible explanation for this extraneous
   buffering delay; at first I thought it was an artifact of the CN parameter
   interpolation mechanism, but:

2) No interpolation is done! I deliberately constructed input sequences in
   which each subsequent SID update has wildly different parameters from the
   previous, and when the changeover does happen in the DL output after the
   strange delay of 24 frames, the change is immediate and abrupt.
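
The two observations above can be captured in a small Python model. This is
only a description of the externally visible behaviour, not a claim about how
the TRAU is implemented internally; the function name and data layout are
hypothetical, and the model assumes that the initial SID takes effect without
the extra delay, since the experiment described above only pins down the
behaviour for SID updates.

```python
SID_UPDATE_DELAY = 24  # frames; the delay observed in the experiment above


def cn_params_at(out_frame, initial_sid, updates, delay=SID_UPDATE_DELAY):
    """Return the CN parameter set in effect at a given output frame.

    initial_sid: (frame_no, params) for the initial SID.  Assumption: it
                 takes effect at once, without the extra delay.
    updates:     dict mapping the input frame number of each SID update
                 to its parameter set.
    """
    first_frame, current = initial_sid
    if out_frame < first_frame:
        return None  # CN insertion has not started yet
    for upd_frame in sorted(updates):
        if upd_frame + delay <= out_frame:
            current = updates[upd_frame]  # abrupt switch, no interpolation
    return current


updates = {24: "B", 48: "C"}  # placeholder labels for parameter sets
for n in (47, 48, 71, 72):
    print(n, cn_params_at(n, (20, "A"), updates))
```

With the frame numbers from the example above, parameter set "B" from input
frame #24 first shows up at output frame #48 and set "C" from input frame #48
at output frame #72, each time replacing the previous set in a single step.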

CN muting after two missed SID updates (BFI received instead of SID in the TAF
position twice in a row) is done the same way as speech muting: the TRAU emits
exactly 5 frames with decreasing Xmaxcr (same decrement by 11), then switches
to emitting fixed silence frames.

SID forwarding (DTXd=1)
-----------------------

When DTXd is enabled on the destination call leg and the input frame stream to
the TFO transform includes SID frames (considering only valid SID for now), the
transform does not generate comfort noise - instead received SID frames are
passed through to call leg B DL, unless they are invalid SID or the muting
mechanism has to kick in because of lost SID updates.

Nokia's implementation does pass valid SID frames through (I haven't tested
invalid SID yet), but it applies the same weird delay of 24 frames to the
switchover point for each update as it does when generating CN for DTXd=0.

However, the part where Nokia's TFO transform (at least for FRv1) is plain
broken is CN muting in the case of lost SID updates. Here is what it does: it
decrements Xmaxcr by 4 (yes, by 4, not by 11) once every 24 frames (probably in
each TAF position), such that if the level of CN was very high before channel
breakdown, it will take up to 7.68 s before this CN is fully muted at the end
receiver.

GSM 06.12 section 5.4 says: "For the second lost SID frame, a muting technique
shall be used on the comfort noise that will gradually decrease the output
level, resulting in silencing of the output after a maximum of 320 ms." The
spec gives a maximum of 320 ms for total muting of CN, but with Nokia's TFO
transform in DTXd=1 case, that maximum time is 7.68 s - spec requirement
violated.
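
The 7.68 s figure follows from simple arithmetic, spelled out here as a
Python sketch. The function names are hypothetical; the inputs are the 20 ms
frame duration, the maximum Xmaxcr value of 63 (a 6-bit field) and the
decrement pattern observed above.

```python
FRAME_MS = 20    # duration of one speech frame in milliseconds
XMAXCR_MAX = 63  # largest value of the 6-bit Xmaxcr field


def steps_to_zero(start, decrement):
    """Number of decrements needed to bring start down to 0 or below."""
    return -(-start // decrement)  # ceiling division


def worst_case_muting_ms(decrement, frames_per_step):
    """Worst-case time to fully mute, starting from the maximum Xmaxcr."""
    return steps_to_zero(XMAXCR_MAX, decrement) * frames_per_step * FRAME_MS


# Observed in DTXd=1 mode: decrement by 4 once every 24 frames.
print(worst_case_muting_ms(4, 24))  # 7680 ms, i.e. 7.68 s

# For comparison: the same decrement of 4 applied on every frame.
print(worst_case_muting_ms(4, 1))   # 320 ms
```

The same decrement of 4, if applied on every frame instead of once per 24
frames, would complete in 16 frames or 320 ms, which coincides with the
maximum given in the quoted requirement.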

Only TFO, or regular FRv1 decoder too?
--------------------------------------

How does the regular FRv1 speech decoder (the one that ultimately emits G.711)
in Nokia TCSM2 TRAU implementation compare to what we've observed with their
TFO transform? Do they use a modular design where the regular decoder is a copy
of the same TFO transform followed by a standard GSM 06.10 decoder block, or do
they do something fancier?

Unfortunately we have no realistic way to answer this question: Nokia chose
not to implement the optional in-band homing mechanism for FRv1, thus we have
no way to pass test sequences through the TRAU in the decoder direction and see
if the output matches our hypothesis as to decoder logic. Hence the TFO
transform is the only part whose detailed behaviour we can realistically study
in this TRAU.

Take-away for Themyscira implementation
=======================================

My take-away points from the preceding examination of FRv1 TFO transform in
Nokia TCSM2 are:

* Our current Rx DTX handler front-end in libgsmfr2 is fine - Nokia's
  implementation is not any fancier at least in the case of TFO.

* Modularity is a good thing, and so is consistency. There is nothing wrong
  with using the same Rx DTX handler block both as our TFO transform and as the
  front-end portion of the full decoder in end terminal operation.