comparison doc/TFO-xform/HRv1 @ 35:0979407719f0
doc/TFO-xform/HRv1: article written
author | Mychaela Falconia <falcon@freecalypso.org> |
---|---|
date | Mon, 02 Sep 2024 07:32:09 +0000 |
34:35d38348c880 | 35:0979407719f0 |
---|---|
1 HRv1: relation between regular end decoder and TFO transform | |
2 ============================================================ | |
3 | |
4 The reference decoder source published by ETSI in GSM 06.06 exhibits an almost | |
5 modular design: the Rx DTX handler front-end is almost a separable piece. | |
6 Breaking it down more precisely, we can make these observations: | |
7 | |
8 0) Most aspects of bad frame handling and comfort noise generation are done by | |
9 generating new coded speech parameters, such that the output of those | |
10 algorithms can be packaged into new HRv1 codec frames to be sent to a distant | |
11 decoder. There are only two exceptions to this modularity: | |
12 | |
13 1) Handling of unreliable speech frames (BFI=0 UFI=1 in speech rather than CN | |
14 state) has a modular and a non-modular aspect: | |
15 | |
16 1a) Modular aspect: if the R0 increment from the last good frame to the | |
17 unreliable frame exceeds a certain threshold, UFI is turned into BFI, | |
18 which is then handled in a fully modular fashion. | |
19 | |
20 1b) Non-modular aspect: if the R0 increment does not meet the threshold for | |
21 turning UFI into BFI but meets another slightly lower threshold, a flag | |
22 is set that is passed into the guts of the speech decoder. That flag | |
23 triggers speech muting at the decoder output level. | |
24 | |
25 2) GSM 06.22 section 6.2 (Comfort noise generation and updating) says in the | |
26 very last sentence: | |
27 | |
28 "When updating the comfort noise parameters (frame energy and LPC | |
29 coefficients), these parameters shall be interpolated over the SID update | |
30 period to obtain smooth transitions." | |
31 | |
32 Note the change in language: the corresponding spec for FRv1 says "should | |
33 preferably", but the HRv1 spec says "shall". Furthermore, the bit-exact | |
34 implementation in the reference C code is considered normative in this | |
35 respect, and is exercised by the test sequences of GSM 06.07. | |
36 | |
37 This CN interpolation aspect is non-modular: R0 and the set of LPC | |
38 coefficients are decoded from bit parameters into linear form when CN frames | |
39 (initial and updates) are received, interpolation is done on this linear | |
40 form, and the interpolated values are passed to the main body of the speech | |
41 decoder. | |
42 | |
43 Based on these observations, we can conclude that if we wish to detach this | |
44 reference Rx DTX handler for HRv1 from the reference decoder and make it into | |
45 an implementation of TFO transform for this codec, we have to solve two | |
46 problems: | |
47 | |
48 1) Decide how to handle those UFI frames that aren't being turned into BFI; | |
49 | |
50 2) Decide how to handle R0 and LPC parameters during CN insertion. | |
51 | |
52 Nokia TCSM2 TRAU implementation | |
53 =============================== | |
54 | |
55 Now that we have a working historical bank-of-TRAUs apparatus in our lab, let's | |
56 take a look at how this vendor (Nokia) implemented the TFO transform for HRv1 | |
57 in their TRAU. Here are our findings: | |
58 | |
59 * Handling of BFI=1 frames in speech state (not in DTX) exhibits a | |
60 simplification relative to GSM 06.06 reference code. The reference code | |
61 checks to see if the last saved frame and the received errored frame have the | |
62 same voiced vs unvoiced mode: if this mode matches, codevector parameters are | |
63 taken from the errored frame, otherwise the last saved frame is regurgitated | |
64 without taking any bits from the errored frame. Nokia's TFO transform always | |
65 does the latter (no bits are taken from the errored frame), regardless of | |
66 whether the voiced vs unvoiced mode matches. | |
67 | |
68 * Aside from this just-described simplification, all other aspects of BFI=1 | |
69 handling for speech frames appear to match the reference code. | |
70 | |
71 * UFI handling appears to have been taken out altogether: even the part that | |
72 "upgrades" UFI to BFI when the R0 increment is huge appears to have been | |
73 omitted. I fed a test sequence from the TFO side that has a good speech | |
74 frame with R0=2 followed by a UFI frame with R0=31, and the TRAU happily | |
75 passed the latter frame (now treated as perfectly good) to the DL output. | |
76 | |
77 * Comfort noise generation (DTXd=0) is done exactly as the reference code would | |
78 do it, except that neither R0 nor LPC parameters are interpolated. During | |
79 each CN output interval between SID updates, R0 and LPC parameters in every | |
80 emitted CN frame are exactly equal to those received in the most recent SID | |
81 frame, as simple as that. When a new SID update comes in, the change in | |
82 emitted R0 and LPC is abrupt. | |
83 | |
84 * The lost SID criterion for CN muting appears to be slightly different between | |
85 Nokia's TFO implementation and my reading of the spec and the reference C | |
86 code. My interpretation of GSM 06.22 spec sections 5.2.3 and 5.2.4 is that | |
87 unlike FR and EFR, in the case of the HR codec the second lost SID (second | |
88 occurrence of BFI instead of SID update in TAF position) does _not_ trigger | |
89 CN muting; instead this muting is supposed to kick in on the _third_ lost SID | |
90 occurrence. (The difference in the spec was likely motivated by TAF positions | |
91 occurring every 240 ms with HR instead of every 480 ms with FR & EFR.) My | |
92 reading of the reference C code agrees with my reading of the spec - yet | |
93 Nokia's TFO implementation initiates CN muting in the frame following the | |
94 second lost SID, not the third. | |
95 | |
96 * Aside from the criterion for its initiation, the actual CN muting logic | |
97 behaves exactly like the reference C code: R0 is decremented by 2 on each | |
98 output frame following the TAF that initiates this sequence, and once R0 | |
99 reaches 0, it stays there while this zero-magnitude CN output continues | |
100 indefinitely. | |
101 | |
102 * With DTXd=1 CN output is replaced with repeated retransmission of the same | |
103 SID whose parameters would have been used for non-interpolated CN with DTXd=0, | |
104 which also agrees with the rules of GSM 08.62 section 8.2.2 paragraph 2. | |
105 | |
106 * CN muting with DTXd=1 is implemented poorly. The TRAU emits SID frames with | |
107 R0 decrementing by 2 on each frame, just as it does for generated CN | |
108 output that is in the process of being slowly muted, but this design is a | |
109 poor choice: because the BTS will only transmit one of every 12 SID update | |
110 frames and the TRAU has no way of knowing which SID will be transmitted, a | |
111 slow decrement cadence on SID frames themselves (not on CN output) makes no sense. | |
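
The DTXd=0 comfort noise behaviour described in the bullets above (R0 and LPC
held from the most recent SID, then R0 stepped down by 2 per frame once muting
starts) can be sketched as follows.  This is only a toy model in Python, not
code from any TRAU or from the GSM 06.06 reference: the class name, method
names and the mute_after_lost_sids parameter are hypothetical, the clamping of
odd R0 values at 0 is an assumption, and the DTXd=1 case is not modelled.

```python
from dataclasses import dataclass


@dataclass
class CnHoldAndMute:
    """Toy model: non-interpolated CN parameters with lost-SID muting."""
    # 2 matches the observed Nokia behaviour; 3 matches the article's
    # reading of GSM 06.22 and the reference C code.
    mute_after_lost_sids: int = 2
    r0: int = 0
    lpc: tuple = ()
    lost_sids: int = 0
    muting: bool = False

    def sid_update(self, r0, lpc):
        """A valid SID arrived: take its parameters as-is (abrupt change)."""
        self.r0 = r0
        self.lpc = tuple(lpc)
        self.lost_sids = 0
        self.muting = False

    def sid_lost(self):
        """BFI instead of a SID update in TAF position.

        Call this after emitting the frame for the TAF position, so that
        muting starts on the following frame.
        """
        self.lost_sids += 1
        if self.lost_sids >= self.mute_after_lost_sids:
            self.muting = True

    def next_cn_frame(self):
        """Return (R0, LPC) for the next emitted CN frame."""
        if self.muting:
            # Assumption: R0 is clamped at 0 rather than going negative.
            self.r0 = max(0, self.r0 - 2)
        return self.r0, self.lpc
```

With mute_after_lost_sids=2 and a last SID carrying R0=5, the frames emitted
after the second lost SID carry R0 = 3, 1, 0, 0, ... while the LPC parameters
stay unchanged.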
112 | |
113 Thoughts for Themyscira implementation | |
114 ====================================== | |
115 | |
116 Prior to getting Nokia TCSM2 working in our lab and being able to experiment | |
117 with this TRAU, when I was contemplating the idea of potentially implementing | |
118 TFO transform for HRv1 in Themyscira libraries, my main trepidation was how to | |
119 produce comfort noise in the form of "speech" parameter output. For endpoint | |
120 decoders GSM 06.22 prescribes a bit-exact algorithm with interpolation, but | |
121 that smoothly interpolated CN cannot be readily expressed in terms of parameter | |
122 bits that can be packed into a new HRv1 codec frame. I thought about | |
123 requantizing the interpolated LPC reflection coefficients on every CN output | |
124 frame, using the same computationally intensive vector quantization algorithm | |
125 as in speech encoding - but because I am not an expert in codec design, it is | |
126 not obvious to me whether or not such an approach would produce good results. | |
127 | |
128 However, seeing that Nokia got away with simply passing R0 and LPC parameters | |
129 along from incoming SID frames to CN output without any interpolation or other | |
130 transformation gives us a huge confidence boost - if Nokia did it, so can we! | |
131 This approach is of course simple, and lends itself readily to an elegant | |
132 implementation. | |
133 | |
134 Seeing that Nokia got away with effectively discarding UFI in their TFO | |
135 transform is also a confidence boost - once again, if Nokia did it, so can we. | |
136 I plan on keeping the logic that "upgrades" UFI to BFI under certain conditions | |
137 (not sure why Nokia omitted it), but the effect of potentially muting speech in | |
138 the guts of the decoder (past parameter-level manipulation) is not really | |
139 feasible to implement in a TFO transform. | |
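
A minimal sketch of the UFI plan in the paragraph above, in Python for
brevity: the UFI-to-BFI "upgrade" is kept, and the decoder-internal muting flag
is simply dropped.  The function name and the upgrade_threshold parameter are
hypothetical; the article does not state the threshold value used by the
reference code, so it is left as an input here.

```python
def effective_bfi(bfi, ufi, r0, last_good_r0, upgrade_threshold):
    """Decide whether a frame should be handled as bad (BFI) in the transform.

    bfi, ufi          -- flags received with the frame
    r0                -- R0 parameter of the received frame
    last_good_r0      -- R0 of the last good frame
    upgrade_threshold -- hypothetical placeholder for the reference threshold
    """
    if bfi:
        return True
    # Keep the modular part: a large R0 jump on an unreliable frame turns
    # UFI into BFI, which is then handled like any other bad frame.
    if ufi and (r0 - last_good_r0) > upgrade_threshold:
        return True
    # The lower "mute inside the decoder" threshold is not modelled: a UFI
    # frame that is not upgraded is passed on as a good frame.
    return False
```

For the test case mentioned earlier (good frame with R0=2 followed by a UFI
frame with R0=31), the increment is 29, so any threshold below 29 would turn
that frame into BFI instead of passing it through.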
140 | |
141 Finally, regarding the logic that takes codevector parameters from errored | |
142 (BFI) frames when the voicing mode matches between the last saved frame and the | |
143 errored frame, the logic that exists in the reference C code but not in Nokia's | |
144 TFO transform: I plan on keeping this logic in our version, but Nokia's approach | |
145 will come in handy for handling BFI-no-data frames, a condition that does not | |
146 exist in TDM-based Abis transport or in TFO, but does unfortunately exist in | |
147 IP-based GSM RAN. |
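
The two bad frame substitution variants discussed above can be sketched
together as follows.  This is an illustration under stated assumptions, not the
reference algorithm: frames are modelled as Python dicts with hypothetical
"voiced" and "codevectors" fields, errored=None stands for a BFI-no-data frame,
and every other aspect of BFI handling is left out.

```python
def substitute_bad_frame(last_saved, errored, use_errored_bits=True):
    """Build the substitute frame for a BFI=1 frame in speech state.

    last_saved       -- dict for the last saved frame (hypothetical layout)
    errored          -- dict for the errored frame, or None for BFI-no-data
    use_errored_bits -- True: reference-style behaviour as described above;
                        False: Nokia-style, never use the errored frame
    """
    # Start from a copy of the last saved frame in every case.
    out = dict(last_saved)
    if (use_errored_bits
            and errored is not None
            and errored["voiced"] == last_saved["voiced"]):
        # Voicing mode matches: take codevector parameters from the
        # errored frame, everything else from the last saved frame.
        out["codevectors"] = errored["codevectors"]
    return out
```

With errored=None the function falls back to repeating the last saved frame,
which is the Nokia-style path suggested above for BFI-no-data input.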