comparison doc/TFO-xform/HRv1 @ 35:0979407719f0

doc/TFO-xform/HRv1: article written
author Mychaela Falconia <falcon@freecalypso.org>
date Mon, 02 Sep 2024 07:32:09 +0000
HRv1: relation between regular end decoder and TFO transform
============================================================

The reference decoder source published by ETSI in GSM 06.06 exhibits an almost
modular design: the Rx DTX handler front-end is almost a separable piece.
Breaking it down more precisely, we can make these observations:

0) Most aspects of bad frame handling and comfort noise generation are done by
   generating new coded speech parameters, such that the output of those
   algorithms can be packaged into new HRv1 codec frames to be sent to a distant
   decoder. There are only two exceptions to this modularity:

1) Handling of unreliable speech frames (BFI=0 UFI=1 in speech rather than CN
   state) has a modular and a non-modular aspect:

   1a) Modular aspect: if the R0 increment from the last good frame to the
       unreliable frame exceeds a certain threshold, UFI is turned into BFI,
       which is then handled in a fully modular fashion.

   1b) Non-modular aspect: if the R0 increment does not meet the threshold for
       turning UFI into BFI but meets another slightly lower threshold, a flag
       is set that is passed into the guts of the speech decoder. That flag
       effects speech muting on the decoder output level.

2) GSM 06.22 section 6.2 (Comfort noise generation and updating) says in the
   very last sentence:

   "When updating the comfort noise parameters (frame energy and LPC
   coefficients), these parameters shall be interpolated over the SID update
   period to obtain smooth transitions."

   Note the change in language: the corresponding spec for FRv1 says "should
   preferably", but the HRv1 spec says "shall". Furthermore, the bit-exact
   implementation in the reference C code is considered normative in this
   aspect, and is exercised by the test sequences of GSM 06.07.

   This CN interpolation aspect is non-modular: R0 and the set of LPC
   coefficients are decoded from bit parameters into linear form when CN frames
   (initial and updates) are received, interpolation is done on this linear
   form, and the interpolated values are passed to the main body of the speech
   decoder.
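
To make the non-modular nature of this interpolation concrete, here is a
minimal sketch in Python. It is not the bit-exact fixed-point algorithm of the
reference C code: it uses plain floating-point linear interpolation, and the
function name interpolate_cn, the exact weighting and the 12-frame period are
all assumptions made for illustration. The only point it shows is that the
per-frame values live in linear form, between the values decoded from two
successive SID frames, rather than in the form of coded parameter bits.

```python
def interpolate_cn(old, new, period=12):
    """Return one list of linear-domain values per output frame.

    old, new: decoded (linear) parameter vectors from the previous and the
    newly received SID frame; period: number of frames per SID update period
    (12 is an assumption for this sketch).
    """
    frames = []
    for n in range(1, period + 1):
        w = n / period  # weight of the new SID grows from 1/period to 1
        frames.append([(1 - w) * o + w * v for o, v in zip(old, new)])
    return frames


steps = interpolate_cn([0.0, 10.0], [12.0, 22.0])
print(steps[5])   # halfway point between the two SID updates
print(steps[-1])  # the new SID values, reached at the end of the period
```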

Based on these observations, we can conclude that if we wish to detach this
reference Rx DTX handler for HRv1 from the reference decoder and make it into
an implementation of TFO transform for this codec, we have to solve two
problems:

1) Decide how to handle those UFI frames that aren't being turned into BFI;

2) Decide how to handle R0 and LPC parameters during CN insertion.

Nokia TCSM2 TRAU implementation
===============================

Now that we have a working historical bank-of-TRAUs apparatus in our lab, let's
take a look at how this vendor (Nokia) implemented the TFO transform for HRv1
in their TRAU. Here are our findings:

* Handling of BFI=1 frames in speech state (not in DTX) exhibits a
  simplification relative to GSM 06.06 reference code. The reference code
  checks to see if the last saved frame and the received errored frame have the
  same voiced vs unvoiced mode: if this mode matches, codevector parameters are
  taken from the errored frame, otherwise the last saved frame is regurgitated
  without taking any bits from the errored frame. Nokia's TFO transform always
  does the latter (no bits are taken from the errored frame) irrespective of
  voiced vs unvoiced mode matching or not.

* Aside from this just-described simplification, all other aspects of BFI=1
  handling for speech frames appear to match the reference code.

* UFI handling appears to have been taken out altogether: even the part that
  "upgrades" UFI to BFI when the R0 increment is huge appears to have been
  omitted. I fed a test sequence from the TFO side that has a good speech frame
  with R0=2 followed by a UFI frame with R0=31, and the TRAU happily passed the
  latter frame (now treated as perfectly good) to the DL output.

* Comfort noise generation (DTXd=0) is done exactly as the reference code would
  do it, except that neither R0 nor LPC parameters are interpolated. During
  each CN output interval between SID updates, R0 and LPC parameters in every
  emitted CN frame are exactly equal to those received in the most recent SID
  frame, as simple as that. When a new SID update comes in, the change in
  emitted R0 and LPC is abrupt.

* The lost SID criterion for CN muting appears to be slightly different between
  Nokia's TFO implementation and my reading of the spec and the reference C
  code. My interpretation of GSM 06.22 spec sections 5.2.3 and 5.2.4 is that
  unlike FR and EFR, in the case of HR codec the second lost SID (second
  occurrence of BFI instead of SID update in TAF position) does _not_ trigger
  CN muting; instead this muting is supposed to kick in on the _third_ lost SID
  occurrence. (The difference in the spec was likely motivated by TAF positions
  occurring every 240 ms with HR instead of every 480 ms with FR & EFR.) My
  reading of the reference C code agrees with my reading of the spec - yet
  Nokia's TFO implementation initiates CN muting in the frame following the
  second lost SID, not the third.

* Aside from the criterion for its initiation, the actual CN muting logic
  behaves exactly like the reference C code: R0 is decremented by 2 on each
  output frame following the TAF that initiates this sequence, and once R0
  reaches 0, it stays there while this zero-magnitude CN output continues
  indefinitely.

* With DTXd=1 CN output is replaced with repeated retransmission of the same
  SID whose parameters would have been used for non-interpolated CN with DTXd=0,
  which also agrees with the rules of GSM 08.62 section 8.2.2 paragraph 2.

* CN muting with DTXd=1 is implemented poorly. The TRAU emits SID frames with
  R0 decrementing by 2 on each frame, just as it does for generated CN output
  that's in the process of being slowly muted, but this design is a poor
  choice: because the BTS will only transmit one of every 12 SID update frames
  and the TRAU has no way of knowing which SID will be transmitted, slow
  decrement cadence on SID frames themselves (not on CN output) makes no sense.
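
The CN muting behaviour described in the findings above can be sketched as a
tiny state machine. This is only an illustration in Python, not code from the
reference decoder or from the TRAU: the class and method names are
hypothetical, and the sketch deliberately leaves out what happens when a valid
SID arrives again. The lost_sid_limit parameter captures the one observed
difference: 2 models the Nokia behaviour, 3 models the reading of GSM 06.22
and the reference C code given above.

```python
class CnMuter:
    """Tracks lost SIDs at TAF positions and the R0 value being emitted."""

    def __init__(self, r0, lost_sid_limit):
        self.r0 = r0                 # R0 taken from the last good SID
        self.limit = lost_sid_limit  # 2 = observed Nokia TRAU, 3 = spec reading
        self.lost = 0                # consecutive lost SIDs seen so far
        self.muting = False

    def taf_without_sid(self):
        """Call when BFI arrives in place of a SID update at a TAF position."""
        self.lost += 1
        if self.lost >= self.limit:
            self.muting = True

    def next_frame_r0(self):
        """R0 for the next emitted frame: drops by 2 per frame while muting."""
        if self.muting:
            self.r0 = max(0, self.r0 - 2)  # once at 0, it stays at 0
        return self.r0


nokia = CnMuter(r0=5, lost_sid_limit=2)
nokia.taf_without_sid()
nokia.taf_without_sid()
print([nokia.next_frame_r0() for _ in range(4)])
```

With the limit set to 3 instead, the same two lost SIDs leave R0 untouched and
muting begins only after a third call to taf_without_sid().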

Thoughts for Themyscira implementation
======================================

Prior to getting Nokia TCSM2 working in our lab and being able to experiment
with this TRAU, when I was contemplating the idea of potentially implementing
TFO transform for HRv1 in Themyscira libraries, my main trepidation was how to
produce comfort noise in the form of "speech" parameter output. For endpoint
decoders GSM 06.22 prescribes a bit-exact algorithm with interpolation, but
that smoothly interpolated CN cannot be readily expressed in terms of parameter
bits that can be packed into a new HRv1 codec frame. I thought about
requantizing the interpolated LPC reflection coefficients on every CN output
frame, using the same computationally intensive vector quantization algorithm
as in speech encoding - but because I am not an expert in codec design, it is
not obvious to me whether or not such an approach would produce good results.

However, seeing that Nokia got away with simply passing R0 and LPC parameters
along from incoming SID frames to CN output without any interpolation or other
transformation gives us a huge confidence boost - if Nokia did it, so can we!
This approach is of course simple, and lends itself readily to an elegant
implementation.
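
As a rough illustration of how little is involved in this pass-through
approach, here is a sketch in Python. The function name and the representation
of SID parameters as an opaque (R0, LPC) tuple are assumptions made for this
example, and the other parameters that a complete CN output frame would need
are not shown.

```python
def hold_last_sid(received):
    """Return the (R0, LPC) pair to emit for each CN output frame.

    received: one entry per output frame, holding the quantized (R0, LPC)
    parameters of a SID frame that arrived at that point, or None if no new
    SID arrived. The most recent SID is repeated unchanged until the next
    one replaces it, so every change in the output is abrupt.
    """
    current = None
    out = []
    for sid in received:
        if sid is not None:
            current = sid  # new SID update: switch over immediately
        out.append(current)
    return out


sid_a = (10, (1, 2, 3))  # made-up parameter values, for illustration only
sid_b = (14, (4, 5, 6))
print(hold_last_sid([sid_a, None, None, sid_b, None]))
```

Because the emitted values are always copies of parameters that arrived in
coded form, no requantization step is needed.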

Seeing that Nokia got away with effectively discarding UFI in their TFO
transform is also a confidence boost - once again if Nokia did it, so can we.
I plan on keeping the logic that "upgrades" UFI to BFI under certain conditions
(not sure why Nokia omitted it), but the effect of potentially muting speech in
the guts of the decoder (past parameter-level manipulation) is not really
feasible to implement in a TFO transform.
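
The resulting UFI decision could look something like the following Python
sketch. The function name, the use of >= for both comparisons and the
threshold values in the usage lines are placeholders: the actual thresholds
and comparison rules have to be taken from the GSM 06.06 reference code and
are deliberately not filled in here.

```python
def classify_ufi(last_good_r0, r0, bfi_threshold, mute_threshold):
    """Classify a speech frame received with BFI=0 UFI=1.

    Returns "bfi" if the frame should be upgraded to BFI, "mute" if the
    reference decoder would only set its internal muting flag, or "good"
    otherwise. Both thresholds are supplied by the caller.
    """
    increment = r0 - last_good_r0
    if increment >= bfi_threshold:
        return "bfi"
    if increment >= mute_threshold:
        return "mute"
    return "good"


def tfo_upgrade_to_bfi(last_good_r0, r0, bfi_threshold, mute_threshold):
    """In a TFO transform only the "bfi" outcome can be acted upon."""
    verdict = classify_ufi(last_good_r0, r0, bfi_threshold, mute_threshold)
    return verdict == "bfi"


# Placeholder thresholds (10 and 5), chosen only to exercise the code paths.
print(classify_ufi(2, 31, bfi_threshold=10, mute_threshold=5))
print(tfo_upgrade_to_bfi(10, 17, bfi_threshold=10, mute_threshold=5))
```

In the second call the frame falls into the "mute" band, which a TFO transform
has no means of expressing, so the frame would be passed along as good.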

Finally, regarding the logic that takes codevector parameters from errored
(BFI) frames when the voicing mode matches between the last saved frame and the
errored frame, the logic that exists in the reference C code but not in Nokia's
TFO transform: I plan on keeping this logic in our version, but Nokia's approach
will come in handy for handling BFI-no-data frames, a condition that does not
exist in TDM-based Abis transport or in TFO, but does unfortunately exist in
IP-based GSM RAN.
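
A minimal Python sketch of this substitution choice follows. The dictionary
layout, field names and function name are hypothetical, and the other aspects
of BFI handling in the reference code are not modelled; only the choice of
where the codevector parameters come from is shown.

```python
def substitute_bfi(last_saved, errored, always_repeat=False):
    """Build the substitute frame to emit in place of a BFI=1 speech frame.

    last_saved: parameters of the last saved frame.
    errored: parameters of the received errored frame, or None when there is
             no data at all (the BFI-no-data case).
    always_repeat: True models the simplified behaviour observed in the TRAU.
    """
    out = dict(last_saved)
    if errored is None or always_repeat:
        return out  # no bits are taken from the errored frame
    if errored["voiced"] == last_saved["voiced"]:
        # Voicing mode matches: take codevector parameters from the bad frame.
        out["codevec"] = errored["codevec"]
    return out


saved = {"voiced": True, "r0": 20, "codevec": (1, 2)}
bad = {"voiced": True, "r0": 5, "codevec": (7, 8)}
print(substitute_bfi(saved, bad))
print(substitute_bfi(saved, None))
```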