view Voice-memo-feature @ 98:915ff61137ee
Speech-codec-selection: document MSCAP
author | Mychaela Falconia <falcon@freecalypso.org> |
---|---|
date | Tue, 06 Jun 2023 01:47:36 +0000 |
The full Calypso hw+fw solution as delivered by TI (the relevant components
here are the DSP, the official L1 code and RiViera Audio Service) implements an
interesting feature called voice memos. It is actually two paired features:

* Voice memo recording: in almost all states of the MS (no GSM network at all,
  or idle mode, or in an active call) it is possible to activate an extra
  instance of GSM 06.10 encoder that takes input from the microphone (and also
  from the active call downlink if invoked during a speech call) and writes its
  output into an otherwise-unused DSP buffer. The combination of L1 and
  RiViera Audio Service then writes this speech recording into a file in FFS.

* Voice memo playback: voice memo files recorded with the just-described VM
  record feature can be played into the phone's speaker output. The operation
  of playing a previously recorded voice memo is conceptually no different from
  playing tones or melodies, and can likewise be done in any state: with no GSM
  network at all, in idle mode, or in an active call.

VM recording and VM playback cannot be active at the same time: they use the
same DSP buffer, and likely other mutually exclusive DSP resources too.
Furthermore, the same DSP buffer that is used for these VM features is also
used for TCH UL substitution debug/test feature described in the TCH-tap-modes
article - therefore, all 3 features (VM record, VM play and TCH UL play) need
to be treated as mutually exclusive in time.

However, aside from this mutual exclusion, it is very remarkable that VM
recording or VM playback can be invoked during an active speech call (which can
use any codec!), and the extra instance of FR1 encoder or decoder (always FR1)
invoked by VM features is essentially independent from the main TCH encoder and
the main TCH decoder, all of which run simultaneously.
It is worth noting that all newer GSM speech codecs (HR1, EFR and AMR) are much
more computationally intensive than FR1, thus given that the DSP has the
necessary horsepower to run any one of those "heavy" codecs, it probably isn't
too much extra work to also run a simultaneous instance of unidirectional
(encoder only or decoder only) FR1.

The entire voice memo facility was already fully implemented in the TCS211 code
delivery from TI, but prior to FreeCalypso there was no way to exercise it. In
order to exercise VM functionality in TCS211, one needs to invoke these RiViera
Audio Service API functions:

audio_vm_record_start()
audio_vm_record_stop()
audio_vm_play_start()
audio_vm_play_stop()

In FreeCalypso we've added some simple AT commands that call the just-listed
API functions, and the facility that has been there all along is now accessible
to play - it is the same situation as with Melody E1.

FreeCalypso AT commands for voice memo testing
==============================================

AT@VMR="/pathname",dur,dtx

This command initiates VM recording. The FFS pathname into which the recording
should be written must be given as a quoted string (and as a reminder, all FFS
pathnames must be absolute - there are no current directories in the firmware
architecture), and there is a second required argument that sets the maximum
size of the recording.

The duration argument is a decimal integer, and it is reckoned in 1000-word
units: if you specify duration as 1, the maximum recording size is 1000 words
(2000 bytes), if you specify duration as 2, the maximum recording size is 2000
words (4000 bytes), and so forth. If you record with DTX disabled, each block
of 1000 words corresponds to 1 second in time (every 20 ms frame turns into a
block of 20 words), thus with DTX disabled the duration argument becomes the
actual duration in seconds.
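
The size arithmetic for the duration argument can be written out as a small
Python sketch. This is only an illustration of the numbers given above, not
code from the firmware or from FreeCalypso host tools; the function and
constant names are made up for this example, and the sketch does not try to
account for the end-marker word.

```python
# Hypothetical helper illustrating the AT@VMR duration arithmetic.
WORDS_PER_UNIT = 1000   # the duration argument counts 1000-word units
BYTES_PER_WORD = 2      # DSP words are 16 bits wide
WORDS_PER_FRAME = 20    # one 20 ms frame occupies 20 words with DTX disabled
FRAME_MS = 20           # duration of one speech frame in milliseconds


def vmr_limits(dur):
    """Return (max_words, max_bytes, seconds_without_dtx) for a duration arg."""
    if dur < 1:
        raise ValueError("duration argument must be a positive integer")
    max_words = dur * WORDS_PER_UNIT
    max_bytes = max_words * BYTES_PER_WORD
    # With DTX disabled every frame has the same size, so time follows size.
    frames = max_words // WORDS_PER_FRAME
    seconds = frames * FRAME_MS / 1000
    return max_words, max_bytes, seconds


if __name__ == "__main__":
    for dur in (1, 2, 30):
        print(dur, vmr_limits(dur))
```

For dur=1 the sketch gives 1000 words, 2000 bytes and 1.0 second, matching the
figures above; the seconds value is meaningful only for DTX-disabled recording.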
However, if you record with DTX enabled, then periods of silence will be
written in a compressed format described later in this article, and the time
duration of the recording will depend on how much silence there is.

The dtx argument is 1 to enable DTX or 0 to disable it; the default is DTX
disabled. The employed FR1 DTX algorithm appears to be the same as would be
used for TCH/FS uplink, except that an "artificial" (there is no SACCH with
independent-of-GSM voice memos) TAF position is generated on every 16th audio
frame, i.e., every 320 ms. (Note the shortening of this SID interval compared
to official TCH, where it is 24 frames or 480 ms.)

AT@VMRS

This command stops any VM recording in progress, but it is rarely needed - the
recording will stop automatically when the size limit is reached.

AT@VMP="/pathname"

This command initiates playback of the VM recording contained in the named file
in FFS. The FFS pathname is the only argument.

AT@VMPS

This command stops any VM playback in progress, but it is rarely needed - the
playback will stop automatically when the end-marker is read from the file.

Voice memo file format
======================

Using fc-fsio, you can read out voice memo files written by the VM record
facility, and you can likewise construct your own memo files externally, upload
them into FC device FFS and then play them via the VM play facility.

The format of these files is determined by TI's firmware stack (RV Audio
Service on top of L1 on top of the DSP), but is fundamentally based on a DSP
buffer that is just like those used for TCH. The companion TCH-tap-modes
article describes the format of the DSP buffer from which TCH DL bits can be
read out; in the present article we are going to cover the differences specific
to the voice memo facility.

When VM recording is done with DTX disabled, every 20 ms speech frame turns
into a block of 40 bytes in the memo file.
This block of 40 bytes is produced from 20 16-bit words in the DSP buffer, each
word turned into two bytes in LE order by the ARM part of Calypso. The DSP
buffer used for the VM facility has the same overall format as the one used for
TCH DL, described in the TCH-tap-modes article - 3 status or header words
followed by 17 words of payload, with the latter words carrying a 260-bit FR1
codec frame in the bit order of GSM 05.03 interface 1. As explained in the
TCH-tap-modes article, speech codec payload words are filled in the msb-to-lsb
direction by the DSP, thus the natural byte-oriented representation would be
big-endian - but because the little-endian ARM core sits in between the DSP and
the on-media file format, the byte order in voice memo files comes out "wrong".
Oh well - it is what it is.

Of the 3 header words that precede every 20 ms speech frame, words 1 and 2
appear to be dummies - they have meaning related to the channel decoder block
in the case of TCH DL, but in the case of isolated-from-GSM voice memos, there
does not seem to be any meaning. However, header or status word 0, consisting
of bit flags, is still important, but the bit flags for the VM facility are
different from those of TCH DL.

When VM recording is done with DTX disabled, status word 0 is observed to
always equal 0xC400 on every frame. However, when DTX is enabled, the
following bits are seen in status word 0:

* Bit 15 will be set if this frame needs to be saved in its entirety, or
  cleared if it is to be skipped. When VM recording code in L1S sees that the
  DSP has delivered a frame with this status bit cleared, it will save only
  this status word 0, i.e., 2 bytes will be written into the memo file instead
  of 40 bytes for this 20 ms frame. On VM playback, the code likewise checks
  this bit to see how many words need to be read from the file, so
  synchronization is maintained.

* Bit 14 appears to be the SP flag of GSM 06.31 section 5.1: set when a speech
  frame has been generated, or cleared when a SID frame has been generated
  instead.

* Bit 11 is a TAF-like flag: when DTX is enabled, this bit is set in every 16th
  frame generated by the DSP in the VM recording session, otherwise it is
  cleared.

* Bit 10 will always be set in every status word 0 that gets written to voice
  memo files: this bit is set by the DSP when it has finished encoding a 20 ms
  audio frame and is checked by L1S on every TDMA frame, serving as a
  synchronization mechanism telling L1S when it needs to copy a speech frame
  from the DSP to the memo file.

When VM recording is done with DTX enabled, the recorded memo file will consist
of speech frames (header word 0xC400 or 0xCC00), SID frames (header word 0x8400
or 0x8C00) and skipped frames consisting of only the header word 0x0400, with
the remaining words omitted. There will always be a present (not skipped)
frame in every 16th position (0xCC00 or 0x8C00), thus no 0x0C00 frames are ever
seen.

Every voice memo binary file ends with a 0xFBFF end-marker word; this
end-marker is needed because TCS211 fw architecture exhibits a separation
between the actual data reading and writing processes in L1S and the FFS
read/write interface provided by RiViera Audio Service, and because of this
separation the operational code in L1S can't "see" an EOF condition at the file
system level.

FreeCalypso tools for decoding voice memo files
===============================================

If you have recorded a voice memo with AT@VMR and then read it out with
fc-fsio, you can use additional FC tools to analyze it. The following tools
are available, split between FC host tools and GSM codec libs & utilities
packages:

* fc-vm2hex converts a binary VM recording into ASCII hex format, similar to
  the old (2016) TCH DL recording format before it was extended in late 2022.
  Every fully-written frame is emitted in the hex output as 3 space-separated
  hex status words followed by a block of 66 hex digits giving the FR1 codec
  frame in the unchanged bit order of TI's DSP, and every skipped frame (one
  for which only status word 0 was written into the memo file) is emitted in
  the hex output as just that one word.

* gsmfr-dlcap-parse utility, originally written for parsing TCH DL capture
  files, accepts TCH DL recording files in both old and new formats, and it
  also accepts the output from fc-vm2hex as its input. The combination of
  fc-vm2hex and gsmfr-dlcap-parse allows a developer or tinkerer to do thorough
  human analysis of TCS211 VM recordings in both DTX-disabled and DTX-enabled
  modes.

* As of fc-host-tools-r18 there is a new fc-vm2gsmx utility that reads binary
  VM recording files (as you would read out with fc-fsio) and converts them
  into extended-libgsm (gsmx) format defined in our GSM codec libraries &
  utilities package. This gsmx format is an extension of the classic libgsm
  (GSM 06.10) format, adding the possibility of SID frames and BFI markers
  (frame gaps) in addition to regular speech frames, thus it can represent the
  content of a voice memo recording made in DTX mode. These gsmx files can
  then be decoded into playable WAV with our gsmfr-decode utility.

FreeCalypso tools for external generation of voice memo files
=============================================================

Using FreeCalypso tools, you can produce an external speech recording in GSM
06.10 FR1 codec format, convert it into TCS211 VM format, upload it into FC
device FFS with fc-fsio, and then play these externally-produced voice memos
with AT@VMP. The steps are as follows:

1) You can use gsmfr-encode to FR1-encode a speech sample from WAV into classic
   .gsm format, or gsmfr-encode-r if the source is raw BE instead of WAV.
   Alternatively, you can use any other off-the-shelf software that can encode
   FR1 and write libgsm format; SoX shipped with Slackware includes the
   necessary support.

2) fc-gsm2vm converts a .gsm recording into non-DTX TCS211 VM format.

At the present time we don't have any tools for producing external DTX-enabled
VM recordings: the main limitation is that at least to this Mother's knowledge,
the published source software community does not currently possess a GSM 06.10
encoding library that has been extended with VAD and DTX functions. There is
classic libgsm from 1990s, used by everyone in the FOSS community who needs a
GSM 06.10 encoder or decoder, but it doesn't do DTX; we (FreeCalypso and
Themyscira Wireless) have produced our own libgsmfrp front-end that implements
Rx DTX handler functions (that's how we can properly decode FR1 streams that
contain SIDs and/or missing frames), but it doesn't help with DTX encoding.
Therefore, our ability to produce TCS211-compatible VM recordings externally is
currently limited to non-DTX mode.
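
Before uploading an externally produced memo file (or after reading one out
with fc-fsio), it can be handy to check that it follows the on-media layout
described in the file format section. The following Python sketch is not one
of the FreeCalypso tools and all of its names are hypothetical; it only walks
the structure as documented in this article (little-endian 16-bit words, bit 15
of status word 0 selecting between a full 20-word frame and a lone status word,
and the 0xFBFF end-marker) and does not interpret the FR1 payload in any way.

```python
import struct

END_MARKER = 0xFBFF     # end-marker word closing every voice memo file
FRAME_WORDS = 20        # 3 status words + 17 payload words per full frame
NON_DTX_STATUS = 0xC400 # status word 0 observed with DTX disabled


def walk_vm(data):
    """Yield (status_word_0, frame_bytes_or_None) per 20 ms frame position."""
    pos = 0
    while True:
        if pos + 2 > len(data):
            raise ValueError("ran off the end without seeing the end-marker")
        (status,) = struct.unpack_from("<H", data, pos)
        # The end-marker also has bit 15 set, so it must be checked first.
        if status == END_MARKER:
            return
        if status & 0x8000:
            size = FRAME_WORDS * 2
            if pos + size > len(data):
                raise ValueError("truncated frame at offset %d" % pos)
            yield status, data[pos:pos + size]
            pos += size
        else:
            # Skipped frame: only status word 0 was written.
            yield status, None
            pos += 2


def classify(status):
    """Name the frame type from bits 15 and 14 of status word 0."""
    if not status & 0x8000:
        return "skipped"
    return "speech" if status & 0x4000 else "SID"


def is_plain_non_dtx(data):
    """True if every frame carries the status word seen with DTX disabled."""
    return all(status == NON_DTX_STATUS for status, _ in walk_vm(data))
```

A file that is expected to be in non-DTX format can be checked with
is_plain_non_dtx(open(path, "rb").read()); for a DTX-mode recording read out of
the device, [classify(s) for s, _ in walk_vm(data)] gives the sequence of
speech, SID and skipped positions, each of which stands for one 20 ms frame.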