diff Voice-memo-feature @ 96:69061d044f05

Voice-memo-feature: new article
author Mychaela Falconia <falcon@freecalypso.org>
date Tue, 27 Dec 2022 20:23:55 +0000
parents
children 80f0996bfd16
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/Voice-memo-feature	Tue Dec 27 20:23:55 2022 +0000
@@ -0,0 +1,217 @@
+The full Calypso hw+fw solution as delivered by TI (the relevant components here
+are the DSP, the official L1 code and RiViera Audio Service) implements an
+interesting feature called voice memos.  It is actually two paired features:
+
+* Voice memo recording: in almost all states of the MS (no GSM network at all,
+  idle mode, or an active call) it is possible to activate an extra instance of
+  the GSM 06.10 encoder that takes input from the microphone (and also from the
+  active call downlink if invoked during a speech call) and writes its
+  output into an otherwise-unused DSP buffer.  The combination of L1 and RiViera
+  Audio Service then writes this speech recording into a file in FFS.
+
+* Voice memo playback: voice memo files recorded with the just-described VM
+  record feature can be played into the phone's speaker output.  The operation
+  of playing a previously recorded voice memo is conceptually no different from
+  playing tones or melodies, and can likewise be done in any state: with no GSM
+  network at all, in idle mode, or in an active call.
+
+VM recording and VM playback cannot be active at the same time: they use the
+same DSP buffer, and likely other mutually exclusive DSP resources too.
+Furthermore, the same DSP buffer that is used for these VM features is also
+used for the TCH UL substitution debug/test feature described in the
+TCH-tap-modes article - therefore, all 3 features (VM record, VM play and TCH
+UL play) need to be treated as mutually exclusive in time.  Aside from this
+mutual exclusion, it is remarkable that VM recording or VM playback can be
+invoked during an active speech call (which can use any codec!), and that the
+extra FR1 encoder or decoder instance (always FR1) invoked by VM features is
+essentially independent of the main TCH encoder and decoder, all of which
+run simultaneously.  It is worth noting that all newer GSM speech codecs (HR1,
+EFR and AMR) are much more computationally intensive than FR1, thus given that
+the DSP has the necessary horsepower to run any one of those "heavy" codecs, it
+probably isn't too much extra work to also run a simultaneous instance of
+unidirectional (encoder only or decoder only) FR1.
+
+The entire voice memo facility was already fully implemented in the TCS211 code
+delivery from TI, but prior to FreeCalypso there was no way to exercise it.  In
+order to exercise VM functionality in TCS211, one needs to invoke these RiViera
+Audio Service API functions:
+
+audio_vm_record_start()
+audio_vm_record_stop()
+audio_vm_play_start()
+audio_vm_play_stop()
+
+In FreeCalypso we've added some simple AT commands that call the just-listed API
+functions, and the facility that has been there all along is now accessible to
+play with - it is the same situation as with Melody E1.
+
+FreeCalypso AT commands for voice memo testing
+==============================================
+
+AT@VMR="/pathname",dur,dtx
+
+This command initiates VM recording.  The FFS pathname into which the recording
+should be written must be given as a quoted string (and as a reminder, all FFS
+pathnames must be absolute - there are no current directories in the firmware
+architecture), and there is a second required argument that sets the maximum
+size of the recording.  The duration argument is a decimal integer, and it is
+reckoned in 1000-word units: if you specify duration as 1, the maximum recording
+size is 1000 words (2000 bytes), if you specify duration as 2, the maximum
+recording size is 2000 words (4000 bytes), and so forth.  If you record with DTX
+disabled, each block of 1000 words corresponds to 1 second in time (every 20 ms
+frame turns into a block of 20 words), thus with DTX disabled the duration
+argument becomes the actual duration in seconds.  However, if you record with
+DTX enabled, then periods of silence will be written in a compressed format
+described later in this article, and the time duration of the recording will
+depend on how much silence there is.
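+
+The size arithmetic above can be sketched in a few lines of Python.  This is
+only an illustration of the numbers given in the text: the function and
+constant names are made up for this example, and whether the end-marker word
+counts toward the limit is not something this sketch tries to answer.
+
```python
WORDS_PER_UNIT = 1000   # one unit of the dur argument, as stated in the text
BYTES_PER_WORD = 2      # each DSP word is 16 bits
WORDS_PER_FRAME = 20    # one fully written 20 ms frame
FRAME_MS = 20           # duration of one speech frame


def vm_record_limit(dur):
    """Return (max_words, max_bytes, seconds_without_dtx) for a dur argument.

    Hypothetical helper: the seconds figure only holds with DTX disabled,
    when every 20 ms frame is written out as a full 20-word block.
    """
    max_words = dur * WORDS_PER_UNIT
    max_bytes = max_words * BYTES_PER_WORD
    frames = max_words // WORDS_PER_FRAME
    seconds = frames * FRAME_MS / 1000
    return max_words, max_bytes, seconds


if __name__ == "__main__":
    for dur in (1, 2, 10):
        print(dur, vm_record_limit(dur))
```
+
+With DTX enabled the word and byte limits stay the same, but the time covered
+by the recording can only be longer, by an amount that depends on the input.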
+
+The dtx argument is 1 to enable DTX or 0 to disable it; the default is DTX
+disabled.  The employed FR1 DTX algorithm appears to be the same as would be
+used for TCH/FS uplink, except that an "artificial" (there is no SACCH with
+independent-of-GSM voice memos) TAF position is generated on every 16th audio
+frame, i.e., every 320 ms.  (Note the shortening of this SID interval compared
+to official TCH, where it is 24 frames or 480 ms.)
+
+AT@VMRS
+
+This command stops any VM recording in progress, but it is rarely needed - the
+recording will stop automatically when the size limit is reached.
+
+AT@VMP="/pathname"
+
+This command initiates playback of the VM recording contained in the named file
+in FFS.  The FFS pathname is the only argument.
+
+AT@VMPS
+
+This command stops any VM playback in progress, but it is rarely needed - the
+playback will stop automatically when the end-marker is read from the file.
+
+Voice memo file format
+======================
+
+Using fc-fsio, you can read out voice memo files written by the VM record
+facility, and you can likewise construct your own memo files externally, upload
+them into FC device FFS and then play them via the VM play facility.  The format
+of these files is determined by TI's firmware stack (RV Audio Service on top of
+L1 on top of the DSP), but is fundamentally based on a DSP buffer that is just
+like those used for TCH.  The companion TCH-tap-modes article describes the
+format of the DSP buffer from which TCH DL bits can be read out; in the present
+article we are going to cover the differences specific to the voice memo
+facility.
+
+When VM recording is done with DTX disabled, every 20 ms speech frame turns into
+a block of 40 bytes in the memo file.  This block of 40 bytes is produced from
+20 16-bit words in the DSP buffer, each word turned into two bytes in LE order
+by the ARM part of Calypso.  The DSP buffer used for the VM facility has the
+same overall format as the one used for TCH DL, described in the TCH-tap-modes
+article - 3 status or header words followed by 17 words of payload, with the
+latter words carrying a 260-bit FR1 codec frame in the bit order of GSM 05.03
+interface 1.  As explained in the TCH-tap-modes article, speech codec payload
+words are filled in the msb-to-lsb direction by the DSP, thus the natural byte-
+oriented representation would be big-endian - but because the little-endian ARM
+core sits in between the DSP and the on-media file format, the byte order in
+voice memo files comes out "wrong".  Oh well - it is what it is.
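+
+As a rough illustration of this byte order issue, the following Python sketch
+takes one 40-byte block from a memo file, reads it as 20 little-endian words,
+and re-emits the 17 payload words in big-endian order to recover the natural
+byte-oriented form of the codec frame.  The function name is hypothetical, and
+taking the first 33 bytes (260 bits rounded up) is our assumption based on the
+msb-to-lsb fill order described above.
+
```python
import struct

FRAME_WORDS = 20    # 3 header words + 17 payload words
HEADER_WORDS = 3
FR1_BYTES = 33      # 260 bits rounded up to whole bytes (assumption)


def split_vm_frame(block):
    """Split one fully written 40-byte frame into (status_words, fr1_bytes)."""
    if len(block) != FRAME_WORDS * 2:
        raise ValueError("a fully written VM frame is 40 bytes")
    # The ARM side stored every 16-bit DSP word in little-endian byte order.
    words = struct.unpack("<20H", block)
    status = words[:HEADER_WORDS]
    # Re-pack the payload words big-endian so that bits run msb-first.
    payload = struct.pack(">17H", *words[HEADER_WORDS:])
    return status, payload[:FR1_BYTES]


if __name__ == "__main__":
    # Synthetic frame: status word 0xC400, dummy headers, payload words 0x1234.
    block = struct.pack("<20H", 0xC400, 0, 0, *([0x1234] * 17))
    status, fr1 = split_vm_frame(block)
    print([hex(w) for w in status], fr1[:4].hex())
```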
+
+Of the 3 header words that precede every 20 ms speech frame, words 1 and 2
+appear to be dummies - they have meaning related to the channel decoder block
+in the case of TCH DL, but in the case of isolated-from-GSM voice memos, there
+does not seem to be any meaning.  Header or status word 0, consisting of bit
+flags, is still important, but its bit flags for the VM facility are different
+from those of TCH DL.
+
+When VM recording is done with DTX disabled, status word 0 is observed to always
+equal 0xC400 on every frame.  However, when DTX is enabled, the following bits
+are seen in status word 0:
+
+* Bit 15 will be set if this frame needs to be saved in its entirety, or cleared
+  if it is to be skipped.  When VM recording code in L1S sees that the DSP has
+  delivered a frame with this status bit cleared, it will save only this status
+  word 0, i.e., 2 bytes will be written into the memo file instead of 40 bytes
+  for this 20 ms frame.  On VM playback, the code likewise checks this bit to
+  see how many words need to be read from the file, so synchronization is
+  maintained.
+
+* Bit 14 appears to be the SP flag of GSM 06.31 section 5.1: set when a speech
+  frame has been generated, or cleared when a SID frame has been generated
+  instead.
+
+* Bit 11 is a TAF-like flag: when DTX is enabled, this bit is set in every 16th
+  frame generated by the DSP in the VM recording session, otherwise it is
+  cleared.
+
+* Bit 10 will always be set in every status word 0 that gets written to voice
+  memo files: this bit is set by the DSP when it has finished encoding a 20 ms
+  audio frame and is checked by L1S on every TDMA frame, serving as a
+  synchronization mechanism telling L1S when it needs to copy a speech frame
+  from the DSP to the memo file.
+
+When VM recording is done with DTX enabled, the recorded memo file will consist
+of speech frames (header word 0xC400 or 0xCC00), SID frames (header word 0x8400
+or 0x8C00) and skipped frames consisting of only the header word 0x0400, with
+the remaining words omitted.  There will always be a present (not skipped) frame
+in every 16th position (0xCC00 or 0x8C00), thus no 0x0C00 frames are ever seen.
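+
+The bit flags and header word values listed above can be checked against each
+other with a small Python sketch.  The mask names are our own invention for
+this example (they are not taken from TI's or FreeCalypso's source), and the
+interpretation of each bit is only as firm as the observations described
+above.
+
```python
# Hypothetical names for the status word 0 bits described in the text.
VM_FULL_FRAME = 0x8000   # bit 15: frame is saved in its entirety
VM_SP = 0x4000           # bit 14: speech frame (set) vs. SID frame (cleared)
VM_TAF = 0x0800          # bit 11: TAF-like flag, every 16th frame with DTX
VM_FRAME_DONE = 0x0400   # bit 10: DSP has finished encoding this frame


def classify_status(word):
    """Return (kind, taf) where kind is 'speech', 'SID' or 'skipped'."""
    taf = bool(word & VM_TAF)
    if not word & VM_FULL_FRAME:
        return "skipped", taf
    return ("speech" if word & VM_SP else "SID"), taf


if __name__ == "__main__":
    for word in (0xC400, 0xCC00, 0x8400, 0x8C00, 0x0400):
        print(hex(word), classify_status(word))
```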
+
+Every voice memo binary file ends with a 0xFBFF end-marker word; this end-marker
+is needed because the TCS211 fw architecture exhibits a separation between the
+actual data reading and writing processes in L1S and the FFS read/write
+interface provided by RiViera Audio Service, and because of this separation the
+operational code in L1S can't "see" an EOF condition at the file system level.
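+
+Putting the framing rules together, a memo file can be walked from an external
+program roughly as in the following Python sketch.  It is not the code of any
+FreeCalypso tool: the function name is hypothetical, and it assumes that the
+end-marker is stored in the same little-endian byte order as every other word
+in the file.
+
```python
import struct

END_MARKER = 0xFBFF      # terminates every voice memo file
FULL_FRAME_BIT = 0x8000  # bit 15 of status word 0
FULL_FRAME_BYTES = 40    # 20 words of 2 bytes each


def iter_vm_frames(data):
    """Yield (status_word, block) pairs; block is None for skipped frames."""
    pos = 0
    while pos + 2 <= len(data):
        (status,) = struct.unpack_from("<H", data, pos)
        # Check for the end-marker first: 0xFBFF also has bit 15 set.
        if status == END_MARKER:
            return
        if status & FULL_FRAME_BIT:
            block = data[pos:pos + FULL_FRAME_BYTES]
            if len(block) < FULL_FRAME_BYTES:
                raise ValueError("truncated frame at offset %d" % pos)
            yield status, block
            pos += FULL_FRAME_BYTES
        else:
            # Skipped frame: only status word 0 was written.
            yield status, None
            pos += 2
    raise ValueError("no end-marker found")


if __name__ == "__main__":
    # Synthetic file: one full speech frame, one skipped frame, end-marker.
    data = (struct.pack("<20H", 0xC400, *([0] * 19))
            + struct.pack("<H", 0x0400)
            + struct.pack("<H", END_MARKER))
    for status, block in iter_vm_frames(data):
        print(hex(status), "skipped" if block is None else len(block))
```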
+
+FreeCalypso tools for decoding voice memo files
+===============================================
+
+If you have recorded a voice memo with AT@VMR and then read it out with fc-fsio,
+you can use additional FC tools to analyze it.  The following tools are
+available, split between FC host tools and GSM codec libs & utilities packages:
+
+* fc-vm2hex converts a binary VM recording into ASCII hex format, similar to
+  the old (2016) TCH DL recording format before it was extended in late 2022.
+  Every fully-written frame is emitted in the hex output as 3 space-separated
+  hex status words followed by a block of 66 hex digits giving the FR1 codec
+  frame in the unchanged bit order of TI's DSP, and every skipped frame (one
+  for which only status word 0 was written into the memo file) is emitted in
+  the hex output as just that one word.
+
+* The gsmfr-dlcap-parse utility, originally written for parsing TCH DL capture
+  files, accepts TCH DL recording files in both old and new formats, and it also
+  accepts the output from fc-vm2hex as its input.  The combination of fc-vm2hex
+  and gsmfr-dlcap-parse allows a developer or tinkerer to do thorough human
+  analysis of TCS211 VM recordings in both DTX-disabled and DTX-enabled modes.
+
+* There will soon be a new fc-vm2gsmx utility that will read binary VM recording
+  files (as you would read out with fc-fsio) and convert them into extended-
+  libgsm (gsmx) format defined in our GSM codec libraries & utilities package.
+  This gsmx format is an extension of the classic libgsm (GSM 06.10) format,
+  adding the possibility of SID frames and BFI markers (frame gaps) in addition
+  to regular speech frames, thus it can represent the content of a voice memo
+  recording made in DTX mode.  These gsmx files can then be decoded into
+  playable WAV with our gsmfr-decode utility.
+
+FreeCalypso tools for external generation of voice memo files
+=============================================================
+
+Using FreeCalypso tools, you can produce an external speech recording in GSM
+06.10 FR1 codec format, convert it into TCS211 VM format, upload it into FC
+device FFS with fc-fsio, and then play this externally-produced voice memo
+with AT@VMP.  The steps are as follows:
+
+1) You can use gsmfr-encode to FR1-encode a speech sample from WAV into classic
+   .gsm format, or gsmfr-encode-r if the source is raw BE instead of WAV.
+   Alternatively, you can use any other off-the-shelf software that can encode
+   FR1 and write libgsm format; SoX shipped with Slackware includes the
+   necessary support.
+
+2) fc-gsm2vm converts a .gsm recording into non-DTX TCS211 VM format.
+
+At the present time we don't have any tools for producing external DTX-enabled
+VM recordings: the main limitation is that at least to this Mother's knowledge,
+the published source software community does not currently possess a GSM 06.10
+encoding library that has been extended with VAD and DTX functions.  There is
+classic libgsm from the 1990s, used by everyone in the FOSS community who needs a
+GSM 06.10 encoder or decoder, but it doesn't do DTX; we (FreeCalypso and
+Themyscira Wireless) have produced our own libgsmfrp front-end that implements
+Rx DTX handler functions (that's how we can properly decode FR1 streams that
+contain SIDs and/or missing frames), but it doesn't help with DTX encoding.
+Therefore, our ability to produce TCS211-compatible VM recordings externally is
+currently limited to non-DTX mode.