view Voice-memo-feature @ 99:c28a1518d268

Speech-codec-selection: document AT%SPVER
author Mychaela Falconia <falcon@freecalypso.org>
date Tue, 06 Jun 2023 03:54:53 +0000
parents 80f0996bfd16

The full Calypso hw+fw solution as delivered by TI (the relevant components here
are the DSP, the official L1 code and RiViera Audio Service) implements an
interesting feature called voice memos.  It is actually two paired features:

* Voice memo recording: in almost all states of the MS (no GSM network at all,
  or idle mode, or in an active call) it is possible to activate an extra
  instance of the GSM 06.10 encoder that takes input from the microphone (and
  also from the active call downlink if invoked during a speech call) and writes
  its output into an otherwise-unused DSP buffer.  The combination of L1 and
  RiViera Audio Service then writes this speech recording into a file in FFS.

* Voice memo playback: voice memo files recorded with the just-described VM
  record feature can be played into the phone's speaker output.  The operation
  of playing a previously recorded voice memo is conceptually no different from
  playing tones or melodies, and can likewise be done in any state: with no GSM
  network at all, in idle mode, or in an active call.

VM recording and VM playback cannot be active at the same time: they use the
same DSP buffer, and likely other mutually exclusive DSP resources too.
Furthermore, the same DSP buffer that is used for these VM features is also
used for the TCH UL substitution debug/test feature described in the
TCH-tap-modes article - therefore, all 3 features (VM record, VM play and TCH UL
play) need to be treated as mutually exclusive in time.  However, aside from
this mutual exclusion, it is remarkable that VM recording or VM playback can be
invoked during an active speech call (which can use any codec!): the extra
instance of the FR1 encoder or decoder (always FR1) invoked by VM features is
essentially independent from the main TCH encoder and the main TCH decoder, and
all of them run simultaneously.  It is worth noting that all newer GSM speech
codecs (HR1, EFR and AMR) are much more computationally intensive than FR1, thus
given that the DSP has the necessary horsepower to run any one of those "heavy"
codecs, it probably isn't too much extra work to also run a simultaneous
instance of unidirectional (encoder only or decoder only) FR1.

The entire voice memo facility was already fully implemented in the TCS211 code
delivery from TI, but prior to FreeCalypso there was no way to exercise it.  In
order to exercise VM functionality in TCS211, one needs to invoke these RiViera
Audio Service API functions:

audio_vm_record_start()
audio_vm_record_stop()
audio_vm_play_start()
audio_vm_play_stop()

In FreeCalypso we've added some simple AT commands that call the just-listed API
functions, and the facility that has been there all along is now accessible to
play with - it is the same situation as with Melody E1.

FreeCalypso AT commands for voice memo testing
==============================================

AT@VMR="/pathname",dur,dtx

This command initiates VM recording.  The FFS pathname into which the recording
should be written must be given as a quoted string (and as a reminder, all FFS
pathnames must be absolute - there are no current directories in the firmware
architecture), and there is a second required argument that sets the maximum
size of the recording.  The duration argument is a decimal integer, and it is
reckoned in 1000-word units: if you specify duration as 1, the maximum recording
size is 1000 words (2000 bytes), if you specify duration as 2, the maximum
recording size is 2000 words (4000 bytes), and so forth.  If you record with DTX
disabled, each block of 1000 words corresponds to 1 second in time (every 20 ms
frame turns into a block of 20 words), thus with DTX disabled the duration
argument becomes the actual duration in seconds.  However, if you record with
DTX enabled, then periods of silence will be written in a compressed format
described later in this article, and the time duration of the recording will
depend on how much silence there is.
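
The size arithmetic above can be sketched in a few lines of Python.  This is
only an illustration of the numbers given in this article: the function name
is hypothetical, and whether the end-marker word counts against the size limit
is not addressed here.

```python
WORDS_PER_UNIT = 1000    # the AT@VMR duration argument counts 1000-word units
BYTES_PER_WORD = 2       # DSP words are 16 bits wide
WORDS_PER_FRAME = 20     # with DTX disabled, each 20 ms frame is 20 words
FRAME_MS = 20

def vm_record_limits(dur):
    """Return (max_words, max_bytes, non_dtx_seconds) for a duration argument."""
    max_words = dur * WORDS_PER_UNIT
    max_bytes = max_words * BYTES_PER_WORD
    frames = max_words // WORDS_PER_FRAME
    non_dtx_seconds = frames * FRAME_MS / 1000
    return max_words, max_bytes, non_dtx_seconds

print(vm_record_limits(1))
print(vm_record_limits(10))
```

With DTX enabled only the word and byte limits stay meaningful; the time
covered by the recording then depends on how many frames were skipped.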

The dtx argument is 1 to enable DTX or 0 to disable it; the default is DTX
disabled.  The employed FR1 DTX algorithm appears to be the same as would be
used for TCH/FS uplink, except that an "artificial" (there is no SACCH with
independent-of-GSM voice memos) TAF position is generated on every 16th audio
frame, i.e., every 320 ms.  (Note the shortening of this SID interval compared
to official TCH, where it is 24 frames or 480 ms.)
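
As a small illustration of the command syntax just described, the following
Python sketch builds an AT@VMR command line.  The helper name and the example
pathname are hypothetical, and the sketch assumes that the dtx argument may be
omitted when the default (DTX disabled) is wanted.

```python
def vmr_command(pathname, dur, dtx=None):
    """Build an AT@VMR command string from its arguments."""
    # FFS pathnames must be absolute: there are no current directories
    if not pathname.startswith("/"):
        raise ValueError("FFS pathname must be absolute")
    cmd = f'AT@VMR="{pathname}",{int(dur)}'
    # Append the dtx argument only when the caller gave one explicitly
    if dtx is not None:
        cmd += f",{1 if dtx else 0}"
    return cmd

print(vmr_command("/memo1", 10))            # 10 units, DTX left at its default
print(vmr_command("/memo1", 10, dtx=True))  # same size limit, DTX enabled
```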

AT@VMRS

This command stops any VM recording in progress, but it is rarely needed - the
recording will stop automatically when the size limit is reached.

AT@VMP="/pathname"

This command initiates playback of the VM recording contained in the named file
in FFS.  The FFS pathname is the only argument.

AT@VMPS

This command stops any VM playback in progress, but it is rarely needed - the
playback will stop automatically when the end-marker is read from the file.

Voice memo file format
======================

Using fc-fsio, you can read out voice memo files written by the VM record
facility, and you can likewise construct your own memo files externally, upload
them into FC device FFS and then play them via the VM play facility.  The format
of these files is determined by TI's firmware stack (RV Audio Service on top of
L1 on top of the DSP), but is fundamentally based on a DSP buffer that is just
like those used for TCH.  The companion TCH-tap-modes article describes the
format of the DSP buffer from which TCH DL bits can be read out; in the present
article we are going to cover the differences specific to the voice memo
facility.

When VM recording is done with DTX disabled, every 20 ms speech frame turns into
a block of 40 bytes in the memo file.  This block of 40 bytes is produced from
20 16-bit words in the DSP buffer, each word turned into two bytes in LE order
by the ARM part of Calypso.  The DSP buffer used for the VM facility has the
same overall format as the one used for TCH DL, described in the TCH-tap-modes
article - 3 status or header words followed by 17 words of payload, with the
latter words carrying a 260-bit FR1 codec frame in the bit order of GSM 05.03
interface 1.  As explained in the TCH-tap-modes article, speech codec payload
words are filled in the msb-to-lsb direction by the DSP, thus the natural byte-
oriented representation would be big-endian - but because the little-endian ARM
core sits in between the DSP and the on-media file format, the byte order in
voice memo files comes out "wrong".  Oh well - it is what it is.
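
The byte order issue can be illustrated with a minimal Python sketch that
splits one fully-written 40-byte block into its 3 header words and a payload
in the "natural" big-endian byte order.  The function name is hypothetical,
and the sketch assumes that the 260 codec bits start at the msb of the first
payload word.

```python
import struct

def split_vm_frame(block):
    """Split a 40-byte VM frame into (header_words, payload_bytes)."""
    if len(block) != 40:
        raise ValueError("a fully-written VM frame is 40 bytes")
    # The file stores each 16-bit DSP word in little-endian byte order
    words = struct.unpack("<20H", block)
    header = words[:3]
    # Repack the 17 payload words msb first, undoing the per-word swap
    payload = struct.pack(">17H", *words[3:])
    return header, payload

block = struct.pack("<20H", 0xC400, 0, 0, 0x1234, *([0] * 16))
header, payload = split_vm_frame(block)
print([f"{w:04X}" for w in header], payload[:2].hex())
```

The resulting payload is 34 bytes (17 words); only the first 260 bits of it
carry the FR1 codec frame.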

Of the 3 header words that precede every 20 ms speech frame, words 1 and 2
appear to be dummies - they have meaning related to the channel decoder block
in the case of TCH DL, but in the case of isolated-from-GSM voice memos, there
does not seem to be any meaning.  Header or status word 0, consisting of bit
flags, is still important, however, and its bit flags for the VM facility are
different from those of TCH DL.

When VM recording is done with DTX disabled, status word 0 is observed to always
equal 0xC400 on every frame.  However, when DTX is enabled, the following bits
are seen in status word 0:

* Bit 15 will be set if this frame needs to be saved in its entirety, or cleared
  if it is to be skipped.  When VM recording code in L1S sees that the DSP has
  delivered a frame with this status bit cleared, it will save only this status
  word 0, i.e., 2 bytes will be written into the memo file instead of 40 bytes
  for this 20 ms frame.  On VM playback, the code likewise checks this bit to
  see how many words need to be read from the file, so synchronization is
  maintained.

* Bit 14 appears to be the SP flag of GSM 06.31 section 5.1: set when a speech
  frame has been generated, or cleared when a SID frame has been generated
  instead.

* Bit 11 is a TAF-like flag: when DTX is enabled, this bit is set in every 16th
  frame generated by the DSP in the VM recording session, otherwise it is
  cleared.

* Bit 10 will always be set in every status word 0 that gets written to voice
  memo files: this bit is set by the DSP when it has finished encoding a 20 ms
  audio frame and is checked by L1S on every TDMA frame, serving as a
  synchronization mechanism telling L1S when it needs to copy a speech frame
  from the DSP to the memo file.

When VM recording is done with DTX enabled, the recorded memo file will consist
of speech frames (header word 0xC400 or 0xCC00), SID frames (header word 0x8400
or 0x8C00) and skipped frames consisting of only the header word 0x0400, with
the remaining words omitted.  There will always be a present (not skipped) frame
in every 16th position (0xCC00 or 0x8C00), thus no 0x0C00 frames are ever seen.
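
The flag bits and header word values listed above can be decoded with a short
Python sketch like the following.  The function and key names are made up for
illustration; only the bit positions come from this article.

```python
def decode_status0(word):
    """Return (kind, flags) for status word 0 of a VM frame."""
    flags = {
        "full_frame": bool(word & 0x8000),  # bit 15: all 20 words were saved
        "sp":         bool(word & 0x4000),  # bit 14: speech (1) or SID (0)
        "taf_like":   bool(word & 0x0800),  # bit 11: every 16th frame in DTX
        "frame_done": bool(word & 0x0400),  # bit 10: DSP finished this frame
    }
    if not flags["full_frame"]:
        kind = "skipped"
    elif flags["sp"]:
        kind = "speech"
    else:
        kind = "SID"
    return kind, flags

for w in (0xC400, 0xCC00, 0x8400, 0x8C00, 0x0400):
    print(f"0x{w:04X}", decode_status0(w)[0])
```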

Every voice memo binary file ends with a 0xFBFF end-marker word; this end-marker
is needed because the TCS211 fw architecture exhibits a separation between the
actual data reading and writing processes in L1S and the FFS read/write
interface provided by RiViera Audio Service, and because of this separation the
operational code in L1S can't "see" an EOF condition at the file system level.
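
Putting the pieces together, a whole memo file can be walked as in the
following Python sketch.  It is not the code of any FreeCalypso tool: the
names are hypothetical, and it assumes that the end-marker is stored in the
same little-endian byte order as every other word.  Note that the end-marker
has to be checked before bit 15, as 0xFBFF has that bit set too.

```python
import struct

END_MARKER = 0xFBFF   # end-marker word described in this article
FULL_FRAME = 0x8000   # bit 15 of status word 0
FRAME_MS = 20

def iter_vm_frames(data):
    """Yield (status0, rest); rest is None for skipped frames, else 38 bytes."""
    pos = 0
    while pos + 2 <= len(data):
        (status0,) = struct.unpack_from("<H", data, pos)
        pos += 2
        if status0 == END_MARKER:
            return
        if status0 & FULL_FRAME:
            # 19 more words follow: 2 header words and 17 payload words
            rest = data[pos:pos + 38]
            if len(rest) < 38:
                raise ValueError("truncated frame")
            pos += 38
            yield status0, rest
        else:
            yield status0, None
    raise ValueError("no end-marker found")

def vm_duration_ms(data):
    """Every frame, skipped or not, stands for 20 ms of recording time."""
    return sum(FRAME_MS for _ in iter_vm_frames(data))
```

Because skipped frames still occupy one word each, the time duration of a DTX
recording can be recovered by counting all status words up to the end-marker.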

FreeCalypso tools for decoding voice memo files
===============================================

If you have recorded a voice memo with AT@VMR and then read it out with fc-fsio,
you can use additional FC tools to analyze it.  The following tools are
available, split between FC host tools and GSM codec libs & utilities packages:

* fc-vm2hex converts a binary VM recording into ASCII hex format, similar to
  the old (2016) TCH DL recording format before it was extended in late 2022.
  Every fully-written frame is emitted in the hex output as 3 space-separated
  hex status words followed by a block of 66 hex digits giving the FR1 codec
  frame in the unchanged bit order of TI's DSP, and every skipped frame (one
  for which only status word 0 was written into the memo file) is emitted in
  the hex output as just that one word.

* The gsmfr-dlcap-parse utility, originally written for parsing TCH DL capture
  files, accepts TCH DL recording files in both old and new formats, and it also
  accepts the output from fc-vm2hex as its input.  The combination of fc-vm2hex
  and gsmfr-dlcap-parse allows a developer or tinkerer to do thorough human
  analysis of TCS211 VM recordings in both DTX-disabled and DTX-enabled modes.

* As of fc-host-tools-r18 there is a new fc-vm2gsmx utility that reads binary VM
  recording files (as you would read out with fc-fsio) and converts them into
  extended-libgsm (gsmx) format defined in our GSM codec libraries & utilities
  package.  This gsmx format is an extension of the classic libgsm (GSM 06.10)
  format, adding the possibility of SID frames and BFI markers (frame gaps) in
  addition to regular speech frames, thus it can represent the content of a
  voice memo recording made in DTX mode.  These gsmx files can then be decoded
  into playable WAV with our gsmfr-decode utility.

FreeCalypso tools for external generation of voice memo files
=============================================================

Using FreeCalypso tools, you can produce an external speech recording in GSM
06.10 FR1 codec format, convert it into TCS211 VM format, upload it into FC
device FFS with fc-fsio, and then play this externally-produced voice memo
with AT@VMP.  The steps are as follows:

1) You can use gsmfr-encode to FR1-encode a speech sample from WAV into classic
   .gsm format, or gsmfr-encode-r if the source is raw BE instead of WAV.
   Alternatively, you can use any other off-the-shelf software that can encode
   FR1 and write libgsm format; SoX shipped with Slackware includes the
   necessary support.

2) fc-gsm2vm converts a .gsm recording into non-DTX TCS211 VM format.

At the present time we don't have any tools for producing external DTX-enabled
VM recordings: the main limitation is that at least to this Mother's knowledge,
the published source software community does not currently possess a GSM 06.10
encoding library that has been extended with VAD and DTX functions.  There is
the classic libgsm from the 1990s, used by everyone in the FOSS community who
needs a GSM 06.10 encoder or decoder, but it doesn't do DTX; we (FreeCalypso and
Themyscira Wireless) have produced our own libgsmfrp front-end that implements
Rx DTX handler functions (that's how we can properly decode FR1 streams that
contain SIDs and/or missing frames), but it doesn't help with DTX encoding.
Therefore, our ability to produce TCS211-compatible VM recordings externally is
currently limited to non-DTX mode.