view doc/Loadtools-performance @ 860:7d4f080f66db

fc-add-ramps: actually produce correct output
author Mychaela Falconia <falcon@freecalypso.org>
date Sat, 18 Dec 2021 21:17:33 +0000

Memory dump performance
=======================

Here are the expected run times for dumping the entire flash content of a
Calypso GSM device with the flash dump2bin operation, using the current version
of fc-loadtool with its new binary transfer protocol:

Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud:
6m04s

The same 4 MiB flash dump at 812500 baud: 0m52s

Dump of 8 MiB flash (e.g., Mot C155/156) at 812500 baud: 1m44s

These times are a 2x improvement over all previous versions of fc-loadtool
(prior to fc-host-tools-r13), which used a hex-based transfer protocol.

Because of the architecture of fc-loadtool and its loadagent back-end, the run
time of a flash dump operation depends only on the serial baud rate and the
size of the flash area to be dumped; it should not depend on the USB-serial
adapter type or any host system properties, as long as the host system and
serial adapter combination supports the desired baud rate.  In contrast, the
run times of flash programming and fc-xram loading operations do depend on the
host system and USB-serial adapter or other serial port hardware: unlike flash
dumps, which are entirely target-driven, these operations involve
command-response exchanges between the host tool and loadagent, and thus are
sensitive to host system latency.
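
As a rough sanity check on the dump times listed above, the following Python
sketch computes the time needed just to clock the bytes over the serial line.
It assumes 8N1 framing (10 bit times per byte) and no protocol overhead at all;
both are simplifying assumptions of this sketch, and the function names are
made up for illustration.

```python
def line_time_seconds(nbytes, baud, bits_per_byte=10):
    """Time to clock nbytes over a serial line, ignoring protocol overhead.

    bits_per_byte=10 assumes 8N1 framing (start bit + 8 data bits + stop bit).
    """
    return nbytes * bits_per_byte / baud


def fmt_min_sec(seconds):
    """Format a duration as XmYYs, rounded to the nearest second."""
    total = round(seconds)
    return f"{total // 60}m{total % 60:02d}s"


MIB = 1024 * 1024

# The three dump cases listed in this section.
for size_mib, baud in [(4, 115200), (4, 812500), (8, 812500)]:
    t = line_time_seconds(size_mib * MIB, baud)
    print(f"{size_mib} MiB at {baud} baud: {fmt_min_sec(t)}")
```

Under these assumptions the estimates come out at 6m04s, 0m52s and 1m43s, i.e.,
within about a second of the measured times, which is consistent with the dump
time being set by the baud rate and the dump size alone.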

Flash programming operations
============================

Here are some examples of expected flash programming times, all obtained on the
Mother's Slackware 14.2 host system:

Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 0m19s to
erase 37 sectors, 3m45s to program the image.

Flashing the same OM GTA02 modem with the same fw image, using a CP2102
USB-serial cable at 812500 baud: 0m19s to erase, 0m51s to program.

Flashing a Magnetite hybrid fw image (2378084 bytes) into an FCDEV3B board
(S71PL129N flash chip) via an FT2232D adapter at 812500 baud: 0m24s to erase
13 sectors (4 small and 9 large), 1m27s to program the image.

Regardless of whether you execute these two steps separately or use one of our
new flash e-program-{bin,m0,srec} commands, flash programming is always done in
two steps: first the erase operation covering the needed range of sectors, then
the actual programming operation that includes the data transfer.

Flash erase times are determined entirely by physical processes inside the
flash chip and thus should not be affected by software design or the serial
link.  For each sector to be erased, fc-loadtool issues the sector erase
command to the flash chip and then polls the chip for operation completion
status.  The polling is done over the serial link and thus may seem very slow,
but the extra latency added by the finite polling speed is still negligible (at
least on the Mother's Slackware system) compared to the time of the actual
sector erase operation inside the flash chip.  One remaining flaw is that in
our current implementation the issuance of each individual sector erase command
to the flash chip takes 6 command-response exchanges between fc-loadtool and
loadagent.  On the Mother's Slackware system this extra overhead is still
negligible compared to the 0.5s or more taken by the actual erase operation,
but it may become more significant on host systems with higher latency.
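
To put numbers on this, the sketch below derives the average erase time per
sector from the two examples above, and then shows how the cost of the 6
command-response exchanges would grow with host round-trip latency.  The
round-trip values are purely hypothetical inputs chosen for illustration, not
measurements, and the function names are invented for this sketch.

```python
EXCHANGES_PER_SECTOR = 6  # command-response exchanges per sector erase command


def avg_erase_time(total_seconds, sectors):
    """Average wall-clock erase time per sector (includes serial overhead)."""
    return total_seconds / sectors


def exchange_overhead_fraction(round_trip_s, erase_s):
    """Fraction of per-sector time spent on the 6 command exchanges.

    Ignores the additional status polling exchanges; round_trip_s is a
    hypothetical host latency, not a measured value.
    """
    overhead = EXCHANGES_PER_SECTOR * round_trip_s
    return overhead / (overhead + erase_s)


gta02 = avg_erase_time(19, 37)    # GTA02: 0m19s for 37 sectors
fcdev3b = avg_erase_time(24, 13)  # FCDEV3B: 0m24s for 13 sectors (mixed sizes)
print(f"GTA02: {gta02:.2f} s/sector, FCDEV3B: {fcdev3b:.2f} s/sector")

for rtt_ms in (1, 4, 16):  # hypothetical round-trip latencies
    frac = exchange_overhead_fraction(rtt_ms / 1000, 0.5)
    print(f"{rtt_ms} ms round trip: {frac:.1%} of per-sector time")
```

The GTA02 average works out to about 0.51 s per sector, in line with the "0.5s
or more" figure.  With a hypothetical 1 ms round trip the 6 exchanges are
around 1% of the per-sector time, but with a 16 ms round trip they would be
roughly a sixth of it.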

After the erase operation, the execution time of the main flash programming
operation is a sum of 3 components:

* The time it takes for the bits to be transferred over the serial link;
* The time it takes for the flash programming operation to complete on the
  target (physics inside the flash chip);
* The overhead of command-response exchanges between fc-loadtool and loadagent.

Because image data transfer is taking place in this step, flash programming at
812500 baud is faster than at 115200 baud, although the improvement is less
than the 7x seen with flash dumps (3m45s versus 0m51s in the GTA02 example
above, or about 4.4x).  The present version of fc-loadtool also uses a new
binary transfer protocol instead of the hex-based one used in previous versions
(prior to fc-host-tools-r13); this change produces a 2x improvement for OM
GTA02 flashing, but only a smaller improvement for FCDEV3B flashing.
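
The reason the speed-up falls short of the baud rate ratio can be illustrated
by splitting the two GTA02 programming times into the serial transfer component
and everything else.  This is only a rough sketch: it assumes 8N1 framing (10
bit times per byte), it lumps flash chip time and command-response overhead
together as "remainder", and it glosses over the fact that the two runs used
different USB-serial cables (PL2303 and CP2102).

```python
IMAGE_BYTES = 2376448  # GTA02 firmware image size from the example above


def line_time_seconds(nbytes, baud, bits_per_byte=10):
    """Pure serial transfer time, assuming 8N1 framing (an assumption)."""
    return nbytes * bits_per_byte / baud


def split_program_time(measured_s, baud):
    """Return (transfer, remainder) for one measured programming run."""
    transfer = line_time_seconds(IMAGE_BYTES, baud)
    return transfer, measured_s - transfer


slow = split_program_time(3 * 60 + 45, 115200)  # 3m45s at 115200 baud
fast = split_program_time(51, 812500)           # 0m51s at 812500 baud

for label, (transfer, rest) in (("115200", slow), ("812500", fast)):
    print(f"{label} baud: transfer {transfer:.1f} s, remainder {rest:.1f} s")

print(f"baud ratio {812500 / 115200:.2f}, "
      f"programming speed-up {(3 * 60 + 45) / 51:.2f}")
```

The remainder comes out at roughly 19 s and 22 s for the two runs: the part of
the programming time that does not scale with the baud rate stays about the
same, so only the transfer component benefits from the faster link.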

Notice the difference in flash programming times between GTA02 and FCDEV3B.
The fw image size is almost exactly the same, and any difference in latency
between CP2102 and FT2232D is unlikely to produce such a significant time
difference given our current 2048 byte transfer block size (in fact fc-xram
transfer times suggest that FT2232D is faster).  The difference in physical
flash program operation times between K5A3281CTM and S71PL129N flash chips thus
seems to be the most likely explanation.
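
The comparison can be made more concrete by estimating how much non-transfer
time each 2048-byte block costs on the two targets at 812500 baud.  The sketch
below assumes 8N1 framing and assumes the image is sent as consecutive
2048-byte blocks with one final partial block; the per-block figure it produces
mixes flash chip time with command-response overhead and cannot separate the
two.

```python
import math

BLOCK_SIZE = 2048  # transfer block size stated in this document
BAUD = 812500


def per_block_remainder_ms(image_bytes, measured_s, bits_per_byte=10):
    """Non-transfer time per block in ms (flash physics + exchange overhead).

    bits_per_byte=10 assumes 8N1 framing; this is an assumption of the sketch.
    """
    transfer_s = image_bytes * bits_per_byte / BAUD
    blocks = math.ceil(image_bytes / BLOCK_SIZE)
    return (measured_s - transfer_s) * 1000 / blocks


gta02 = per_block_remainder_ms(2376448, 51)    # K5A3281CTM, CP2102, 0m51s
fcdev3b = per_block_remainder_ms(2378084, 87)  # S71PL129N, FT2232D, 1m27s

print(f"GTA02:   {gta02:.1f} ms per block")
print(f"FCDEV3B: {fcdev3b:.1f} ms per block")
```

This gives roughly 19 ms per block on the GTA02 against roughly 50 ms per block
on the FCDEV3B, a gap of about 30 ms per 2048-byte block.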

Note also that in the current version of fc-loadtool there is no difference in
performance between flash program-bin, program-m0 and program-srec operations:
they all use the same binary protocol with 2048 byte transfer block size.  In
the case of flash program-m0 and program-srec, there is no coupling between
source S-records and flash programming operation blocks (2048-byte units).  The
new implementation of these commands prereads the entire S-record image as a
separate preparatory step on the host side and saves the bits to be programmed
in a temporary binary file (automatically deleted afterward).  The actual flash
programming operation then proceeds from this internal binary source, but it
knows about any discontiguous program regions and skips the gaps properly.
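
The general idea of decoupling S-records from programming blocks can be
sketched as follows.  This is an illustration only, not the fc-loadtool
implementation: it keeps the image in memory rather than in a temporary file,
handles only S3 records, assumes the records arrive in ascending address order,
and all function names are invented here.

```python
def make_s3(addr, data):
    """Build one S3 record (helper used only to feed this sketch)."""
    body = bytes([len(data) + 5]) + addr.to_bytes(4, "big") + bytes(data)
    checksum = 0xFF - (sum(body) & 0xFF)
    return "S3" + (body + bytes([checksum])).hex().upper()


def parse_s3(line):
    """Decode one S3 record into (address, data), verifying the checksum."""
    line = line.strip()
    if not line.startswith("S3"):
        raise ValueError("not an S3 record")
    raw = bytes.fromhex(line[2:])
    if raw[0] != len(raw) - 1:
        raise ValueError("bad byte count")
    if (sum(raw) & 0xFF) != 0xFF:
        raise ValueError("bad checksum")
    return int.from_bytes(raw[1:5], "big"), raw[5:-1]


def collect_regions(records):
    """Merge (address, data) pairs into contiguous [start, bytearray] regions."""
    regions = []
    for addr, data in records:
        if regions and regions[-1][0] + len(regions[-1][1]) == addr:
            regions[-1][1].extend(data)   # continues the current region
        else:
            regions.append([addr, bytearray(data)])  # gap: start a new region
    return regions


def split_blocks(regions, block=2048):
    """Yield (address, bytes) programming blocks that never span a gap."""
    for start, data in regions:
        for off in range(0, len(data), block):
            yield start + off, bytes(data[off:off + block])


lines = [
    make_s3(0x1000, bytes(30)),
    make_s3(0x101E, bytes(30)),  # contiguous with the previous record
    make_s3(0x2000, bytes(10)),  # starts a new region after a gap
]
regions = collect_regions(parse_s3(l) for l in lines if l.startswith("S3"))
blocks = list(split_blocks(regions, block=32))  # small block size for the demo
for addr, data in blocks:
    print(f"0x{addr:04X}: {len(data)} bytes")
```

With the small demo block size of 32 bytes, the two contiguous 30-byte records
are repacked into one 32-byte and one 28-byte block, and the record after the
gap becomes its own 10-byte block: the block boundaries follow the regions, not
the original record boundaries.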

XRAM loading via fc-xram
========================

As of fc-host-tools-r13, fc-xram uses a new binary transfer protocol and is
dramatically faster than the original implementation from 2013.  The speed
increase comes not only from switching from hex to binary, but even more so
from eliminating the command-response turnaround time on every S3 record.  The
new XRAM loading times obtained on the Mother's Slackware 14.2 host system are:

Pirelli DP-L10 with built-in CP2102 USB-serial chip, 812500 baud, loading
hybrid-vpm fw build, 49969 S3 records: 0m27s

FCDEV3B interfaced via FT2232D adapter, 812500 baud, loading hybrid fw build,
78875 S3 records: 0m35s

With the previous version of fc-xram these two loads took 1m40s and 2m54s,
respectively.  With the current version of loadtools, XRAM loading is faster
than flash programming for the same fw image, as one would naturally expect
(the flash programming step on the target is replaced with a simple memcpy()
operation).  In the previous version, however, XRAM loading was slower because
of massive command-response exchange overhead: a command-response turnaround
time was incurred for every S3 record, typically carrying only 30 bytes of
payload.
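
The per-record cost can be read straight off the figures above.  The following
sketch divides the old and new load times by the S3 record counts, and compares
them with the bare serial transfer time of the payload.  The 30-byte payload
per record is the "typical" figure quoted above rather than an exact count, and
8N1 framing (10 bit times per byte) is an assumption of this sketch.

```python
LOADS = {
    # name: (S3 record count, old time in s, new time in s)
    "Pirelli DP-L10 / CP2102": (49969, 100, 27),   # 1m40s -> 0m27s
    "FCDEV3B / FT2232D": (78875, 174, 35),         # 2m54s -> 0m35s
}


def per_record_ms(total_s, records):
    """Average wall-clock time per S3 record, in milliseconds."""
    return total_s * 1000 / records


def payload_line_time(records, baud=812500, payload=30, bits_per_byte=10):
    """Serial time for the payload alone (assumes ~30 bytes/record, 8N1)."""
    return records * payload * bits_per_byte / baud


for name, (records, old_s, new_s) in LOADS.items():
    print(f"{name}: old {per_record_ms(old_s, records):.2f} ms/record, "
          f"new {per_record_ms(new_s, records):.2f} ms/record, "
          f"payload line time {payload_line_time(records):.1f} s")
```

The old implementation spent about 2 ms per record on both setups, while the
payload of one record occupies the line for well under half a millisecond at
812500 baud, so most of the old load time was turnaround rather than data
transfer.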

Additional complication with FTDI adapters and newer Linux kernel versions
==========================================================================

If you are using an FTDI adapter and a Linux kernel version newer than early
2017 (the change was introduced between 4.10 and 4.11), then you have one
additional complication.  A change was made to the ftdi_sio driver in the Linux
kernel that made many loadtools operations (basically everything other than
flash dumps, which are entirely target-driven) unbearably slow, at least with
previous versions of loadtools.  Those versions made many more command-response
exchanges with loadagent for smaller transfer units and thus were much more
sensitive to host system latency on these exchanges.  We do not yet know
whether this FTDI latency timer issue still has a significant negative impact
with current loadtools, but if it does, the solution is to run a special
setserial command.
After you plug in your FTDI-based USB-serial cable or connect the USB cable
between your PC or laptop and your FTDI adapter board, causing the
corresponding ttyUSBx device to appear, execute the following command:

setserial /dev/ttyUSBx low_latency

(Obviously change ttyUSBx to your actual ttyUSB number.)  Execute this
setserial command before running fc-loadtool or fc-xram, and then hopefully you
should get performance that is comparable to what I get on classic Slackware.
I say "hopefully" because I am not able to test it myself - I refuse to run any
OS that can be categorized as "modern" - but field reports of performance on
non-Slackware systems running newer Linux kernels (4.11 or later) are welcome,
both with and without the low_latency setting.  Please be sure to include your
Linux kernel version and your USB-serial adapter type in your report!
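
For anyone preparing such a report, the following Python sketch reads the
current FTDI latency timer value so that it can be included alongside the
timings.  It assumes that the ftdi_sio driver exposes a latency_timer attribute
under /sys/bus/usb-serial/devices/ttyUSBx/ on your kernel; the function name is
invented for this sketch, and ttyUSB0 is only a placeholder.

```python
from pathlib import Path

# Assumed sysfs location of per-port usb-serial attributes on Linux.
SYSFS_USB_SERIAL = Path("/sys/bus/usb-serial/devices")


def read_latency_timer(tty_name, sysfs_root=SYSFS_USB_SERIAL):
    """Return the FTDI latency timer in ms, or None if it cannot be read.

    None is returned when the port does not exist or the driver does not
    provide a latency_timer attribute (e.g., a non-FTDI adapter).
    """
    attr = Path(sysfs_root) / tty_name / "latency_timer"
    try:
        return int(attr.read_text().strip())
    except (OSError, ValueError):
        return None


value = read_latency_timer("ttyUSB0")  # placeholder: use your actual ttyUSB name
if value is None:
    print("latency_timer not available for ttyUSB0")
else:
    print(f"ttyUSB0 latency timer: {value} ms")
```

Reading the value once before and once after the setserial command shows
whether the low_latency setting actually took effect on your system.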