# HG changeset patch
# User Mychaela Falconia
# Date 1583728813 0
# Node ID 89ed8b374bc098bffdc947cf43ae0a4b6fe6cbbb
# Parent  be641fa7b68da662861648b0f9b7a9198fbbf1bd
doc/Loadtools-performance: finished updates for fc-host-tools-r13

diff -r be641fa7b68d -r 89ed8b374bc0 doc/Loadtools-performance
--- a/doc/Loadtools-performance	Sun Mar 08 23:06:03 2020 +0000
+++ b/doc/Loadtools-performance	Mon Mar 09 04:40:13 2020 +0000
@@ -81,32 +81,47 @@
 Notice the difference in flash programming times between GTA02 and FCDEV3B:
 the fw image size is almost exactly the same, any difference in latency between
 CP2102 and FT2232D is less likely to produce such significant time difference
-given our current 2048 byte transfer block size, thus the difference in physical
-flash program operation times between K5A3281CTM and S71PL129N flash chips seems
-to be the most likely explanation.
+given our current 2048-byte transfer block size (in fact fc-xram transfer times
+suggest that FT2232D is faster), thus the difference in physical flash program
+operation times between K5A3281CTM and S71PL129N flash chips seems to be the
+most likely explanation.
+
+It should also be noted that in the current version of fc-loadtool there is no
+difference in performance between flash program-bin, program-m0 and
+program-srec operations: they all use the same binary protocol with a 2048-byte
+transfer block size. There is no coupling between source S-records and flash
+programming operation blocks (2048-byte units) in the case of flash program-m0
+and program-srec: the new implementation of these commands reads the entire
+S-record image in a separate preparatory step on the host side and saves the
+bits to be programmed in a temporary binary file (automatically deleted
+afterward); the actual flash programming operation then proceeds from this
+internal binary source, but it knows about any discontiguous program regions
+and skips the gaps properly.
 
 XRAM loading via fc-xram
 ========================
 
-Our current fc-xram implementation is similar to the old 2013 implementation of
-flash program-m0 and program-srec commands in that fc-xram sends a separate ML
-command to loadagent for each S-record, thus the total XRAM image loading time
-is not only the serial bit transfer time, but also the overhead of command-
-response exchanges between fc-xram and loadagent. The flash programming times
-listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
-took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
-ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
-baud takes 2m54s.
+The new version of fc-xram as of fc-host-tools-r13 is dramatically faster than
+the original implementation from 2013, using a new binary transfer protocol.
+The speed increase comes not only from switching from hex to binary, but even
+more so from eliminating the command-response turnaround time on every S3
+record. The new XRAM loading times obtained on the Mother's Slackware 14.2
+host system are:
+
+Pirelli DP-L10 with built-in CP2102 USB-serial chip, 812500 baud, loading
+hybrid-vpm fw build, 49969 S3 records: 0m27s
 
-Why does XRAM loading take longer than flashing? Shouldn't it be faster because
-the flash programming step on the target is replaced with a simple memcpy()? 
-Answer: fc-xram is currently slower than flash program operations because the
-latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
-S-record at a time; the division of the image into S-records is determined by
-the tool that generates the SREC image, but TI's hex470 post-linker generates
-images with 30 bytes of payload per S-record. Having the operation proceed in
-smaller chunks increases the overhead of command-response exchanges and thus
-increases the overall time.
+FCDEV3B interfaced via FT2232D adapter, 812500 baud, loading hybrid fw build,
+78875 S3 records: 0m35s
+
+With the previous version of fc-xram these two loads took 1m40s and 2m54s,
+respectively. With the current version of loadtools, XRAM loading is faster
+than flash programming for the same fw image, as one would naturally expect
+(the flash programming step on the target is replaced with a simple memcpy()
+operation), but in the previous version XRAM loading was slower because of
+massive command-response exchange overhead: a command-response turnaround
+time was incurred for every S3 record, typically carrying only 30 bytes
+of payload.
 
 Additional complication with FTDI adapters and newer Linux kernel versions
 ==========================================================================
@@ -114,13 +129,16 @@
 If you are using an FTDI adapter and a Linux kernel version newer than early
 2017 (the change was introduced between 4.10 and 4.11), then you have one
 additional complication: a change was made to the ftdi_sio driver in the Linux
-kernel that makes many loadtools operations (basically everything other than
-flash dumps which are entirely target-driven) unbearably slow (much slower than
-the Slackware 14.2 reference times given above) unless you execute a special
-setserial command first. After you plug in your FTDI-based USB-serial cable or
-connect the USB cable between your PC or laptop and your FTDI adapter board,
-causing the corresponding ttyUSBx device to appear, execute the following
-command:
+kernel that made many loadtools operations (basically everything other than
+flash dumps, which are entirely target-driven) unbearably slow, at least with
+previous versions of loadtools that made many more command-response exchanges
+with loadagent for smaller transfer units and thus were much more sensitive to
+host system latency on these exchanges. We do not yet know whether this FTDI
+latency timer issue still has a significant negative impact with current
+loadtools, but if it does, the solution is to run a special setserial command.
+After you plug in your FTDI-based USB-serial cable or connect the USB cable
+between your PC or laptop and your FTDI adapter board, causing the
+corresponding ttyUSBx device to appear, execute the following command:
 
 setserial /dev/ttyUSBx low_latency
 
@@ -129,4 +147,6 @@
 should get performance that is comparable to what I get on classic Slackware.
 I say "hopefully" because I am not able to test it myself - I refuse to run any
 OS that can be categorized as "modern" - but field reports of performance on
-non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.
+non-Slackware systems running newer Linux kernels (4.11 or later) are welcome,
+both with and without the low_latency setting. Please be sure to include your
+Linux kernel version and your USB-serial adapter type in your report!
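Supplementary notes, not part of the patch above:

For readers who would like to visualize the preread-into-temp-file approach
described in the flash program-m0/program-srec paragraph, here is a rough C
sketch. It is NOT the actual fc-loadtool source (the struct region type, the
MAX_REGIONS limit and the final printf are inventions of this sketch), it
assumes that S3 records appear in ascending address order, and it skips
S-record checksum verification:

/* Hypothetical sketch, NOT actual fc-loadtool code: preread an SREC image
 * into a temporary binary file, remembering the discontiguous regions, then
 * "program" each region in 2048-byte blocks, skipping the gaps. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE	2048	/* transfer block size cited above */
#define MAX_REGIONS	64	/* arbitrary limit for this sketch */

struct region {
	unsigned long addr;	/* start address of a contiguous region */
	unsigned long len;	/* number of bytes in this region */
};

static unsigned
hex_byte(const char *s)
{
	char buf[3];

	buf[0] = s[0];
	buf[1] = s[1];
	buf[2] = '\0';
	return strtoul(buf, 0, 16);
}

int
main(int argc, char **argv)
{
	FILE *srec, *bin;
	char line[1024], addrbuf[9];
	unsigned char payload[256];
	struct region regions[MAX_REGIONS];
	unsigned nregions = 0, count, datalen, i;
	unsigned long addr, off, chunk;

	if (argc != 2) {
		fprintf(stderr, "usage: %s image.srec\n", argv[0]);
		exit(1);
	}
	srec = fopen(argv[1], "r");
	if (!srec) {
		perror(argv[1]);
		exit(1);
	}
	bin = tmpfile();	/* automatically deleted when closed */
	if (!bin) {
		perror("tmpfile");
		exit(1);
	}
	/* pass 1: collect S3 payloads into the temp file, noting the gaps;
	 * a real implementation would also verify each record's checksum */
	while (fgets(line, sizeof line, srec)) {
		if (line[0] != 'S' || line[1] != '3')
			continue;	/* only S3 records carry data here */
		count = hex_byte(line + 2);
		datalen = count - 5;	/* 4 address bytes + 1 checksum */
		memcpy(addrbuf, line + 4, 8);
		addrbuf[8] = '\0';
		addr = strtoul(addrbuf, 0, 16);
		for (i = 0; i < datalen; i++)
			payload[i] = hex_byte(line + 12 + i * 2);
		if (nregions && regions[nregions-1].addr +
				regions[nregions-1].len == addr)
			regions[nregions-1].len += datalen;
		else {
			if (nregions >= MAX_REGIONS) {
				fprintf(stderr, "too many regions\n");
				exit(1);
			}
			regions[nregions].addr = addr;
			regions[nregions].len = datalen;
			nregions++;
		}
		fwrite(payload, 1, datalen, bin);
	}
	fclose(srec);
	/* pass 2: walk each contiguous region in BLOCK_SIZE chunks; a real
	 * implementation would read the chunk back from the temp file and
	 * send it to the target here */
	for (i = 0; i < nregions; i++)
		for (off = 0; off < regions[i].len; off += BLOCK_SIZE) {
			chunk = regions[i].len - off;
			if (chunk > BLOCK_SIZE)
				chunk = BLOCK_SIZE;
			printf("program 0x%08lX, %lu bytes\n",
				regions[i].addr + off, chunk);
		}
	fclose(bin);	/* this is where the temp file disappears */
	return 0;
}

The point to take away is the decoupling: once the image has been reduced to
contiguous regions backed by a binary file, the programming loop is free to
use whatever block size the transfer protocol prefers (2048 bytes here),
independent of how the S-record generator chopped up the image.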
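As a rough sanity check on the turnaround-time explanation: the FCDEV3B load
improved from 2m54s (174 s) to 0m35s (35 s), a saving of 139 s over 78875 S3
records, i.e., about 1.8 ms per record; the Pirelli load improved from 1m40s
(100 s) to 0m27s (27 s), a saving of 73 s over 49969 records, i.e., about
1.5 ms per record. Per-record savings on the order of 1-2 ms are just what
one would expect from eliminating a command-response turnaround over USB
(full-speed USB has a 1 ms frame time); the switch from hex to binary also
contributes, so these figures are upper bounds on the pure turnaround cost.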
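Finally, if setserial happens to be unavailable on your system, the same
low_latency setting can be applied programmatically. The following sketch
(again my illustration, not part of loadtools) sets the ASYNC_LOW_LATENCY
flag through the standard Linux TIOCGSERIAL and TIOCSSERIAL ioctls, which is
what setserial does under the hood:

/* Sketch: programmatic equivalent of "setserial /dev/ttyUSBx low_latency",
 * not part of loadtools.  Sets the ASYNC_LOW_LATENCY flag on a serial
 * device via the standard Linux TIOCGSERIAL/TIOCSSERIAL ioctls. */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

int
main(int argc, char **argv)
{
	int fd;
	struct serial_struct ser;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/ttyUSBx\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		perror(argv[1]);
		exit(1);
	}
	if (ioctl(fd, TIOCGSERIAL, &ser) < 0) {
		perror("TIOCGSERIAL");
		exit(1);
	}
	ser.flags |= ASYNC_LOW_LATENCY;	/* tell the driver to favor latency */
	if (ioctl(fd, TIOCSSERIAL, &ser) < 0) {
		perror("TIOCSSERIAL");
		exit(1);
	}
	close(fd);
	return 0;
}

Like the setserial command it mimics, this needs to be run once after the
ttyUSBx device appears; the flag is per-port driver state and sticks until
the adapter is unplugged.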