# HG changeset patch
# User Mychaela Falconia
# Date 1583728813 0
# Node ID 89ed8b374bc098bffdc947cf43ae0a4b6fe6cbbb
# Parent  be641fa7b68da662861648b0f9b7a9198fbbf1bd
doc/Loadtools-performance: finished updates for fc-host-tools-r13

diff -r be641fa7b68d -r 89ed8b374bc0 doc/Loadtools-performance
--- a/doc/Loadtools-performance	Sun Mar 08 23:06:03 2020 +0000
+++ b/doc/Loadtools-performance	Mon Mar 09 04:40:13 2020 +0000
@@ -81,32 +81,47 @@
 Notice the difference in flash programming times between GTA02 and FCDEV3B:
 the fw image size is almost exactly the same, any difference in latency between
 CP2102 and FT2232D is less likely to produce such significant time difference
-given our current 2048 byte transfer block size, thus the difference in physical
-flash program operation times between K5A3281CTM and S71PL129N flash chips seems
-to be the most likely explanation.
+given our current 2048-byte transfer block size (in fact fc-xram transfer times
+suggest that FT2232D is faster), thus the difference in physical flash program
+operation times between K5A3281CTM and S71PL129N flash chips seems to be the
+most likely explanation.
+
+It should also be noted that in the current version of fc-loadtool there is no
+difference in performance between flash program-bin, program-m0 and
+program-srec operations: they all use the same binary protocol with a 2048-byte
+transfer block size. There is no coupling between source S-records and flash
+programming operation blocks (2048-byte units) in the case of flash program-m0
+and program-srec: the new implementation of these commands reads the entire
+S-record image in a separate preparatory step on the host side and saves the
+bits to be programmed in a temporary binary file (automatically deleted
+afterward); the actual flash programming operation then proceeds from this
+internal binary source, but it knows about any discontiguous program regions
+and skips the gaps properly.
 
 XRAM loading via fc-xram
 ========================
 
-Our current fc-xram implementation is similar to the old 2013 implementation of
-flash program-m0 and program-srec commands in that fc-xram sends a separate ML
-command to loadagent for each S-record, thus the total XRAM image loading time
-is not only the serial bit transfer time, but also the overhead of command-
-response exchanges between fc-xram and loadagent. The flash programming times
-listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
-took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
-ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
-baud takes 2m54s.
+The new version of fc-xram as of fc-host-tools-r13 is dramatically faster than
+the original implementation from 2013, using a new binary transfer protocol.
+The speed increase comes not only from switching from hex to binary, but even
+more so from eliminating the command-response turnaround time on every S3
+record. The new XRAM loading times obtained on the Mother's Slackware 14.2
+host system are:
+
+Pirelli DP-L10 with built-in CP2102 USB-serial chip, 812500 baud, loading
+hybrid-vpm fw build, 49969 S3 records: 0m27s
 
-Why does XRAM loading take longer than flashing? Shouldn't it be faster because
-the flash programming step on the target is replaced with a simple memcpy()? 
-Answer: fc-xram is currently slower than flash program operations because the
-latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
-S-record at a time; the division of the image into S-records is determined by
-the tool that generates the SREC image, but TI's hex470 post-linker generates
-images with 30 bytes of payload per S-record. Having the operation proceed in
-smaller chunks increases the overhead of command-response exchanges and thus
-increases the overall time.
+FCDEV3B interfaced via FT2232D adapter, 812500 baud, loading hybrid fw build,
+78875 S3 records: 0m35s
+
+With the previous version of fc-xram these two loads took 1m40s and 2m54s,
+respectively. With the current version of loadtools, XRAM loading is faster
+than flash programming for the same fw image, as one would naturally expect
+(the flash programming step on the target is replaced with a simple memcpy()
+operation), but in the previous version XRAM loading was slower because of
+massive command-response exchange overhead: a command-response turnaround
+time was incurred for every S3 record, typically carrying only 30 bytes
+of payload.
 
 Additional complication with FTDI adapters and newer Linux kernel versions
 ==========================================================================
@@ -114,13 +129,16 @@
 If you are using an FTDI adapter and a Linux kernel version newer than early
 2017 (the change was introduced between 4.10 and 4.11), then you have one
 additional complication: a change was made to the ftdi_sio driver in the Linux
-kernel that makes many loadtools operations (basically everything other than
-flash dumps which are entirely target-driven) unbearably slow (much slower than
-the Slackware 14.2 reference times given above) unless you execute a special
-setserial command first. After you plug in your FTDI-based USB-serial cable or
-connect the USB cable between your PC or laptop and your FTDI adapter board,
-causing the corresponding ttyUSBx device to appear, execute the following
-command:
+kernel that made many loadtools operations (basically everything other than
+flash dumps, which are entirely target-driven) unbearably slow, at least with
+previous versions of loadtools that made many more command-response exchanges
+with loadagent for smaller transfer units and thus were much more sensitive to
+host system latency on these exchanges. We do not yet know whether this FTDI
+latency timer issue still has a significant negative impact with current
+loadtools, but if it does, the solution is to run a special setserial command.
+After you plug in your FTDI-based USB-serial cable or connect the USB cable
+between your PC or laptop and your FTDI adapter board, causing the
+corresponding ttyUSBx device to appear, execute the following command:
 
 setserial /dev/ttyUSBx low_latency
 
@@ -129,4 +147,6 @@
 should get performance that is comparable to what I get on classic Slackware.
 I say "hopefully" because I am not able to test it myself - I refuse to run any
 OS that can be categorized as "modern" - but field reports of performance on
-non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.
+non-Slackware systems running newer Linux kernels (4.11 or later) are welcome,
+both with and without the low_latency setting. Please be sure to include your
+Linux kernel version and your USB-serial adapter type in your report!
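Supplementary notes, not part of the patch above:

For readers who would like to visualize the preread-into-temp-file approach
described in the flash program-m0/program-srec paragraph, here is a rough C
sketch. It is NOT the actual fc-loadtool source (the struct region type, the
MAX_REGIONS limit and the final printf are inventions of this sketch), it
assumes that S3 records appear in ascending address order, and it skips
S-record checksum verification:

/* Hypothetical sketch, NOT actual fc-loadtool code: preread an SREC image
 * into a temporary binary file, remembering the discontiguous regions, then
 * "program" each region in 2048-byte blocks, skipping the gaps. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE	2048	/* transfer block size cited above */
#define MAX_REGIONS	64	/* arbitrary limit for this sketch */

struct region {
	unsigned long addr;	/* start address of a contiguous region */
	unsigned long len;	/* number of bytes in this region */
};

static unsigned
hex_byte(const char *s)
{
	char buf[3];

	buf[0] = s[0];
	buf[1] = s[1];
	buf[2] = '\0';
	return strtoul(buf, 0, 16);
}

int
main(int argc, char **argv)
{
	FILE *srec, *bin;
	char line[1024], addrbuf[9];
	unsigned char payload[256];
	struct region regions[MAX_REGIONS];
	unsigned nregions = 0, count, datalen, i;
	unsigned long addr, off, chunk;

	if (argc != 2) {
		fprintf(stderr, "usage: %s image.srec\n", argv[0]);
		exit(1);
	}
	srec = fopen(argv[1], "r");
	if (!srec) {
		perror(argv[1]);
		exit(1);
	}
	bin = tmpfile();	/* automatically deleted when closed */
	if (!bin) {
		perror("tmpfile");
		exit(1);
	}
	/* pass 1: collect S3 payloads into the temp file, noting the gaps;
	 * a real implementation would also verify each record's checksum */
	while (fgets(line, sizeof line, srec)) {
		if (line[0] != 'S' || line[1] != '3')
			continue;	/* only S3 records carry data here */
		count = hex_byte(line + 2);
		datalen = count - 5;	/* 4 address bytes + 1 checksum */
		memcpy(addrbuf, line + 4, 8);
		addrbuf[8] = '\0';
		addr = strtoul(addrbuf, 0, 16);
		for (i = 0; i < datalen; i++)
			payload[i] = hex_byte(line + 12 + i * 2);
		if (nregions && regions[nregions-1].addr +
				regions[nregions-1].len == addr)
			regions[nregions-1].len += datalen;
		else {
			if (nregions >= MAX_REGIONS) {
				fprintf(stderr, "too many regions\n");
				exit(1);
			}
			regions[nregions].addr = addr;
			regions[nregions].len = datalen;
			nregions++;
		}
		fwrite(payload, 1, datalen, bin);
	}
	fclose(srec);
	/* pass 2: walk each contiguous region in BLOCK_SIZE chunks; a real
	 * implementation would read the chunk back from the temp file and
	 * send it to the target here */
	for (i = 0; i < nregions; i++)
		for (off = 0; off < regions[i].len; off += BLOCK_SIZE) {
			chunk = regions[i].len - off;
			if (chunk > BLOCK_SIZE)
				chunk = BLOCK_SIZE;
			printf("program 0x%08lX, %lu bytes\n",
				regions[i].addr + off, chunk);
		}
	fclose(bin);	/* this is where the temp file disappears */
	return 0;
}

The point to take away is the decoupling: once the image has been reduced to
contiguous regions backed by a binary file, the programming loop is free to
use whatever block size the transfer protocol prefers (2048 bytes here),
independent of how the S-record generator chopped up the image.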
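As a rough sanity check on the turnaround-time explanation: the FCDEV3B load
improved from 2m54s (174 s) to 0m35s (35 s), a saving of 139 s over 78875 S3
records, i.e., about 1.8 ms per record; the Pirelli load improved from 1m40s
(100 s) to 0m27s (27 s), a saving of 73 s over 49969 records, i.e., about
1.5 ms per record. Per-record savings on the order of 1-2 ms are just what
one would expect from eliminating a command-response turnaround over USB
(full-speed USB has a 1 ms frame time); the switch from hex to binary also
contributes, so these figures are upper bounds on the pure turnaround cost.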
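Finally, if setserial happens to be unavailable on your system, the same
low_latency setting can be applied programmatically. The following sketch
(again my illustration, not part of loadtools) sets the ASYNC_LOW_LATENCY
flag through the standard Linux TIOCGSERIAL and TIOCSSERIAL ioctls, which is
what setserial does under the hood:

/* Sketch: programmatic equivalent of "setserial /dev/ttyUSBx low_latency",
 * not part of loadtools.  Sets the ASYNC_LOW_LATENCY flag on a serial
 * device via the standard Linux TIOCGSERIAL/TIOCSSERIAL ioctls. */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

int
main(int argc, char **argv)
{
	int fd;
	struct serial_struct ser;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/ttyUSBx\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		perror(argv[1]);
		exit(1);
	}
	if (ioctl(fd, TIOCGSERIAL, &ser) < 0) {
		perror("TIOCGSERIAL");
		exit(1);
	}
	ser.flags |= ASYNC_LOW_LATENCY;	/* tell the driver to favor latency */
	if (ioctl(fd, TIOCSSERIAL, &ser) < 0) {
		perror("TIOCSSERIAL");
		exit(1);
	}
	close(fd);
	return 0;
}

Like the setserial command it mimics, this needs to be run once after the
ttyUSBx device appears; the flag is per-port driver state and sticks until
the adapter is unplugged.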