view doc/Loadtools-performance @ 678:f2a023c20653
doc/Loadtools-performance: removed the section about SREC programming
With our current fc-loadtool there is no difference in performance between
different flash programming commands, and the explanation of binary vs.
S-record issues has been moved to the new Binary-file-formats and
Flash-programming articles.
author | Mychaela Falconia <falcon@freecalypso.org> |
---|---|
date | Sun, 08 Mar 2020 22:56:31 +0000 |
parents | e66fafeeb377 |
children | 89ed8b374bc0 |
Memory dump performance
=======================

Here are the expected run times for the flash dump2bin operation of dumping the
entire flash content of a Calypso GSM device with the current version of
fc-loadtool which uses the new binary transfer protocol:

Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud:
6m4s

The same 4 MiB flash dump at 812500 baud: 0m52s

Dump of 8 MiB flash (e.g., Mot C155/156) at 812500 baud: 1m44s

These times are a 2x improvement compared to all previous versions of
fc-loadtool (prior to fc-host-tools-r13) which used a hex-based transfer
protocol.

Because of the architecture of fc-loadtool and its loadagent back-end, the run
time of a flash dump operation depends only on the serial baud rate and the
size of the flash area to be dumped; it should not depend on the USB-serial
adapter type or any host system properties, as long as the host system and
serial adapter combination supports the desired baud rate.  In contrast, flash
programming and fc-xram loading operations are quite different in that their
run times do depend on the host system and USB-serial adapter or other serial
port hardware - this host system dependency exists because of the way these
operations are implemented in our architecture.

Flash programming operations
============================

Here are some examples of expected flash programming times, all obtained on the
Mother's Slackware 14.2 host system:

Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 0m19s to
erase 37 sectors, 3m45s to program the image.

Flashing the same OM GTA02 modem with the same fw image, using a CP2102
USB-serial cable at 812500 baud: 0m19s to erase, 0m51s to program.

Flashing a Magnetite hybrid fw image (2378084 bytes) into an FCDEV3B board
(S71PL129N flash chip) via an FT2232D adapter at 812500 baud: 0m24s to erase
13 sectors (4 small and 9 large), 1m27s to program the image.
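
The statement in the Memory dump performance section that dump time depends
only on baud rate and flash size can be cross-checked with a little arithmetic.
The sketch below is not part of loadtools; it assumes 8N1 serial framing (10
bits on the wire per payload byte) and ignores any protocol overhead, so it
yields a lower bound only.  The names wire_seconds and fmt_mmss are made up for
this illustration.

```python
# Assumption (not stated in this document): 8N1 framing, i.e. one start bit,
# eight data bits and one stop bit, so 10 bits on the wire per payload byte.
BITS_PER_BYTE_ON_WIRE = 10
MIB = 1024 * 1024


def wire_seconds(nbytes, baud):
    """Minimum time to move nbytes over a serial link running at baud."""
    return nbytes * BITS_PER_BYTE_ON_WIRE / baud


def fmt_mmss(seconds):
    """Format seconds in the XmYs style used in this document."""
    total = round(seconds)
    return f"{total // 60}m{total % 60}s"


# (flash size in bytes, baud rate, time listed in this document)
dumps = [
    (4 * MIB, 115200, "6m4s"),
    (4 * MIB, 812500, "0m52s"),
    (8 * MIB, 812500, "1m44s"),
]

for size, baud, listed in dumps:
    bound = wire_seconds(size, baud)
    print(f"{size // MIB} MiB at {baud} baud: "
          f"wire time {fmt_mmss(bound)}, listed {listed}")
```

Under that assumption the computed wire times land within about a second of
the listed dump times, which fits the picture of a dump operation that keeps
the serial line busy nearly all the time.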

Regardless of whether you execute these two steps separately or use one of our
new flash e-program-{bin,m0,srec} commands, flash programming is always done in
two steps: first the erase operation covering the needed range of sectors, then
the actual programming operation that includes the data transfer.

Flash erase times are determined entirely by physical processes inside the
flash chip and thus should not be affected by software design or the serial
link: for each sector to be erased, fc-loadtool issues the sector erase command
to the flash chip and then polls the chip for operation completion status; the
polling is done over the serial link and thus may seem very slow, but the extra
bit of latency added by the finite polling speed is still negligible (at least
on the Mother's Slackware system) compared to the time of the actual sector
erase operation inside the flash chip.  One remaining flaw is that in our
current implementation the issuance of each individual sector erase command to
the flash chip takes 6 command-response exchanges between fc-loadtool and
loadagent; on my Slackware host system this extra overhead is still negligible
compared to the 0.5s or more for the actual erase operation time, but this
overhead may become more significant on host systems with higher latency.

After the erase operation, the execution time of the main flash programming
operation is a sum of 3 components:

* The time it takes for the bits to be transferred over the serial link;

* The time it takes for the flash programming operation to complete on the
  target (physics inside the flash chip);

* The overhead of command-response exchanges between fc-loadtool and loadagent.

Because image data transfer is taking place in this step, flash programming at
812500 baud is faster than 115200 baud, although it is not the same 7x
improvement as happens with flash dumps.
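
The three-component sum above can be illustrated with the two GTA02 programming
times listed earlier.  The sketch below is only a rough estimate: it assumes
8N1 framing (10 bits on the wire per byte) and attributes everything beyond the
raw serial transfer time to the other two components combined.  The two runs
used different USB-serial adapters (PL2303 and CP2102), so the remainders are
not strictly comparable.  The names wire_seconds and split_program_time are
made up for this illustration.

```python
# Assumption (not stated in this document): 8N1 framing, 10 wire bits per byte.
BITS_PER_BYTE_ON_WIRE = 10

# GTA02 figures quoted in this document.
IMAGE_BYTES = 2376448
PROGRAM_SECONDS = {
    115200: 3 * 60 + 45,  # 3m45s via PL2303
    812500: 51,           # 0m51s via CP2102
}


def wire_seconds(nbytes, baud):
    """Minimum time to move nbytes over a serial link running at baud."""
    return nbytes * BITS_PER_BYTE_ON_WIRE / baud


def split_program_time(nbytes, baud, measured):
    """Return (serial transfer time, everything else) for one programming run."""
    wire = wire_seconds(nbytes, baud)
    return wire, measured - wire


for baud, measured in PROGRAM_SECONDS.items():
    wire, rest = split_program_time(IMAGE_BYTES, baud, measured)
    print(f"{baud} baud: measured {measured}s = "
          f"{wire:.1f}s on the wire + {rest:.1f}s chip time and overhead")

baud_ratio = 812500 / 115200
speedup = PROGRAM_SECONDS[115200] / PROGRAM_SECONDS[812500]
print(f"baud rate ratio {baud_ratio:.2f}x, programming speed-up {speedup:.2f}x")
```

With these assumptions the part that does not scale with baud rate comes out at
roughly 20 seconds in both runs, which is why the programming speed-up is only
about 4.4x while the baud rate ratio is about 7x.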

The present version of fc-loadtool also uses a new binary transfer protocol
instead of the hex-based one used in previous versions (prior to
fc-host-tools-r13); this change produces a 2x improvement for OM GTA02
flashing, but only a smaller improvement for FCDEV3B flashing.

Notice the difference in flash programming times between GTA02 and FCDEV3B:
the fw image size is almost exactly the same, and any difference in latency
between CP2102 and FT2232D is unlikely to produce such a significant time
difference given our current 2048 byte transfer block size, thus the difference
in physical flash program operation times between K5A3281CTM and S71PL129N
flash chips seems to be the most likely explanation.

XRAM loading via fc-xram
========================

Our current fc-xram implementation is similar to the old 2013 implementation of
flash program-m0 and program-srec commands in that fc-xram sends a separate ML
command to loadagent for each S-record, thus the total XRAM image loading time
is not only the serial bit transfer time, but also the overhead of
command-response exchanges between fc-xram and loadagent.

The flash programming times listed above include flashing an FC Magnetite fw
image into an FCDEV3B, which took 1m27s for the programming step; doing an
fc-xram load of the same FC Magnetite fw image (built as ramimage.srec) into
the same FCDEV3B via the same FT2232D adapter at 812500 baud takes 2m54s.  Why
does XRAM loading take longer than flashing?  Shouldn't it be faster because
the flash programming step on the target is replaced with a simple memcpy()?
Answer: fc-xram is currently slower than flash program operations because the
latter send 2048 bytes at a time to loadagent, whereas fc-xram sends one
S-record at a time; the division of the image into S-records is determined by
the tool that generates the SREC image, and TI's hex470 post-linker generates
images with 30 bytes of payload per S-record.
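
To get a feel for how much the chunk size matters, the sketch below counts the
data-carrying command-response exchanges needed for an image of a given size.
It uses the 30-byte S-record payload and the 2048 byte transfer block size
quoted in this document.  Using the 2378084-byte flash image size for the RAM
image as well is an assumption made only for illustration (the actual
ramimage.srec payload size is not given here), and the function name
exchanges_needed is likewise made up.

```python
# Figures quoted in this document.
SREC_PAYLOAD_BYTES = 30    # payload per S-record from TI's hex470
FLASH_BLOCK_BYTES = 2048   # current flash programming transfer block size

# Assumption for illustration only: the RAM image carries about as many
# payload bytes as the Magnetite flash image listed earlier.
IMAGE_BYTES = 2378084


def exchanges_needed(image_bytes, chunk_bytes):
    """Number of chunks (and hence data exchanges) needed to send the image."""
    return -(-image_bytes // chunk_bytes)  # ceiling division on integers


per_srec = exchanges_needed(IMAGE_BYTES, SREC_PAYLOAD_BYTES)
per_block = exchanges_needed(IMAGE_BYTES, FLASH_BLOCK_BYTES)

print(f"one S-record per exchange: {per_srec} exchanges")
print(f"one 2048-byte block per exchange: {per_block} exchanges")
print(f"ratio: about {per_srec / per_block:.0f}x more exchanges")
```

Under these assumptions the per-S-record approach needs tens of thousands of
exchanges where the block-based approach needs about a thousand, so even a
small fixed cost per exchange adds up.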
Having the operation proceed in smaller chunks increases the overhead of
command-response exchanges and thus increases the overall time.

Additional complication with FTDI adapters and newer Linux kernel versions
==========================================================================

If you are using an FTDI adapter and a Linux kernel version newer than early
2017 (the change was introduced between 4.10 and 4.11), then you have one
additional complication: a change was made to the ftdi_sio driver in the Linux
kernel that makes many loadtools operations (basically everything other than
flash dumps which are entirely target-driven) unbearably slow (much slower than
the Slackware 14.2 reference times given above) unless you execute a special
setserial command first.  After you plug in your FTDI-based USB-serial cable or
connect the USB cable between your PC or laptop and your FTDI adapter board,
causing the corresponding ttyUSBx device to appear, execute the following
command:

setserial /dev/ttyUSBx low_latency

(Obviously change ttyUSBx to your actual ttyUSB number.)  Execute this
setserial command before running fc-loadtool or fc-xram, and then hopefully you
should get performance that is comparable to what I get on classic Slackware.
I say "hopefully" because I am not able to test it myself - I refuse to run any
OS that can be categorized as "modern" - but field reports of performance on
non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.
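
If you drive loadtools from a script, the setserial step can be wrapped so it
is not forgotten.  The following is a minimal sketch and not part of
loadtools: the function names are made up, /dev/ttyUSB0 is only an example
device name, and the sketch assumes that the setserial utility is installed
and that you have permission to change the port settings.

```python
import subprocess


def low_latency_command(tty):
    """Build the setserial invocation given in this document for one tty."""
    return ["setserial", tty, "low_latency"]


def set_low_latency(tty):
    """Run setserial on tty; raises CalledProcessError if setserial fails."""
    subprocess.run(low_latency_command(tty), check=True)


# Preview the command for an example device without executing it.
print(" ".join(low_latency_command("/dev/ttyUSB0")))
```

Call set_low_latency() with your actual ttyUSB device after the adapter has
been plugged in and before starting fc-loadtool or fc-xram.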