comparison doc/Loadtools-performance @ 630:8c6e7b7e701c

doc/Loadtools-performance: updates for new program-m0 and setserial
author Mychaela Falconia <falcon@freecalypso.org>
date Sat, 29 Feb 2020 21:22:27 +0000
parents 6824c4d55848
children e66fafeeb377
629:0f70fe9395c4 630:8c6e7b7e701c
1 Dumping and programming flash
2 =============================
3
1 Here are the expected run times for the flash dump2bin operation of dumping the 4 Here are the expected run times for the flash dump2bin operation of dumping the
2 entire flash content of a Calypso GSM device: 5 entire flash content of a Calypso GSM device:
3 6
4 Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud: 7 Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud:
5 12m53s 8 12m53s
17 run times do depend on the host system and USB-serial adapter or other serial 20 run times do depend on the host system and USB-serial adapter or other serial
18 port hardware - this host system dependency exists because of the way these 21 port hardware - this host system dependency exists because of the way these
19 operations are implemented in our architecture. 22 operations are implemented in our architecture.
20 23
21 Here are some examples of expected flash programming times, all obtained on the 24 Here are some examples of expected flash programming times, all obtained on the
22 Mother's Slackware 14.2 host system, using the flash program-bin command as 25 Mother's Slackware 14.2 host system:
23 opposed to program-m0 or program-srec:
24 26
25 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware 27 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
26 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s 28 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s
27 29
28 Flashing the same OM GTA02 modem with the same fw image, using a CP2102 30 Flashing the same OM GTA02 modem with the same fw image, using a CP2102
45 * The time it takes for the bits to be transferred over the serial link; 47 * The time it takes for the bits to be transferred over the serial link;
46 * The time it takes for the flash programming operation to complete on the 48 * The time it takes for the flash programming operation to complete on the
47 target (physics inside the flash chip); 49 target (physics inside the flash chip);
48 * The overhead of command-response exchanges between fc-loadtool and loadagent. 50 * The overhead of command-response exchanges between fc-loadtool and loadagent.
49 51
50 If you are starting out with a firmware image in m0 format, converting it to 52 Programming flash using program-m0 or program-srec
51 binary with mokosrec2bin (like our FC Magnetite build system always does) and 53 ==================================================
52 then flashing via program-bin is faster than flashing the original m0 image
53 directly via program-m0. Following the last example above of flashing a
54 Magnetite hybrid fw image into an FCDEV3B, the flashing operation via
55 program-bin took 2m11s; flashing the same image via program-m0 took 3m54s.
56 54
57 Flashing via program-bin is faster than program-m0 or program-srec because the 55 Prior to fc-host-tools-r12 flash programming via flash program-m0 or
58 program-bin operation uses a larger unit size internally. fc-loadtool 56 program-srec commands was much slower than flash program-bin. The reason for
59 implements all flash programming operations by sending AMFW or INFW commands to 57 this performance discrepancy was that the original implementation of these
60 loadagent; each AMFW or INFW command carries a string of 16-bit words to be 58 commands from 2013 was very straightforward: they operated in one pass, reading
61 programmed. Our program-bin operation programs 256 bytes at a time, i.e., 59 the S-record image file, and as each individual S-record was read, it was turned
62 sends one AMFW or INFW command per 256 bytes of image payload; our program-m0 60 into an AMFW or INFW command to loadagent. In the case of *.m0 files generated
63 and program-srec operations program one S-record at a time, i.e., each S-record 61 by TI's hex470 post-linker, each S-record carries 30 bytes of payload, thus the
64 in the source image turns into its own AMFW or INFW command to loadagent. In 62 flashing operation proceeded in 30-byte units, incurring the overhead of a
65 the case of m0 images produced by TI's hex470 post-linker, each S-record carries 63 command-response exchange for every 30 bytes. In contrast, our current flash
66 30 bytes of payload, thus flashing that m0 image directly with program-m0 will 64 program-bin implementation sends 256 bytes of payload per each AMFW or INFW
67 proceed in 30-byte units, whereas converting it to binary and then flashing with 65 command; this larger unit size decreases the overhead of command-response
68 program-bin will proceed in 256-byte units. The smaller unit size slows down 66 exchanges between fc-loadtool and loadagent.
69 the overall operation by increasing the overhead of command-response exchanges.
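
To make the unit-size argument concrete, here is a minimal sketch that counts
command-response exchanges for a given image size. It borrows the 2376448-byte
GTA02 image size from the examples above purely as an illustration, and it
assumes a gap-free image with every unit full except possibly the last one.
The function name is hypothetical.

```python
def exchanges(image_bytes: int, unit_bytes: int) -> int:
    """Number of AMFW/INFW command-response exchanges, one per unit (ceiling division)."""
    return -(-image_bytes // unit_bytes)


image = 2376448  # bytes, illustrative size taken from the GTA02 example
per_srecord = exchanges(image, 30)    # one exchange per 30-byte S-record
per_256 = exchanges(image, 256)       # one exchange per 256-byte unit
print(per_srecord, per_256)           # prints 79215 9283
```

Under these assumptions the 30-byte unit size needs about 8.5 times as many
exchanges as the 256-byte unit size for the same payload.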
70 67
71 XRAM loading via fc-xram is similar to flash program-m0 and program-srec in that 68 Why do we need flash program-m0 and program-srec commands at all, why not
72 fc-xram sends a separate ML command to loadagent for each S-record, thus the 69 simply convert all SREC images to straight binary first and then program with
73 total XRAM image loading time is not only the serial bit transfer time, but also 70 flash program-bin? The reason is that S-record images can contain multiple
74 the overhead of command-response exchanges between fc-xram and loadagent. Going 71 discontiguous program regions with gaps in between. All of our current
75 back to the same FC Magnetite fw image that can be flashed into an FCDEV3B in 72 FreeCalypso firmwares built with TI's TMS470 toolchain contain a few small gaps
76 2m11s via program-bin or in 3m54s via program-m0, doing an fc-xram load of that 73 in the fwimage.m0 file, filled with 0xFF bytes when converted to straight binary
77 same fw image (built as ramimage.srec) into the same FCDEV3B via the same 74 with mokosrec2bin, but TI's own firmwares built for 8 MiB flash configurations
78 FT2232D adapter at 812500 baud takes 2m54s - thus we can see that fc-xram 75 often had much bigger gaps in them.
79 loading is faster than flash program-m0 or program-srec, but slower than flash 76
80 program-bin. 77 As of fc-host-tools-r12 we finally have a more efficient solution for flashing
78 discontiguous SREC images: our new implementation of flash program-m0 and
79 program-srec commands begins with a preliminary pass (pure host operation, no
80 target interaction) of reading the S-record image file; the payload bits are
81 written into a temporary binary file (automatically deleted afterward), while
82 the address and length of each discontiguous region are remembered internally.
83 Then the actual flash programming operation proceeds just like program-bin,
84 reading from the internal binary file and sending 256 bytes of payload at a time
85 to loadagent, but using the remembered knowledge of where the discontiguous
86 regions lie.
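
The two-pass idea described above can be sketched as follows. This is not the
fc-loadtool code: it is a simplified model that takes already-parsed S-records
as (address, payload) pairs, assumes they arrive in ascending address order,
and keeps the payload in memory instead of a temporary binary file. All names
are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, bytes]      # (load address, payload) of one parsed S-record
Region = Tuple[int, bytearray]  # (start address, contiguous payload)


def collect_regions(records: Iterable[Record]) -> List[Region]:
    """Pass 1 (host only): merge back-to-back records into contiguous regions."""
    regions: List[Region] = []
    for addr, payload in records:
        if regions and regions[-1][0] + len(regions[-1][1]) == addr:
            regions[-1][1].extend(payload)              # continues current region
        else:
            regions.append((addr, bytearray(payload)))  # gap: start a new region
    return regions


def programming_units(regions: List[Region], unit: int = 256) -> Iterator[Tuple[int, bytes]]:
    """Pass 2: yield (address, data) units of at most `unit` bytes within each region."""
    for start, data in regions:
        for off in range(0, len(data), unit):
            yield start + off, bytes(data[off:off + unit])


# Three adjacent 30-byte records, then a gap, then one more record.
records = [(0, bytes(30)), (30, bytes(30)), (60, bytes(30)), (0x1000, bytes(30))]
units = list(programming_units(collect_regions(records)))
print([(hex(addr), len(data)) for addr, data in units])
# prints [('0x0', 90), ('0x1000', 30)]
```

In this toy input four S-records collapse into two programming units; with a
real image the regions are much longer, so nearly every unit carries the full
256 bytes.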
87
88 XRAM loading via fc-xram
89 ========================
90
91 Our current fc-xram implementation is similar to the old 2013 implementation of
92 flash program-m0 and program-srec commands in that fc-xram sends a separate ML
93 command to loadagent for each S-record, thus the total XRAM image loading time
94 is not only the serial bit transfer time, but also the overhead of command-
95 response exchanges between fc-xram and loadagent. The flash programming times
96 listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
97 took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
98 ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
99 baud takes 2m54s.
81 100
82 Why does XRAM loading take longer than flashing? Shouldn't it be faster because 101 Why does XRAM loading take longer than flashing? Shouldn't it be faster because
83 the flash programming step on the target is replaced with a simple memcpy()? 102 the flash programming step on the target is replaced with a simple memcpy()?
84 Answer: fc-xram is currently slower than flash program-bin because the latter 103 Answer: fc-xram is currently slower than flash program operations because the
85 sends 256 bytes at a time to loadagent, whereas fc-xram sends one S-record at a 104 latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
86 time; the division of the image into S-records is determined by the tool that 105 S-record at a time; the division of the image into S-records is determined by
87 generates the SREC image, but TI's hex470 post-linker generates images with 30 106 the tool that generates the SREC image, but TI's hex470 post-linker generates
88 bytes of payload per S-record. Having the operation proceed in smaller chunks 107 images with 30 bytes of payload per S-record. Having the operation proceed in
89 increases the overhead of command-response exchanges and thus increases the 108 smaller chunks increases the overhead of command-response exchanges and thus
90 overall time. 109 increases the overall time.
110
111 Additional complication with FTDI adapters and newer Linux kernel versions
112 ==========================================================================
113
114 If you are using an FTDI adapter and a Linux kernel version newer than early
115 2017 (the change was introduced between 4.10 and 4.11), then you have one
116 additional complication: a change was made to the ftdi_sio driver in the Linux
117 kernel that makes many loadtools operations (basically everything other than
118 flash dumps which are entirely target-driven) unbearably slow (much slower than
119 the Slackware 14.2 reference times given above) unless you execute a special
120 setserial command first. After you plug in your FTDI-based USB-serial cable or
121 connect the USB cable between your PC or laptop and your FTDI adapter board,
122 causing the corresponding ttyUSBx device to appear, execute the following
123 command:
124
125 setserial /dev/ttyUSBx low_latency
126
127 (Obviously change ttyUSBx to your actual ttyUSB number.) Execute this
128 setserial command before running fc-loadtool or fc-xram, and then hopefully you
129 should get performance that is comparable to what I get on classic Slackware.
130 I say "hopefully" because I am not able to test it myself - I refuse to run any
131 OS that can be categorized as "modern" - but field reports of performance on
132 non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.