# HG changeset patch
# User Mychaela Falconia
# Date 1613119331 0
# Node ID ec184dad4877b287d9a70ee133c8e228ed9d900a
# Parent ac33ec9a07d958eef94b8a80ad535d225164f66c
SIM-data-formats article written

diff -r ac33ec9a07d9 -r ec184dad4877 SIM-data-formats
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/SIM-data-formats Fri Feb 12 08:42:11 2021 +0000
@@ -0,0 +1,181 @@
+FreeCalypso is developing a family of several different tools that operate on
+SIM cards and user data (primarily phonebooks) stored in them, accessing the
+same underlying data through various mechanisms:
+
+* Our current fc-simtool utility operates on SIM cards inserted into a smart
+  card "reader" device, without going through any kind of phone or other GSM
+  device - most direct manipulation of SIM user data content.
+
+* We have plans to develop a companion utility (tentatively named fc-simint)
+  that will operate on SIM cards inserted into Calypso phones or FC modem
+  boards, working on the same principle as fc-loadtool (suspending and bypassing
+  the Calypso device's regular operational firmware), but operating on the
+  device's SIM interface rather than its flash. This companion utility is
+  planned to replicate the end-user-oriented functionality of fc-simtool.
+
+* We have a FreeCalypso User Phone Tools suite that communicates with FC modem
+  boards and the future FC phone handset via AT commands. We have plans to add
+  phonebook manipulation commands to this suite (based on AT+CPBR and AT+CPBW),
+  reading and writing phonebook data files in the same format as fc-simtool.
+
+Because we have several different tools (some already written, others only
+planned) that will need to read and write exactly the same data formats, and
+because these tools will have to live in different source repositories (totally
+different underlying hardware and system library requirements), the data format
+specification needs to be global and independent of particular hw tools - it is
+the present document.
+
+GSM 03.38 / 23.038 string representation
+========================================
+
+The world of GSM does not use ASCII - in all places where ASCII strings would
+appear in the world of ordinary computing, GSM uses its own different 7-bit
+character set instead, defined in GSM TS 03.38 or 3GPP TS 23.038. Many SIM card
+data files (including phonebooks) contain so-called alpha fields in which GSM
+03.38 (not ASCII!) characters are packed into 8-bit bytes, with the high bit
+zeroed. (These alpha fields also allow alternative UCS-2 encodings,
+distinguished by the high bit being set - but we handle this case separately.)
+Some other SIM card data files (EF_PNN for example) contain GSM 03.38 7-bit text
+strings packed into bytes like in SMS.
+
+However, when we store text strings (such as phonebook contact names) that have
+been read out of a SIM (or are intended to be written to a SIM) in UNIX text
+files, or pass them around in command line arguments, we need an ASCII-based
+representation of these text strings that are encoded in GSM7 in the actual
+GSM/SIM world. Furthermore, our ASCII representation needs to be 100% lossless
+and well-defined.
+
+Our function for lossless conversion of GSM 03.38 strings to ASCII operates as
+follows:
+
+* The output is always enclosed in double-quote characters, as in "text string".
+
+* All GSM7 code points that map to characters that are also present in ASCII
+  translate to these ASCII characters: for example, GSM7 code 0x00 becomes '@',
+  and GSM7 code 0x02 becomes '$'.
+
+* Any double-quote characters in the data are escaped with a backslash,
+  becoming \"
+
+* GSM7 escape sequences for ASCII characters [\]^ and {|}~ are recognized and
+  converted to these ASCII characters; \ is then escaped in the output as \\
+
+* GSM7 code points corresponding to CR and LF are represented as \r and \n
+
+* GSM7 escape characters that are not part of a valid sequence for [\]^ or {|}~
+  are represented as \e
+
+* All other GSM7 characters that cannot be represented in ASCII in any other
+  way are represented as \xX escapes, where xX is a two-digit hexadecimal number
+  in the range between 00 and 7F, inclusive.
+
+The result of these rules is as follows:
+
+* If the text item consists entirely of characters that exist in ASCII (the most
+  common use case), it will appear naturally in ASCII, even if it contains
+  characters like '@' and '$' that have different code points in GSM7, or
+  characters in the [\]^ and {|}~ sets that require escaping in GSM7.
+
+* Any text item containing weird characters will still be converted losslessly,
+  so it can be written back into the SIM or decoded manually by a GSM7-knowing
+  user, and the representation in data files and command output is always
+  printable ASCII, nothing else.
+
+* In cases where an occasional weird character appears in an otherwise ASCII-
+  dominated string, it is easy to both mentally decode and manually enter such
+  characters when necessary. For example, if one of your SIM contacts is a lady
+  named Michele who spells her name in the French way, with an accent grave on
+  the first 'e' (non-ASCII character U+00E8), her name shall be entered as
+  "Mich\04le", nicely preserving the needed non-ASCII character whose GSM 03.38
+  code point is 0x04.
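
The output rules above can be sketched in Python as follows. This is only an illustrative sketch, not the actual FreeCalypso code: the function name gsm7_to_quoted, the table names and the choice of uppercase hex digits are assumptions made for this example, and the hex escape is assumed to be a backslash followed by two hex digits, as in the "Mich\04le" example.

```python
# GSM 03.38 default-alphabet code points that have an ASCII equivalent.
# 0x00, 0x02 and 0x11 are '@', '$' and '_'; space through '?' and the
# letters share their ASCII codes, except 0x24 which is the currency sign.
GSM7_TO_ASCII = {0x00: "@", 0x02: "$", 0x11: "_"}
for code in range(0x20, 0x7B):
    ch = chr(code)
    if ch.isalnum() or (code <= 0x3F and code != 0x24):
        GSM7_TO_ASCII[code] = ch

# Second byte of the two-byte GSM7 escape sequences (0x1B prefix) that
# stand for the ASCII characters [\]^ and {|}~
GSM7_EXT_TO_ASCII = {0x14: "^", 0x28: "{", 0x29: "}", 0x2F: "\\",
                     0x3C: "[", 0x3D: "~", 0x3E: "]", 0x40: "|"}


def gsm7_to_quoted(data: bytes) -> str:
    """Convert unpacked GSM7 bytes (one code point per byte, high bit
    clear) into the quoted ASCII form described in the text."""
    out = []
    i = 0
    while i < len(data):
        code = data[i]
        i += 1
        if code > 0x7F:
            raise ValueError("not a GSM7 code point: 0x%02X" % code)
        if code == 0x1B:
            nxt = data[i] if i < len(data) else None
            if nxt in GSM7_EXT_TO_ASCII:
                ch = GSM7_EXT_TO_ASCII[nxt]
                # A decoded backslash must itself be escaped in the output.
                out.append("\\\\" if ch == "\\" else ch)
                i += 1
            else:
                # Escape byte that does not start a recognized sequence.
                out.append("\\e")
        elif code == 0x0A:
            out.append("\\n")
        elif code == 0x0D:
            out.append("\\r")
        elif code == 0x22:
            out.append('\\"')
        elif code in GSM7_TO_ASCII:
            out.append(GSM7_TO_ASCII[code])
        else:
            # No ASCII counterpart: emit backslash plus two hex digits.
            out.append("\\%02X" % code)
    return '"' + "".join(out) + '"'


print(gsm7_to_quoted(b"Mich\x04le"))   # "Mich\04le"
```

Every branch emits only printable ASCII, and each GSM7 byte (or recognized escape pair) maps to a distinct output token, which is what makes the representation reversible.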
+
+When a string argument that is destined for conversion to GSM7 is parsed, our
+input parser always interprets any backslash (\) characters as escapes; it
+understands all of the same escape sequences which we emit in output:
+
+\"   literal "
+\\   literal \ (encoded in GSM 03.38 as another form of escape)
+\e   GSM 03.38 escape character 0x1B
+\n   GSM 03.38 LF character 0x0A
+\r   GSM 03.38 CR character 0x0D
+\xX  GSM 03.38 code point xX, passed through literally
+
+If the input contains ASCII characters which do not exist in GSM7 (` and all
+control characters except \n and \r), it is an error.
+
+If our ASCII-to-GSM7 conversion functions are given 8-bit input, such input is
+interpreted as ISO 8859-1: any 8859-1 high characters that have GSM7
+counterparts will be translated accordingly. (Non-GSM7-mappable high characters
+are an error just like non-GSM7-mappable ASCII chars.) However, our output is
+always 7-bit ASCII only, using \xX escapes for GSM 03.38 characters that fall
+outside of ASCII.
+
+Phonebook file format
+=====================
+
+fc-simtool pb-dump command displays SIM phonebook content on the terminal or
+saves it in a file in the format defined here, and other tools such as
+fc-simtool pb-update command need to be able to read back the same format
+losslessly. The phonebook file format is hereby shown by way of example:
+
+#1: #646#,0x81 "Check Minutes"
+#2: #674#,0x81 "Check Text Usage"
+#3: #225#,0x81 "Check Balance"
+#4: 8675309,0x81 "Jenny"
+#5: 88211016401,0x91 "sysmoUSIM-SJS1 MSISDN"
+#6: 44444,0x81 HEX 810B0893BEC03ABEBC209A9FA1A1
+#7: *123#,0x81 ""
+#8: 5551234,0x81 "HEX magic spells by Mich\04le"
+
+The rules are as follows:
+
+* Each line in the file format represents one phonebook record.
+
+* The decimal number between the initial '#' and the following ':' is the
+  record number in the phonebook, between 1 and 255 as in the SIM protocol
+  READ RECORD and UPDATE RECORD commands.
+
+* The phone number is always given without quotes, and consists only of digits
+  and '*' and '#' characters - no '+' international symbol is allowed in this
+  file format.
+
+* The TON/NPI byte is required, is always given in hex as 0xXX (no other form
+  allowed in this file format), and is separated from the phone number digit
+  string by a comma. Note how this byte usually equals 0x91 for international
+  numbers (those entered with a '+' in typical UIs) or 0x81 otherwise.
+
+* Either a quoted-string or a hex-string is always present at the end of each
+  record, giving the alpha tag for the phonebook entry. This field is
+  mandatory in the file format; if there is no alpha tag (really meaning empty
+  alpha tag), the line ends with empty quoted-string "".
+
+* Quoted-strings for the alpha tag are used for either empty/null or
+  GSM7-encoded alpha tags; hex-strings are used for UCS2-encoded alpha tags.
+
+* The format of hex-string alpha tags is as shown in entry #6 in the example
+  above - this example gives a contact name in Russian. (Full decoding of this
+  contact name is left as an exercise for adventurous readers - see
+  ETSI TS 102 221 Annex A and the Cyrillic block of Unicode.)
+
+* Hex-strings can be used for any arbitrary bytes in the alpha tag, but are only
+  needed for UCS-2 encodings. Every possible GSM7 string can be represented in
+  our quoted-string notation.
+
+* The quoted-string (GSM 03.38) form of the alpha tag must always be quoted,
+  even if quotes seem optional like in the "Jenny" example above (record #4).
+  The absence of quotes is what allows the HEX keyword to be distinguished:
+  compare and contrast records #6 and #8 in the example.
+
+The above format applies when the almost-never-used CCP and EXT bytes in the
+phonebook record both equal 0xFF, meaning not used. In the unlikely case when
+these fields are used, the following extra fields are added to the line-based
+representation:
+
+* If CCP != 0xFF, a "CCP=%u " field is inserted between the phone number and
+  the alpha tag.
+
+* If EXT != 0xFF, an "EXT=%u " field is inserted between the phone number and
+  the alpha tag.
+
+* If both CCP and EXT are present, the CCP= field appears before the EXT= field,
+  same order as in the SIM binary record.
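
The line format above lends itself to parsing with a single regular expression. The following Python sketch is illustrative only and is not taken from fc-simtool: the names PbEntry and parse_pb_line are hypothetical, the CCP= and EXT= fields are assumed to follow the TON/NPI byte separated by whitespace, and the quoted alpha tag is returned still in its escaped ASCII form rather than converted to GSM7.

```python
import re
from typing import NamedTuple, Optional


class PbEntry(NamedTuple):
    rec: int                      # record number, 1..255
    number: str                   # digits plus '*' and '#'
    ton_npi: int                  # TON/NPI byte
    ccp: int                      # 0xFF when the CCP= field is absent
    ext: int                      # 0xFF when the EXT= field is absent
    alpha_escaped: Optional[str]  # quoted-string body, escapes left intact
    alpha_hex: Optional[bytes]    # raw bytes when the HEX form is used


PB_LINE_RE = re.compile(
    r'#(?P<rec>[0-9]+):\s+'
    r'(?P<num>[0-9*#]+),0x(?P<ton>[0-9A-Fa-f]{2})\s+'
    r'(?:CCP=(?P<ccp>[0-9]+)\s+)?'
    r'(?:EXT=(?P<ext>[0-9]+)\s+)?'
    # Quoted string (backslash escapes allowed inside) or HEX keyword.
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"'
    r'|HEX\s+(?P<hex>(?:[0-9A-Fa-f]{2})+))'
    r'\s*'
)


def parse_pb_line(line: str) -> PbEntry:
    """Parse one phonebook line; raise ValueError if it is malformed."""
    m = PB_LINE_RE.fullmatch(line)
    if m is None:
        raise ValueError("malformed phonebook line: %r" % line)
    rec = int(m.group("rec"))
    if not 1 <= rec <= 255:
        raise ValueError("record number out of range: %d" % rec)
    ccp = int(m.group("ccp")) if m.group("ccp") is not None else 0xFF
    ext = int(m.group("ext")) if m.group("ext") is not None else 0xFF
    hexstr = m.group("hex")
    alpha_hex = bytes.fromhex(hexstr) if hexstr is not None else None
    return PbEntry(rec, m.group("num"), int(m.group("ton"), 16),
                   ccp, ext, m.group("quoted"), alpha_hex)


entry = parse_pb_line('#4: 8675309,0x81 "Jenny"')
print(entry.rec, entry.number, hex(entry.ton_npi), entry.alpha_escaped)
```

Because the quoted form always begins with a double-quote character, the regular expression can tell record #8 (a quoted name that happens to start with the word HEX) apart from record #6 (an actual hex-string) without any lookahead.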