Using Unicode in embedded software
Unicode provides a single character set that can represent nearly all of the world’s written languages. Mainstream software development has largely moved to Unicode already, helped by the fact that in modern languages such as Java and C#, type char is defined to hold a Unicode (UTF-16) code unit. However, in C a char is invariably 8 bits on modern architectures, and the associated character set is ASCII. Does this matter for embedded software?
It matters if you need either to accept input or to generate output in languages not supported by ASCII. Maybe you’re planning some new embedded software now. Your current customers may be happy with English language status text on the display of your device; but what export markets might you miss out on? Designing an architecture suitable for more than one language is less expensive when the software is first written than retro-fitting it later. Here are three ways you can do it.
1. Use an 8-bit extended character set. The standard here is ISO 8859. Unfortunately, different languages need different character sets, because 256 characters is by no means enough to cover a wide variety of languages. So ISO 8859 defines 15 different character sets, of which ISO 8859-1 (aka Latin-1) is the most widely used (note however that it doesn’t include the Euro sign – you’ll need ISO 8859-15 or -16 instead if you want to display currency symbols).
This approach has some drawbacks:
- If you need to support more languages than a single extended character set supports, then you’ll need to use different character sets for different markets. This in turn will require the rendering of characters on any display device to be dependent on the target market.
- Ideally, you want strings in your source file to look exactly as they will on the output device, for example “fermé”. But to do this, you’ll need to configure your editor to use the same character set as the target. Many editors don’t provide this facility, and if you’re not careful then you’ll end up using the wrong character set. So you’ll probably have to write “ferm\xE9” instead.
2. Use UTF-8 encoded strings. UTF-8 is a way of encoding any Unicode character string as a sequence of 8-bit bytes. Characters in the ASCII range 00-7F (hex) are represented in a single byte and are the same as in ASCII. Other characters are represented in 2 to 4 bytes.
The main drawback of this approach is that in a C array of characters, the number of characters represented is no longer equal to the number of elements in the array (or up to the null terminator). Whenever you work with the length of a string, you need to be very clear whether you mean the number of char elements in it or the number of displayed characters it represents.
3. Use wide characters. This is the most flexible approach. If you can afford the memory space to store multiple translations of your status strings, then you can produce just one version of your device, with a configuration option to select the end-user language. But watch out for the following:
- wchar_t will be either a 16-bit or a 32-bit character type, depending on your compiler. So characters and strings will take 2 or 4 times as much memory as they do when using plain char.
- If wchar_t is 16 bits, then Unicode characters outside the first 65536 code points (the Basic Multilingual Plane) will either not be supported at all, or will be encoded as 2 wide characters (a UTF-16 surrogate pair). However, such characters are used only in rare scripts, and the chances are that an embedded device will not need to support them.
- If you want WYSIWYG strings such as “fermé” in source text, you’ll need to store your source files in some form of Unicode, and you’ll need to make sure that your compiler understands the encoding. Most compilers support UTF-8 source files these days. If you need a free Windows-hosted editor that supports Unicode, you could try PSPad.
- Unicode provides two ways of representing characters containing diacritical marks, such as “é”. In all common cases, there is a single code point that represents the composite character. However, it is also possible to represent them using the unadorned character followed by a second character that represents a diacritical mark to be combined with it. You’ll almost certainly want to use the composite version, so that 1 wide character == 1 displayed character. You’ll need to make sure that your editor represents them this way, and that any Unicode input provided to your program is in this form.
- You may have been assuming that sizeof(char) == 1 in code such as the following:
static char msg[] = "closed";
const size_t msgChars = sizeof(msg) - 1;
The second line should instead be written as:
const size_t msgChars =
(sizeof(msg)/sizeof(msg[0])) - 1;
so that it still gives the correct number of characters when you replace the first line by:
static wchar_t msg[] = L"fermé";
- Header file wchar.h provides wide character versions of many of the standard string functions in string.h; however, the semantics are not always the same.
- Wide characters are not type-safe in C90, because wchar_t is just a typedef for some other integral type (and you’ll need to #include <wchar.h> to make it available). Once again, C++ does it better, by providing wchar_t as a separate built-in type. If you’re using ArC to analyse your software, then you’ll get the benefits of a strong wchar_t type even in C, because ArC pretends it is a separate type and ignores any typedef of wchar_t.
What if you’re not ready to commit to Unicode, but you might want to switch your software to Unicode in future? You can use the following definitions:
#if defined(UNICODE)
typedef wchar_t char_t;
#define CONCAT(_a, _b) _a ## _b
#define _T(_text) CONCAT(L, _text)
#else
typedef char char_t;
#define _T(_text) _text
#endif
You can then write the following:
const char_t msg[] = _T("closed");
making it easier to switch between ASCII and Unicode. If you use functions from string.h in your program, then you may also want to #define your own versions that map either to the standard versions or to the wide versions.
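One way to do that mapping is to drive it from the same UNICODE configuration switch. A sketch, where the STRLEN and STRCMP names are invented for illustration (they are not part of any standard; pick whatever naming convention suits your code base):

```c
#include <string.h>
#include <wchar.h>

/* Map generic string-function names onto either the narrow or the
   wide versions, selected by the UNICODE configuration macro. */
#if defined(UNICODE)
  #define STRLEN(s)    wcslen(s)
  #define STRCMP(a, b) wcscmp(a, b)
#else
  #define STRLEN(s)    strlen(s)
  #define STRCMP(a, b) strcmp(a, b)
#endif
```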