What are you trying to prove?
If you’re thinking of using formal verification to increase the quality and reliability of your software, one of the first decisions you need to make is what you want to prove. Roughly speaking, you have three levels to choose from:
1. Predictable execution – that is, freedom from undefined and implementation-defined behaviour. The proofs cover such things as: all array indices are in bounds, variables are initialized before use, conversions of values to narrower types do not result in overflow, arithmetic does not overflow or underflow, null pointers are not dereferenced, and division by zero does not occur. It includes all the usual causes of what the MISRA standards call “run-time errors”, so it covers MISRA C rule 21.1.
2. Partial correctness – all the above, and the program meets its specification, if it terminates. The specification you provide may be a full functional specification, or a partial specification covering the most important parts, such as safety properties.
3. Full correctness – all the above, and the program terminates.
Most formal program verification systems supports one or more of these levels. For example, SPARK Ada supports levels 1 and 2. In SPARK-speak, level 1 is called “exception freedom”, because Ada programs normally perform run-time checking, so run-time errors manifest themselves as exceptions (which are predictable). On the other hand, Perfect Developer is focussed on full correctness, because it is a top-down tool that starts from specifications – leaving you no choice but to specify what the program is intended to achieve. ArC gives you the choice of all three levels.
You don’t have to choose a single verification level for the entire system. So you might choose full correctness for the most critical parts of the system, but be content with predictable execution elsewhere.
What difference does the choice of verification level make to you as software developer? The main effect is that if you wish to prove partial or full correctness rather than just predictable execution, you will need to write function postconditions that state what the functions are intended to achieve. If you choose only to prove predictable execution, you will sometimes still need to write function postconditions, but they can be more vague – stating something about what the function achieves rather than everything it is supposed to achieve.
Here’s an example. Suppose we have read a value from a sensor that has a non-linear response. We wish to linearize the value. To this end, we have a constant table of pairs of (raw value, linearized value), ordered such that both values are monotonically increasing. We will use a binary search to find the two adjacent entries whose raw values bracket the value read from the sensor, then interpolate between the corresponding linearized values. So the function looks something like this:
#include "arc.h"
#include "stddef.h"
typedef unsigned short uint16_t;
typedef struct { uint16_t raw; uint16_t corrected; } LinEntry;
size_t bSearch(const LinEntry* array table, size_t nElems, uint16_t key);
uint16_t linearize(const LinEntry* array table, size_t nElems, uint16_t rawValue) {
size_t index = bSearch(table, nElems, rawValue);
if (index == 0) return table[0].corrected;
else if (index == nElems) return table[nElems - 1].corrected;
else {
LinEntry low = table[index - 1], high = table[index];
return (((high.corrected - low.corrected)/(high.raw - low.raw))
* (rawValue - low.raw)) + low.corrected;
}
}
static const LinEntry linTable[] = {
{0, 0}, {10, 15}, {20, 29}, ...
};
... linearize(linTable, sizeof(linTable)/sizeof(linTable[0]), someData) ...
The bSearch function takes a pointer to an array of LinEntry records (see my earlier post on taming arrays if you haven’t seen the array keyword before), the number of elements in the array, and the raw data value. I’ve assumed it returns a value in the range 0..nElems, such that 0 means the raw value is off the bottom of the table, nElems means it is off the top, otherwise the table entry it indexes and the previous table entry bracket the raw value. If we want to prove partial or full correctness, we need to specify all of this. But we can be less precise if we’re just proving predictable execution. Let’s see what ArC makes of it with no annotation other than the array keywords:
c:\arc\ex1.c (11,33): Warning! Unable to prove: Precondition of ‘[]‘ satisfied
(see C:\Arc\ex1_unproven.htm#9), cannot prove: 0 < table.lim.
c:\arc\ex1.c (12,43): Warning! Exceeded boredom threshold proving: Precondition of ‘[]‘ satisfied
(see C:\Arc\ex1_unproven.htm#7), cannot prove: nElems < (1 + table.lim).
c:\arc\ex1.c (14,29): Warning! Exceeded boredom threshold proving: Precondition of ‘[]‘ satisfied
(see C:\Arc\ex1_unproven.htm#2), cannot prove: index < (1 + table.lim).
c:\arc\ex1.c (14,54): Warning! Exceeded boredom threshold proving: Precondition of ‘[]‘ satisfied
(see C:\Arc\ex1_unproven.htm#4), cannot prove: index < table.lim.
c:\arc\ex1.c (15,17): Warning! Unable to prove: Precondition of ‘/’ satisfied
(see C:\Arc\ex1_unproven.htm#5), cannot prove: low.raw < high.raw.
There are a few problems here, all in function linearize:
- The first warning says that table[0] might be out-of-bounds, because table might be an empty array.
- The second says that table[nElems - 1] might be out-of-bounds, because we haven’t told ArC that nElems is the number of elements in table.
- The third and fourth say that table[index] and table[index - 1] might be out-of-bounds, because ArC doesn’t know that bSearch returns a value between 0 and nElems, and that table points to the start of an array with nElems elements.
- The final warning says that in the interpolation operation, the divisor (table[index].raw – table[index - 1].raw) might not be positive. Division by zero would typically cause an exception; while division by a negative number yields an implementation-defined result in C’90 and is therefore not allowed by ArC. The division will be safe if the raw values in table are monotonically increasing.
So let’s give ArC the information it needs, by adding some partial specifications:
size_t bSearch(const LinEntry* array table, size_t nElems, uint16_t key) post(result <= nElems); uint16_t linearize(const LinEntry* array table, size_t nElems,uint16_t rawValue) pre(table.lwb == 0; table.lim == nElems) pre(nElems >= 1) pre(forall i in 1..(nElems - 1) :- table[i].raw > table[i - 1].raw) { size_t index = bSearch(table, nElems, rawValue); if (index == 0) return table[0].corrected; else if (index == nElems) return table[nElems - 1].corrected; else { LinEntry low = table[index - 1], high = table[index]; return (((high.corrected - low.corrected)/(high.raw - low.raw)) * (rawValue - low.raw)) + low.corrected; } }
In the postcondition of bSearch, we’ve specified that the return value is no greater than nElems (since it is of type size_t, it can’t be less than zero anyway). In the precondition of linearize we’ve said that table is a regular array with lower bound zero and limit nElems. These are enough to ensure that the array accesses table[index - i] and table[index] in the else-part of the if-statement are in bounds, because the previous parts of the if-statement handle the cases index == 0 and index == nElems. We’ve also specified that nElems is at least one to take care of the access to table[0].
The final precondition we’ve added to linearize says the raw values in table are monotonically increasing, by stating that for all values i from 1 up to nElems - 1 (i.e. the highest index of table), table[i].raw is greater than table[i - 1].raw.
These partial specifications are sufficient for ArC to prove that function linearize executes predictably, assuming that it is called in a state satisfying its preconditions, and that bSearch executes predictably, has no side-effects, and returns a value satisfying its postcondition. The preconditions of linearize will be verified at every place where it is called. We’ll look at verifying bSearch next time.
Danger – unsigned types used here!
By way of a change from the last two posts on formal verification, this time I’m going to talk about using unsigned types in C/C++. Modern programming languages such as C# and Java don’t provide unsigned types, with good reason (actually, C# does have unsigned types, but only for the purpose of interfacing to COM objects written in other languages, and they are not used in the .NET Framework API).
To illustrate the dangers of using unsigned types, I invite you to consider this example (in which uint16_t and int32_t are typedef’d to be unsigned 16-bit and signed 32-bit types respectively) and decide whether it makes safe use of unsigned values:
void foo(const uint16_t* reading, const size_t size) {
size_t i;
for (i = size - 1; i >= 0; --i) {
doSomething(reading[i]);
if (i > 0) {
int32_t diff = reading[i] - reading[i - 1];
doSomethingElse(diff);
}
}
}
Why are unsigned types dangerous in C/C++? Here are some reasons:
1. C/C++ provide implicit type conversions between signed and unsigned values. Unlike Ada, there is no a runtime check to make sure the value is convertible to the new type. For example, you can readily “convert” a negative signed value to an unsigned value.
2. If you mix signed and unsigned operands in an arithmetic operation, the signed operand is always converted to unsigned. This may not be what you wanted.
3. In arithmetic expressions, operands whose type is smaller than (unsigned) int get promoted to int or unsigned int. This complicates the situation – especially if the result is implicitly converted again, because the implicit type conversions may not be happening where you think they are.
4. It is very easy to underflow the minimum value of an unsigned int (i.e. zero). It is much more difficult to underflow or overflow the minimum or maximum value of a signed int, especially on a 32-bit (or greater) platform.
How many problems did you spot in the example? There are at least three. The first one is the use of size – 1 as the initial value of the loop counter i. Since size is of type size_t which is an unsigned type, size – 1 will underflow if size is zero. In this case, the loop will count down from the maximum value of a size_t rather than not executing at all. The second problem is the use of i >= 0 in the for-loop header, with i again of type size_t. The loop will never terminate, because i cannot go negative. The third problem concerns the assignment of reading[i] – reading[i - 1] to diff. Suppose reading[i -i] is greater than reading[i], for example by one. Will diff end up as -1 ? Unfortunately, that depends on your compiler and target platform. If uint16_t maps to unsigned short and int32_t maps to int, and assuming 2′s complement hardware, then yes. Both readings will be promoted to unsigned int prior to subtraction, yielding 0xFFFFFFFF. This is then converted implicitly to int for the assignment to diff, yielding -1. But if uint16_t maps to unsigned int and int32_t maps to long, we get a different result. The subtraction yields 0xFFFF this time, which is converted to 0x0000FFFF for assignment to diff.
If you regularly use static analysis on your C/C++ programs (as I hope you do), you might like to check whether your static analyzer reports all three problems for this example.
What’s the best way to manage the dangers inherent in using unsigned types? One strategy is Just Say No. Don’t use unsigned types, except in very special situations. Is this really feasible? When using a 32-bit (or greater) platform, I think it is. You’ll need to define a signed size-type (i.e. an integral type with the same size as size_t) to use as the natural type for representing array indices, string lengths, and sizes. On most platforms, ptrdiff_t is a suitable type to start from. Whenever you call a library function that returns a size_t, or use a sizeof expression, you should immediately cast the result to your signed size type. You may have difficulties if you use an array or string that takes up more than half the address space of the processor, but are you ever going to do that? The other thing you’ll need to do is convert unsigned data read from the hardware to signed data as soon as it is read in, after any necessary shifting and masking. If you read an unsigned value from a 16-bit A-to-D converter, you’ll need to store each value as 32-bits. Alternatively, if you need to store lots of them, you can exempt that data from Just Say No and store it as 16-bit unsigned values, doing the conversion to 32-bit signed int whenever you pick a value out from the store.
Here’s my original example using Just Say No, assuming that the readings came from a 15-bit (or less) A-to-D converter, with sz_t standing for our signed size type:
void foo(const int16_t* reading, const sz_t size) {
sz_t i;
for (i = size - 1; i >= 0; --i) {
doSomething(reading[i]);
if (i > 0) {
int32_t diff = reading[i] - reading[i - 1];
doSomethingElse(diff);
}
}
}
The only changes I made are to replace unsigned types by signed types. You’ll still need to use unsigned values where you do bitwise operations, such as shifting and masking. But you probably only need to do these operations on raw incoming data, before storing the result as signed data. Also, if you are using bit fields, any fields that store data of Boolean or enumeration type should be declared unsigned.
If Just Say No is too radical for you, then the alternative is to embrace unsigned values, but be very careful using them. You must use a good static analyzer to detect possible problems. MISRA C 2004 rule 10.1 prohibits implicit signed/unsigned conversions (amongst others), so a static analyzer that enforces MISRA compliance should catch them. Even better, use a formal tool such as ArC. Here’s a safer version of our example that still uses unsigned types:
void foo(const uint16_t* reading, const size_t size) {
size_t i;
for (i = size; i != 0; ) {
--i;
doSomething(reading[i]);
if (i > 0) {
int32_t diff = (int32_t)(reading[i]) - (int32_t)(reading[i - 1]);
doSomethingElse(diff);
}
}
}
I’ve changed the loop to count down from size instead of size – 1, I’ve changed the termination condition, and I’ve moved the decrement of i to the start of the loop body. I’ve also cast the unsigned readings to signed before subtracting them.
Have you been caught out by the behavour of unsigned types? Does your organization have a policy on using them? As always, I welcome your feedback – just click on the Leave a comment link near the top of this page.
Making sure variables are initialized
One source of program bugs is use of variables before they have been initialized. In C/C++ all static variables get zero-initialized if they have no specified initialization, so it is only local variables we need to worry about. Bugs caused by use of uninitialized local variables can be particularly nasty, because the value of such a variable depends on whatever previously occupied the same stack location. So the value read can vary depending on the past history of the program, and on the optimization level selected when the program was compiled (thereby causing debug builds and release builds to behave differently). An uninitialized variable bug that was benign and undetected in one version of a program can suddenly rear its ugly head when an unrelated change is made to the program.
We therefore need to ensure that all automatic variables are initialized before being used (MISRA C 2004 rule 9.1). Let’s consider the case of variables of primitive types first – we’ll look at structs, classes and unions later. Most static analysers can warn about simple cases like this:
void foo() {
int i;
... /* code that doesn't assign i */
bar(i); /* error, i not initialized */
}
Many static analysers will also warn about the following:
void foo() {
int i;
if (b1) { i = 0; }
... /* code that doesn't assign i */
if (b2) { bar(i); }
}
If b1 in the first if-statement was true whenever b2 in the second if-statement is true, then such a warning is a false positive. One approach when using such a static analyser is to add a dummy initialization for i, after satisfying oneself that the code is correct. Modern languages such as C# and Java use a concept of “definite initialization” that also requires a dummy initialization in this situation, otherwise the compiler will refuse to compile the program. However, ArC does a more thorough analysis and will only generate a warning if it can’t prove that b2 implies b1, thereby avoiding the need for a dummy initialization.
Here is a more difficult case:
void foo() {
int i;
bar(&i); /* does this require i to be initialized first? */
fum(i); /* has i been initialized yet? */
}
The problem here is that we don’t know whether bar() expects its parameter to point to initialized memory or not, or whether bar() always writes to the variable addressed by its parameter. If bar() is a fairly simple function, then a static analyser may be able to work it out; but that isn’t always possible, especially if bar() calls other functions whose source is not available.
ArC gets round this problem by providing an out-parameter annotation. If bar() is declared with the following signature:
void bar(out int* p);
then bar() does not require p to point to an initialized variable, but it is obliged to assign *p prior to returning. As usual with ArC keywords, when the program is being compiled, out is defined as a macro expanding to nothing, so the compiler doesn’t object to this syntax.
What about structs and classes? If you’re using C++, then I recommend that you define at least one constructor in every class or struct declaration. This prevents you from declaring a variable of that type without a constructor being called. ArC requires a constructor to initialize all fields of the class or struct before returning, and will verify this.
What about unions? You may recall from my previous post that ArC considers each union to include a ghost discriminant field that can be interrogated using the holds operator. Well, ArC considers an uninitialized union to hold nothing. Therefore, attempting to access a member of an uninitialized union will always result in a verification failure.
Using and Abusing Unions
The C union type is one of those features that is generally frowned on by those who set programming standards for critical systems, yet is quite often used. MISRA C 2004 rule 18.4 bans them (“unions shall not be used”) on the grounds that there is a risk that the data may be misinterpreted. However, it goes on to say that deviations are acceptable for packing and unpacking of data, and for implementing variant records provided that the variants are differentiated by a common field.
According to K&R’s The C Programming Language, “A union is a variable that may hold (at different times) objects of different types and sizes, with the compiler keeping track of size and alignment requirements.” Note the words at different times. So it appears that they didn’t expect programmers to use them to pack and unpack data, using code such as the following:
uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
union {
uint16_t wordData;
uint8_t byteData[2];
} temp;
temp.byteData[0] = lobyte;
temp.byteData[1] = hibyte;
return temp.wordData;
}
I regard this as an abuse of unions. This code is not portable, because its behaviour is dependent on how the compiler lays out the union, and on whether the processor is big-endian or little-endian. Is there any need to use it? Let’s look at the alternative:
uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
return ((uint16_t)hibyte) << 8 ) | (uint16_t)lobyte;
}
This version also makes it clear that lobyte contains exactly the lower 8 bits of the data (which was probably read from an I/O port), and does not assume that uint8_t has exactly 8 bits (as opposed to at least 8 bits).
Is there any reason not to use this code and avoid the union? There might be a performance penalty, but only if you are using a poor compiler or a very low optimization level so that the compiler does not implement the shift using a move or byte swap instruction, and the processor does not have barrel-shift hardware. In other cases, this version might be faster than using a union, because optimizing it does not require a variable to be eliminated.
ArC requires programs to be type-safe, and doesn’t make assumptions about endianness or struct and union layout and alignment. So it doesn’t support use of unions in this way. In the event that you really do need to use a union for packing or unpacking data, you can fool ArC like this:
#ifdef __ARC__
// define the shift version of unpack
...
#else
// define the union version of unpack
...
#endif
but you are then assuming responsibility for ensuring that the union version behaves correctly.
What about using unions for their intended purpose, i.e. holding different types of data at different times? The usual criticism here is that C unions don’t have automatic discriminants, so the compiler cannot insert run-time checks. Why not verify formally that the data is never misinterpreted instead? What we need to ensure is that a union is only ever read through the same member as was last used to assign it. We express the concept that “member M was last used to assign the value of E” in ArC using the syntax E holds M. We can use a holds expression anywhere in a specification or any other ghost context, but not of course in real code. Here’s an example:
struct Status { ... };
struct Error { ...};
union StatusOrError {
struct Status st;
struct Error err;
};
static union StatusOrError lastResult;
Whenever ArC sees lastResult.err or lastResult.st being read, it will attempt to prove lastResult holds err or lastresult holds st respectively. If we want to write a function that assumes that lastResult holds a particular member, ArC will fail to verify the function unless we declare that assumption as a precondition. For example:
void displayError()
pre(lastResult holds err)
{ ... lastResult.err ... }
void displayStatus()
pre(lastResult holds st)
{ ... lastResult.st ... }
Now ArC will need to verify that the precondition holds at each call to displayError or displayStatus:
lastResult.err = ... ;
displayError(); // OK
displayStatus(); // verification failure here
So we have made unions type safe, effectively by adding a ghost discriminant that can be interrogated by a holds expression. If you want to store a real discriminant, you can tie the two together using an invariant:
struct WrappedStatusOrError {
union StatusOrError stOrErr;
enum { disc_st, disc_err } disc;
invariant((disc == disc_st) == (stOrErr holds st))
invariant((disc == disc_err) == (stOrErr holds err))
}
Unions are rarely used in regular C++ programming, because variant data is almost always better represented using a class inheritance hierarchy. However, that approach normally requires dynamic memory allocation. Therefore, C-style unions still have a place in embedded C++ programming.
Safer arrays: using a C++ array class
In a previous post, I remarked that arrays in C leave much to be desired, and that in C++ it is better to avoid using naked arrays. You can avoid naked arrays in C++ programming by wrapping them up in a suitable array class instead. The Joint Strike Fighter C++ Coding Standards document takes a similar view; rule 97 in that standard states:
Arrays shall not be used in interfaces. Instead, the Array class should be used.
Rationale: Arrays degenerate to pointers when passed as parameters. This “array decay” problem has long been known to be a source of errors.
Unfortunately, the Array.doc file mentioned in the standard was not made available to the public. However, we can easily assemble our own set of array classes, starting with the fixed-length array class included in C++ Technical Report 1. That class is called std::tr1::array where T is the element type and N is the number of elements. Implementation of this class does not require the use of dynamic memory.
To illustrate the use of the TR1 array class, here is a small example using C arrays:
#define BUFLEN (100)
static int buf[BUFLEN];
for (size_t i = 0; i < BUFLEN; ++i) {
... buf[i] ...
}
Here is the same example using C++ and the array class:
using std::tr1::array;
const size_t buflen = 100;
typedef array<int, buflen> buf_t;
static buf_t buf;
for (buf_t::iterator it = buf.begin(); it != buf.end(); ++it) {
... *it ...
}
Since I want to process every element of buf, I’ve used an array iterator in the for-loop header. If I want to process the elements in reverse order, I can use a reverse_iterator instead:
for (buf_t::reverse_iterator rit = buf.rbegin(); rit != buf.rend(); ++rit) {
... *rit ...
}
Using iterators avoids the risk of off-by-one errors.
If I want to pass one of these buffers to a function, the simplest way is to pass it by reference or by const-reference:
int sumBuffer(const buf_t& src) {
int sum = 0;
for (buf_t::const_iterator it = src.begin(); it != src.end(); ++it) {
sum += *it;
}
return sum;
}
One disadvantage of passing an array in this manner is that the size is part of the type. Therefore, I can’t write a version of sumBuffer that works with buffers of varying sizes, unless I write it as a template – which a compiler would typically instantiate separately for each buffer size.
We can avoid this disadvantage by defining two more class templates to represent references to an array. Here are their outlines:
template<class T> class array_ref {
T* ptr;
size_t sz;
public:
template<size_t L> array_ref(array<T, L>& arg)
: ptr(arg.data()), sz(L) {}
size_t size() const { return sz; }
T& operator[](size_t index) const {
return ptr[index];
}
};
template<class T> class const_array_ref {
const T* ptr;
size_t sz;
public:
template<size_t L>const_array_ref(const array<T, L>& arg)
: ptr(arg.data()), sz(L) {}
size_t size() const { return sz; }
const T& operator[](size_t index) const {
return ptr[index];
}
};
Class array_ref holds a reference to the underlying data held in an array, along with a note of the number of elements. Class const_array_ref does the same but provides read-only access. We can go on to define iterators for these classes, so that we can safely traverse arrays passed by reference. This allows us to write the following:
static array<int, 100> bigBuf;
static array<int, 20> smallBuf;
int sumBuffer(const_array_ref<int> src) {
int sum = 0;
for (const_array_ref::const_iterator it = src.begin(); it != src.end(); ++it) {
sum += *it;
}
return sum;
}
...
int total = sumBuffer(smallBuf) + sumBuffer(bigBuf);
As standard, class std::tr1::array provides bounds-checking for the at() function but not for the indexing operator. However, many implementations provide for the insertion of bounds checks, typically enabled by the DEBUG macro. Similarly, bounds checking can easily be provided in array_ref and const_array_ref if required.
In summary, C++ allows us to avoid the problems of naked array pointers, by providing us with alternative array representations that carry size information, support iterators so that we can more safely iterate through arrays, and optionally include run-time bounds checking.
How (un)safe is pointer arithmetic?
I recognize that this is a controversial topic – if you’re a safety-critical professional using C or C++, I’d be glad to hear your views.
Using explicit pointer arithmetic in critical software is generally frowned upon. MISRA 2004 rules 17.1 to 17.3 prohibit some particular cases of explicit pointer arithmetic that do not give rise to well-defined results. Rule 17.4 then states that “Array indexing shall be the only allowed form of pointer arithmetic”. All 4 rules are “required” rather than “advisory”, so 17.4 appears to make the preceding 3 rules redundant. The implication seems to be that developers who break 17.4 should at least honour 17.1 to 17.3.
The most common use of explicit pointer arithmetic in C is to increment a pointer while processing an array of data in a loop:
for (T* p = arr; p != arr + numElements; ++p) {
*p = foo(*p);
}
The expression arr + numElements is a classic C pointer to one-past-the-last-element of the array.
In C++ programming using the Standard Template Library, the idiom of processing an array by incrementing a pointer through its elements is generalised through the concept of iterators:
for (C<T>::iterator p = coll.begin(); p != coll.end(); ++p) {
*p = foo(*p);
}
where coll has type C<T>, i.e. some collection whose elements are of type T.
We can readily replace the C example above with code that does not use pointer arithmetic:
for (size_t i = 0; i != numElements; ++i) {
arr[i] = foo(arr[i]);
}
Does this version have any disadvantages compared to the one that increments the pointer? Well, it may be a little slower. A non-optimising compiler will generate code to evaluate arr[i] every time it appears within the body of the loop – slower than evaluating *p. An optimising compiler will probably create an induction variable to track the value of &arr[i], mimicking pointer p from the original version. Even so, on a RISC architecture it needs to use 3 registers to be efficient (to hold i, &arr[i], and numElements); whereas the original version needs just 2 (to hold p and arr + numElements).
Ideally, architects of critical systems should specify sufficiently powerful hardware that the software developers can concentrate on correctness and not have to worry about minor differences in efficiency. Unfortunately, in mass markets such as the automobile industry, and in low-energy markets such as space, pressure to minimise hardware power is often present.
So, can we safely ignore MISRA rule 17.4 and use the pointer-incrementing version? Is pointer arithmetic really any more dangerous than array indexing? Arithmetic on pointers can yield pointers that are outside the array they were intended to point to; but is this a worse or harder-to-avoid hazard than indexing an array out-of-bounds? If you’ve read this far and have a view on this, why not leave a comment on this entry?
Because we do occasionally see instances of pointer arithmetic in safety-critical code we’ve been asked to look at, we decided to see whether could verify these properties in ArC:
- Pointer arithmetic is only applied to array pointers;
- Pointer arithmetic always yields a pointer into the original array or to one-past-the-last-element of the original array;
- When two pointers are compared or subtracted, they point into the same array;
- A pointer to one-past-the-last-element is never dereferenced.
The answer is that it is no more difficult verify these properties than to verify that array indices are in bounds. So we do support pointer arithmetic in ArC. That is why array pointers in ArC have the lwb ghost member (see my earlier post about taming array pointers). In the absence of pointer arithmetic, the lower bound is always zero.
If you’re using C++ rather than C and you’re taking my earlier advice to use Array, Vector or similar classes in preference to naked arrays, then you may be tempted to revert to naked arrays and pointer arithmetic where efficiency is vital. However, we’ve found it straightforward to implement an efficient Array class that supports either pointer arithmetic or iterators; and when run-time index checking is enabled, we provide run-time pointer arithmetic or iterator checks as well.
Using Unicode in embedded software
Unicode provides a single character set that can represent nearly all of the world’s written languages. Mainstream software development has largely moved to Unicode already, helped by the fact that in modern languages such as Java and C#, type char is defined to be a Unicode character. However, in C a char is invariably 8 bits on modern architectures, and the associated character set is ASCII. Does this matter, for embedded software?
It matters if you need either to accept input or to generate output in languages not supported by ASCII. Maybe you’re planning some new embedded software now. Your current customers may be happy with English language status text on the display of your device; but what export markets might you miss out on? Designing an architecture suitable for more than one language is less expensive when the software is first written than retro-fitting it later. Here are three ways you can do it.
1. Use an 8-bit extended character set. The standard here is ISO 8859. Unfortunately, different languages need character sets, because 255 characters is by no means enough to cover a wide variety of languages. So ISO 8859 defines around 15 different character sets, of which ISO 8859-1 (aka Latin-1) is the most widely used (note however that it doesn’t include the Euro sign – you’ll need ISO 8859-15 or -16 instead if you want to display currency symbols).
This approach has some drawbacks:
- If you need to support more languages than a single extended character set supports, then you’ll need to use different character sets for different markets. This in turn will require the rendering of characters on any display device to be dependent on the target market.
- Ideally, you want strings in your source file to look exactly as they will on the output device, for example “fermé”. But to do this, you’ll need to configure your editor to use the same character set as the target. Many editors don’t provide this facility, and if you’re not careful then you’ll end up using the wrong character set. So you’ll probably have to write “ferm\xE9″ instead.
2. Use UTF-8 encoded strings. UTF-8 is a way of encoding any Unicode character string as a sequence of 8-bit bytes. Characters in the ASCII range 00-7F (hex) are represented in a single byte and are the same as in ASCII. Other characters are represented in 2 to 4 bytes.
The main drawback of this approach is that in a C array of characters, the number of characters represented is no longer equal to the number of elements in the array (or up to the null terminator). Whenever you work with the length of a string, you need to be very clear whether you mean the number of char elements in it or the number of displayed characters it represents.
3. Use wide characters. This is the most flexible approach. If you can afford the memory space to store multiple translations of your status strings, then you can produce just one version of your device, with a configuration option to select the end-user language. But watch out for the following:
- wchar_t will be either a 16-bit or a 32-bit character type, depending on your compiler. So characters and strings will take 2 or 4 times as much memory as they do when using plain char.
- If wchar_t is 16 bits, then Unicode characters that are not in the first 65536 will either not be supported at all, or will be encoded as 2 wide characters (UTF-16 encoding). However, such characters are used only in rare scripts, and the chances are that an embedded device will not need to support them.
- If you want WSIWYG strings such as “fermé” in source text, you’ll need to store your source files in some form of Unicode, and you’ll need to make sure that your compiler understands the encoding. Most compilers support UTF-8 source files these days. If you need a free Windows-hosted editor that supports Unicode, you could try PSPad.
- Unicode provides two ways of representing characters containing diacritical marks, such as “é”. In all common cases, there is a single code point that represents the composite character. However, it is also possible to represent them using the unadorned character followed by a second character that represents a diacritical mark to be combined with it. You’ll almost certainly want to use the composite version, so that 1 wide character == 1 displayed character. You’ll need to make sure that your editor represents them this way, and that any Unicode input provided to your program is in this form.
- You may have been assuming that sizeof(char) == 1 in code such as the following:
static char msg[] = "closed";
const size_t msgChars = sizeof(msg) - 1;
The second line should instead be written as:
const int msgChars =
(sizeof(msg)/sizeof(msg[0])) - 1;
so that it still gives the correct number of characters when you replace the first line by:
static wchar_t msg[] = L"fermé";
- Header file wchar.h provides wide character versions of many of the standard string functions in string.h, however the semantics are not always the same.
- Wide characters are not type-safe in C’90, because wchar_t is just a typedef for some other integral type (and you’ll need to #include <wchar.h> to make it available). Once again, C++ does it better, by providing wchar_t as a separate built-in type. If you’re using ArC to analyse your software, then you’ll get the benefits of a strong wchar_t type even in C, because ArC pretends it is a separate type and ignores any typedef of wchar_t.
What if you’re not ready to commit to Unicode, but you might want to switch your software to Unicode in future? You can use the following definitions:
#ifdef UNICODE
typedef wchar_t char_t;
#define CONCAT(_a, _b) _a ## _b
#define _T(_text) CONCAT(L, _text)
#else
typedef char char_t;
#define _T(text) text
#endif
You can then write the following:
const char_t[] msg = _T("closed");
making it easier to switch between ASCII and Unicode. If you use functions from string.h in your program, then you may also want to #define your own versions that map either to the standard versions or to the wide versions.