## Verifying pointer arithmetic

Today I’ll look at whether code that uses pointer arithmetic is any harder to verify than equivalent code that does not use pointer arithmetic.

Consider this function for copying an array (or part of an array) into another array (or part of another array):

voidarrayCopy(const int* src,int* dst, size_t num) { size_t i;for(i = 0; i < num; ++i) { dst[i] = src[i]; } }

To verify it, we first need to specify what it is supposed to achieve in a postcondition and add any necessary preconditions. Then we need to add a loop invariant so that ArC can verify that the loop achieves the desired state when it terminates, and a loop variant so that ArC can prove that the loop terminates:

voidarrayCopy(const int*arraysrc,int*arraydst, size_t num)writes(dst.all)pre(src.lim >= num; dst.lim >= num)pre(disjoint(src.all, dst.all))post(foralliin0..(num - 1) :- dst[i] == src[i]) { size_t i;for(i = 0; i < num; ++i)keep(i <= num)keep(foralljin0..(i - 1) :- dst[j] == src[j])decrease(num - i) { dst[i] = src[i]; } }

This is fairly straightforward. As usual, the main loop invariant (the **keep**(**forall**…) clause) is a generalisation of the desired postcondition, and we need another invariant (the **keep**(i <= num) clause) to ensure that the main invariant and the body don’t violate the preconditions of the indexing operator. These invariants and the loop variant (the **decrease**(…) clause) are sufficient for ArC to verify this function.

Many C programmers would write the body as **dst++ = *src++;* instead of explicitly indexing *dst *and *src*. If we make this change, then because we are changing *src *and *dst* within the loop, we need to describe their values in the invariant. Here is how we can do that:

voidarrayCopy(const int*arraysrc,int*arraydst, size_t num)writes(dst.all)pre(src.lim >= num; dst.lim >= num)pre(disjoint(src.all, dst.all))post(foralliin0..(num - 1) :- dst[i] == src[i]) { size_t i;for(i = 0; i < num; ++i)keep(i <= num)keep(foralljin0..(i - 1) :- (olddst)[j] == (oldsrc)[j])keep(src == (oldsrc) + i; dst == (olddst) + i)decrease(num - i) { *dst++ = *src++; } }

Applying the keyword **old **to an expression in a loop invariant or loop variant yields the value of that expression just before the first iteration of the loop. I’ve had to replace *src *in the second loop invariant by * old src* and similarly for

*dst*, because in that invariant I am referring to array elements computed relative to the original values of

*src*and

*dst*. I’ve described the modification to

*src*and

*dst*by adding a third invariant, that says that at the start and end of any iteration,

*src*has its original value plus

*i*, and similarly for

*dst*. The post-increment operators applied to

*src*and

*dst*in the body ensure that these relationships are maintained. Once again, the function is verifiable by ArC.

What if we go further and get rid of the loop counter *i *as well? We can do that if we precompute the end pointer *src + num* at the start, and then iterate until *src *reaches this point. We still need to describe the values of *src *and *dst *in the invariant, but we must do so without the benefit of *i*. Here’s one way of doing it:

voidarrayCopy(const int*arraysrc,int*arraydst, size_t num)writes(dst.all)pre(src.lim >= num; dst.lim >= num)pre(disjoint(src.all, dst.all))post(foralliin0..(num - 1) :- dst[i] == src[i]) {const int*array constsrcEnd = src + num;while(src != srcEnd)keep(src.base ==old(src.base))keep(0 <= src - (oldsrc))keep(src - (oldsrc) <= n)keep(foralljin0..((src - (oldsrc)) - 1) :- (olddst)[j] == (oldsrc)[j])keep(dst == (olddst) + (src - (oldsrc)))decrease(srcEnd - src) { *dst++ = *src++; } }

Here’s how I arrived at the new loop annotations:

- I’ve taken the three
**keep**-clauses from the previous version and substituted*(src – (*for the loop counter**old**src))*i*. That’s how I got the 3rd, 4th and 5th**keep**-clauses. For the last of the three, I also removed the resulting tautology*src == (*, leaving just the component that talks about**old**src) + (src – (**old**src))*dst*. - The original first invariant
placed an upper bound on**keep**(i <= n)*i*, thereby ensuring that*src[i]*and*dst[i]*in the original code or**src*and**dst*in the modified code were in bounds. I didn’t need to put a lower bound on*i*because its type was*size_t*which is an unsigned type, so its lower bound was zero implicitly. But I’ve now replaced*i*by*(src – (*, which has no such constraint. So I need to put a zero lower bound on this expression, which is what the second**old**src))**keep**-clause is for. - Unfortunately, this isn’t quite enough. The expression
*src –*uses the pointer-difference operator, which has the precondition that the operands are pointers into the same array. In general, when we re-assign an array pointer such as**old**src*src*, there is no requirement that it points into the same array as before. In this case, all we ever do to*src*is increment it; so it obviously continues to point into the same array as long as it is not over-incremented (which ArC verifies). However, ArC isn’t yet clever enough to spot this and generate an implicit loop invariant. So I’ve added the first invariant*src.base ==*right at the beginning of the loop. In the expression**old**(src.base)*src.base*,*base*is a ghost field of*src*, just like*lim*and*lwb*. It subtracts*src.lwb*from*src*, yielding a pointer to the very start of the array that*src*points into. So to state that two array pointers point into the same array, I just need to say that their bases are equal. - Rather than just substitute for
*i*in the loop variant, I’ve changed it to say that I expect the difference between*srcEnd*and*src*to decrease on each iteration.

I hope this example demonstrates to you that verification of C code that uses pointer arithmetic can be done, although it is likely to require more loop invariants than equivalent code that does not use pointer arithmetic, and getting them right may be a little harder. The number of verification conditions to be proved may be higher too – there were 24 from the original code, 32 from the second version, and 35 from the last version. So indexing should generally be preferred to explicit pointer arithmetic – as recommended by MISRA-C 2004 rule 17.4 – when writing verifiable code.

Finally, we could implement this function using *memcpy*:

voidarrayCopy(const int*arraysrc,int*arraydst, size_t num)writes(dst.all)pre(src.lim >= num; dst.lim >= num)pre(disjoint(src.all, dst.all))pre(num *sizeof(int) <= maxof(size_t))post(foralliin0..(num - 1) :- dst[i] == src[i]) { memcpy(dst, src, num *sizeof(int)); }

I’ve had to add another precondition to ensure that the computation of the number of bytes *num * sizeof(int)* doesn’t overflow a

*size_t*. If Arc were to assume that the number of bytes in any array cannot exceed

*– as is the case for many implementations – then it could infer that*

**maxof**(size_t)*dst.lim <=*and we would not need the precondition. Unfortunately, the C standard makes no such promise.

**maxof**(size_t)/**sizeof**(int)The call to *memcpy *involves implicit conversions of its first two parameters from * int** to

*and from*

**void****to*

**const int****. In general, pointer conversions should be avoided when writing for ArC, because such conversions break ArC’s strong type model. However, ArC knows enough about the semantics of*

**const void****memcpy*to prove the 10 verification conditions generated by this version too.

You give an example that does not use loop indexing, and argue that it’s more complex than using loop indexing, so therefore everyone should use loop indexing.

But I’m not convinced. It seems like the code that doesn’t use loop indexing could be made simpler. I don’t think you’ve made the case that this is the best it can get, without loop indexing.

For instance, look at Deputy. See especially Figure 4 of the OSDI 2006 paper, which suggests that the invariants and annotations needed can be substantially less, through two techniques: (a) appropriate design of an invariant language, (b) appropriate inference.

In your specific example, can we do something like this?

I make no promises that I got the above exactly right, since I don’t have a copy of ArC to test this on, but it seems simpler than what you’ve listed.

It also seems plausible that, with a little bit of support for programming with pointer arithmetic, most of these loop invariants could be automatically inferred.

Another comment: In one place, you write: “all we ever do to src is increment it; so it obviously continues to point into the same array as long as it is not over-incremented (which ArC verifies). However, ArC isn’t yet clever enough to spot this and generate an implicit loop invariant.” Well, this suggests to me that your comparison above is not a fair comparison. The fair comparison would be between ‘ArC written in a way that envisions typical idioms using loop indices, used to verify code written with loop indices’ vs ‘ ArC written in a way that envisions typical idioms using pointer arithmetic, used to verify code written with pointer arithmetic’. It seems like right now you are using a tool that has been optimized for code written using loop indices, and then concluding that this tool works better for code that uses loop indices than for code that uses pointer arithmetic. Well, no surprise there. The interesting question is what changes if you extend ArC to recognize typical C idioms of programming using pointer arithmetic and generate the obvious implicit loop invariants for those idioms.

The really interesting comparison would be one where we are dealing with ”-terminated strings, which is the one where the classic C pointer arithmetic idioms really start to look more concise than ones using loop indices. You might try a side-by-side comparison of something like strcpy(), for instance — that would be interesting.

Ultimately, I’d agree with your final conclusion: that using loop indices will often make verification easier than using pointer arithmetic. My quibble is with your argument, rather than your conclusion; I think the argument may have some weak spots in it.

P.S. Do we also need arrayCopy() to have a precondition that src!=NULL and dst!=NULL?

Thanks for your comment. You seem to be suggesting that ArC favours code that doesn’t use pointer arithmetic because that’s what we’ve written it to work with. It’s certainly true that we’ve developed ArC with reference to code that is actually used in critical systems, however I don’t believe that’s an important factor here. Anywhere you compare pointers other than for equality, or subtract one pointer from another, there is the precondition that both point into the same array; otherwise the result is undefined. In my example, this meant that an extra invariant “src.base == (old src).base” was needed. ArC currently doesn’t infer any loop invariants at all (although it often suggests them when proofs fail), so I needed to add this explicitly.

In your suggested code, I’m not sure whether you really meant p and q to have type char* or whether you meant int*. If you use char* then the main invariant is incorrect because an element will sometimes have been partially copied. So I tried the following amended version of your code:

ArC can’t prove this correct until I implement its suggestions for extra loop invariants “p.base == src.base” and “q.base == dst.base”. This gives a total of 8 invariants after splitting on “&&” at the top level, compared to 2 for my original version (or 3 if you include the invariant 0 <= i that I didn't need to make explicit because I declared i as of type size_t).

I agree that suitable design of an invariant langage can reduce the number of annotations needed. In ArC you can declare ghost functions to define your own invariant language, to some extent. If I declare:

then I can condense 6 of the invariants into just 2 that call pointsWithin.

Regarding your PS, ArC doesn't require an explicit precondition that src and dst are not null, because it requires nullable pointers to be qualified with the keyword "null" when they are declared – just as it requires array pointers to be qualified with "array". We decided that strengthening the type system in this way was preferable to writing the "not null" preconditions that would otherwise be needed, because in critical embedded software, pointers are used mostly for parameter passing and null values don't usually make sense.

Thanks. Right, that’s what I intended; sorry about the inadvertent char *s. The invariants indicating that p and src point into the same array (p.base == src.base) seem like they should be straightforward to automatically infer, for a verification tool that includes support for pointer arithmetic. I would expect that the invariants saying that p should remain in-bounds for the array that it points to (i.e., keep(src <= p && p < src+num)) should also be pretty easy to automatically infer.

So with reasonable support for pointer arithmetic, the pointer arithmetic idiom should need 3 loop invariants; with slightly better support for loop indexing, the loop indexing idiom should need 2 loop invariants (slightly better support, to avoid the need to specify i <= num; it seems entirely plausible that this could be automatically inferred). That seems fairly close.